[Cryptech Tech] Status of RSA timing tests etc

Rob Austein sra at hactrn.net
Sat May 26 17:30:25 UTC 2018


Status report to get everybody onto the same page:

* Joachim has done a lot of work on making a faster AES core.

* Integrating that faster AES core into the alpha build does not make
  RSA signatures noticeably faster.  We don't know why.  Presumably this
  means that the current bottleneck is not really the AES core speed,
  but we don't know what the bottleneck is.  We're still happy to have
  the faster AES core, that's good stuff, it's just not currently
  solving this particular problem.

* One of the things we tried as an experiment was halving the number
  of cores (so down from 8 modexp and 4 AES to 4 modexp and 2 AES), in
  case this was some kind of artifact of pushing the limits of what we
  can fit on this FPGA.  Nope, no noticeable change.

* Overall signature throughput does go up as we add more clients until
  we hit the limit of how many cores are in the build, but it's not
  linear: two clients give about 1.6 times the throughput of one
  client in the numbers below;
  each additional client after that adds something, but a smaller
  something, to the overall throughput.  This again tends to suggest
  that some shared resource is the real bottleneck.

* Several of us are still suspicious of FMC I/O here, among other
  reasons because the ARM spends a lot of time in the FMC I/O code
  during the do_block() function which keeps showing up as the 800kg
  gorilla in the profiling results.

* Paul (re-)raised an interesting question about whether we could
  clock the FPGA from the 90MHz FMC bus and make this a synchronous
  interface, which at least in theory might significantly increase
  throughput on the bus.  Might at least be worth the experiment,
  particularly if it:

  * Lets us get rid of the double read in fmc_read() and
  
  * Lets us stop having to spin-wait polling the FMC NWAIT GPIO pin
    after every 32-bit word we transfer(!).

  Assuming this change is workable, if we were to combine it with the
  trivial hack to move the byte swapping to Verilog, we'd finally have
  an interface which looks relatively sane in terms of the number of
  ARM instructions involved in moving data between ARM and FPGA.

  Paul and I don't know enough about the FMC bus to know whether this
  is plausible.  Pavel?  You're our FMC expert.  Opinions?

* I noticed a few more minor things we could do with the current FMC
  I/O code which might squeeze a few more cycles out of it, will try
  them at some point, but as long as we have the NWAIT poll after
  every word I would not expect this to make much difference.

Throughput numbers from current tests, with profiling disabled (would
be significantly slower with profiling).  This is for two different
bitstreams, identical other than the numbers of aes_speed and modexpa7
cores.  "n" is the number of signatures per test, "c" is the number of
clients in that test.

# Testing with two aes_speed cores and four modexpa7 cores
rsa_2048 sigs/sec 5.67867374529 secs/sig 0:00:00.176097 mean 0:00:00.175575 (n 1000, c 1 t0 2018-05-25 00:48:08.021713 t1 2018-05-25 00:51:04.119169)
rsa_2048 sigs/sec 8.91530052613 secs/sig 0:00:00.112166 mean 0:00:00.223732 (n 1000, c 2 t0 2018-05-25 00:45:14.212604 t1 2018-05-25 00:47:06.379322)

# Testing with four aes_speed cores and eight modexpa7 cores
rsa_2048 sigs/sec 5.67869635076 secs/sig 0:00:00.176096 mean 0:00:00.175576 (n 1000, c 1 t0 2018-05-25 00:58:16.127982 t1 2018-05-25 01:01:12.224737)
rsa_2048 sigs/sec 8.91534805697 secs/sig 0:00:00.112166 mean 0:00:00.223737 (n 1000, c 2 t0 2018-05-25 00:54:39.331429 t1 2018-05-25 00:56:31.497549)
rsa_2048 sigs/sec 10.5404984184 secs/sig 0:00:00.094872 mean 0:00:00.283858 (n 1000, c 3 t0 2018-05-25 01:01:49.373243 t1 2018-05-25 01:03:24.245417)
rsa_2048 sigs/sec 10.9159940249 secs/sig 0:00:00.091608 mean 0:00:00.365467 (n 1000, c 4 t0 2018-05-25 01:04:11.440790 t1 2018-05-25 01:05:43.049488)
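
To make the scaling easier to see, here is a small sketch that computes
speedup and per-client efficiency relative to one client, using only the
sigs/sec and mean values from the eight-modexpa7 run above.  The names
`results` and `scaling` are made up for this illustration and are not
part of the test scripts.  The "in_flight" column is throughput times
mean latency, which should come out close to the client count if the
clients are kept continuously busy; that is an assumption about how the
test harness behaves, not something stated above.

```python
# Throughput (sigs/sec) and mean per-signature latency (seconds), copied
# from the "four aes_speed cores and eight modexpa7 cores" run above,
# keyed by number of clients.
results = {
    1: (5.67869635076, 0.175576),
    2: (8.91534805697, 0.223737),
    3: (10.5404984184, 0.283858),
    4: (10.9159940249, 0.365467),
}


def scaling(results):
    """Return (clients, speedup, efficiency, in_flight) rows.

    speedup    = throughput relative to the one-client case
    efficiency = speedup divided by the number of clients
    in_flight  = throughput * mean latency (roughly the number of
                 signatures being worked on at any moment)
    """
    base_rate = results[1][0]
    rows = []
    for clients in sorted(results):
        rate, mean_latency = results[clients]
        speedup = rate / base_rate
        efficiency = speedup / clients
        in_flight = rate * mean_latency
        rows.append((clients, speedup, efficiency, in_flight))
    return rows


if __name__ == "__main__":
    for clients, speedup, efficiency, in_flight in scaling(results):
        print("c %d speedup %.2f efficiency %.2f in_flight %.2f"
              % (clients, speedup, efficiency, in_flight))
```

With these numbers the speedup comes out at roughly 1.57, 1.86 and 1.92
for two, three and four clients, while mean latency per signature keeps
growing, which is what one would expect if requests are queueing on
something shared rather than running in parallel.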

