[Cryptech Tech] Status of RSA timing tests etc
Rob Austein
sra at hactrn.net
Sat May 26 17:30:25 UTC 2018
Status report to get everybody onto the same page:
* Joachim has done a lot of work on making a faster AES core.
* Integrating that faster AES core into the alpha build does not make RSA
signatures noticeably faster. We don't know why. Presumably this
means that the current bottleneck is not really the AES core speed,
but we don't know what the bottleneck is. We're still happy to have
the faster AES core, that's good stuff, it's just not currently
solving this particular problem.
* One of the things we tried as an experiment was halving the number
of cores (so down from 8 modexp and 4 AES to 4 modexp and 2 AES), in
case this was some kind of artifact of pushing the limits of what we
can fit on this FPGA. Nope, no noticeable change.
* Overall signature throughput does go up as we add more clients until
we hit the limit of how many cores are in the build, but it's not
linear: two clients is nearly twice the throughput of one client;
each additional client after that adds something, but a smaller
something, to the overall throughput. This again tends to suggest
that some shared resource is the real bottleneck.
* Several of us are still suspicious of FMC I/O here, among other
reasons because the ARM spends a lot of time in the FMC I/O code
during the do_block() function which keeps showing up as the 800kg
gorilla in the profiling results.
* Paul (re-)raised an interesting question about whether we could
clock the FPGA from the 90MHz FMC bus and make this a synchronous
interface, which at least in theory might significantly increase
throughput on the bus. Might at least be worth the experiment,
particularly if it:
* Lets us get rid of the double read in fmc_read() and
* Lets us stop having to spin wait polling the FMC NWAIT GPIO pin
after every 32 bit word we transfer(!).
Assuming this change is workable, if we were to combine it with the
trivial hack to move the byte swapping to Verilog, we'd finally have
an interface which looks relatively sane in terms of the number of
ARM instructions involved in moving data between ARM and FPGA.
Paul and I don't know enough about the FMC bus to know whether this
is plausible. Pavel? You're our FMC expert. Opinions?
* I noticed a few more minor things we could do with the current FMC
I/O code which might squeeze a few more cycles out of it, will try
them at some point, but as long as we have the NWAIT poll after
every word I would not expect this to make much difference.
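
For readers who have not looked at the I/O path, here is a minimal sketch of
what the byte swapping mentioned above amounts to per transfer. It assumes the
swap is a plain byte-order reversal of each 32 bit word; the message does not
spell out the exact form, and swap32() and swap_words() are hypothetical names
for illustration, not functions from the Cryptech firmware (which is C, not
Python).

```python
import struct


def swap32(word: int) -> int:
    """Reverse the byte order of a single 32-bit word (assumed swap form)."""
    return struct.unpack("<I", struct.pack(">I", word & 0xFFFFFFFF))[0]


def swap_words(data: bytes) -> bytes:
    """Reverse the byte order of every 32-bit word in a buffer."""
    if len(data) % 4 != 0:
        raise ValueError("buffer length must be a multiple of 4 bytes")
    count = len(data) // 4
    # Read the buffer as big-endian words, write it back as little-endian.
    words = struct.unpack(f">{count}I", data)
    return struct.pack(f"<{count}I", *words)


if __name__ == "__main__":
    print(hex(swap32(0x11223344)))
    print(swap_words(bytes.fromhex("1122334455667788")).hex())
```

Doing this in software costs a few instructions on every word moved; doing it
in Verilog would presumably reduce it to wiring on the FPGA side.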
Throughput numbers from current tests, with profiling disabled (would
be significantly slower with profiling). This is for two different
bitstreams, identical other than the numbers of aes_speed and modexpa7
cores. "n" is the number of signatures per test, "c" is the number of
clients in that test.
# Testing with two aes_speed cores and four modexpa7 cores
rsa_2048 sigs/sec 5.67867374529 secs/sig 0:00:00.176097 mean 0:00:00.175575 (n 1000, c 1 t0 2018-05-25 00:48:08.021713 t1 2018-05-25 00:51:04.119169)
rsa_2048 sigs/sec 8.91530052613 secs/sig 0:00:00.112166 mean 0:00:00.223732 (n 1000, c 2 t0 2018-05-25 00:45:14.212604 t1 2018-05-25 00:47:06.379322)
# Testing with four aes_speed cores and eight modexpa7 cores
rsa_2048 sigs/sec 5.67869635076 secs/sig 0:00:00.176096 mean 0:00:00.175576 (n 1000, c 1 t0 2018-05-25 00:58:16.127982 t1 2018-05-25 01:01:12.224737)
rsa_2048 sigs/sec 8.91534805697 secs/sig 0:00:00.112166 mean 0:00:00.223737 (n 1000, c 2 t0 2018-05-25 00:54:39.331429 t1 2018-05-25 00:56:31.497549)
rsa_2048 sigs/sec 10.5404984184 secs/sig 0:00:00.094872 mean 0:00:00.283858 (n 1000, c 3 t0 2018-05-25 01:01:49.373243 t1 2018-05-25 01:03:24.245417)
rsa_2048 sigs/sec 10.9159940249 secs/sig 0:00:00.091608 mean 0:00:00.365467 (n 1000, c 4 t0 2018-05-25 01:04:11.440790 t1 2018-05-25 01:05:43.049488)
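
To make the non-linear scaling easier to see, here is a small sketch that
turns the figures above into speedup and per-client efficiency relative to the
single-client run. The throughput values and timestamps are copied from the
results above; the function names are hypothetical, and the check that
sigs/sec equals n divided by (t1 - t0) is an inference from the numbers rather
than something taken from the test script.

```python
from datetime import datetime

# sigs/sec from the four aes_speed / eight modexpa7 build, keyed by client count "c".
SIGS_PER_SEC = {
    1: 5.67869635076,
    2: 8.91534805697,
    3: 10.5404984184,
    4: 10.9159940249,
}


def scaling(results):
    """Return (clients, speedup, efficiency) tuples relative to one client."""
    base = results[1]
    rows = []
    for clients in sorted(results):
        speedup = results[clients] / base
        rows.append((clients, speedup, speedup / clients))
    return rows


def throughput_from_timestamps(n, t0, t1):
    """Recompute sigs/sec as n / (t1 - t0); assumed to be how it was derived."""
    fmt = "%Y-%m-%d %H:%M:%S.%f"
    elapsed = datetime.strptime(t1, fmt) - datetime.strptime(t0, fmt)
    return n / elapsed.total_seconds()


if __name__ == "__main__":
    for clients, speedup, efficiency in scaling(SIGS_PER_SEC):
        print(f"c={clients}  speedup={speedup:.2f}x  efficiency={efficiency:.0%}")
    # Cross-check against the first c=1 line (two aes_speed, four modexpa7).
    print(throughput_from_timestamps(
        1000, "2018-05-25 00:48:08.021713", "2018-05-25 00:51:04.119169"))
```

On these numbers the fourth client adds only a few percent over the third,
which fits the suspicion that some shared resource is the limit.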