[Cryptech Tech] Happier RSA timing numbers

Rob Austein sra at hactrn.net
Thu May 17 13:40:27 UTC 2018


Sadly, this turned out to be a false positive.  More precisely, the
measurements are probably accurate, but the results aren't useful.

Two eggs in the face for the price of one here:

1) Yes, of course the test script included a validation check of the
   signature result.  Unfortunately, that check was itself broken, so
   it didn't notice that the purported signature was nonsense.

2) Even without (1), the approach of bypassing the keystore would not
   have worked in its present form, because while Garner's Algorithm
   is not very complicated, it's not just modexp, so doing the modexp
   without touching the keystore would not have prevented us from
   needing to read from the keystore for the rest of the calculation.
   This should have been obvious from the start.

So it's going through all the motions, and given the constant time
design it's probably taking the same amount of time that it would if
this approach worked, but the some of the input is garbage.  GIGO.

Feh.

Going back to the original problem: profiling says that the main
bottleneck is the slow retrieval from the keystore: this appears to
take about twice as long as modexp, and the bulk of that time is spent
in hal_aes_keyunwrap(), specifically in the function which runs a
block through the AES core.  So this looks like an AES throughput
problem, at least given the way we're currently using the AES core.

It's possible that we could speed up the keystore a bit by using a
different wrapping algorithm, as Peter Gutmann has been suggesting all
along, although perhaps for reasons other than speed.  Eg, we could
use AES in one of the classic block cipher modes instead of AES
keywrap: presumably we'd need a MAC too, but at least in theory that
could run in parallel with the encryption, although the control code
to manage that might be a bit of nasty.

At some point, just for confirmation, I (or somebody) should generate
a test build with the AES keywrap code disabled (so, obviously, not to
be used for any purpose other than this test) so that we can find out
how fast the rest of the RSA signature path runs run without the
keywrap overhead, where the remaining bottlenecks are, etc.

Other alternatives, both of which boil down to tossing some part of
the problem over the wall to the Verilog folks:

* Build a real RSA CRT core, which would consist of two modexp cores
  whose outputs fed into the other operations needed for Garner's
  algorithm.  See the rsa_crt() function in sw/libhal/rsa.c for what
  this would need to do.

* Build a real AES keywrap core, and hope it runs significantly faster
  than the current approach of software driving AES ECB.

Some useful work did come out of all this, in particular I think we
understand the needs of the internal core management API better than
we did before I fell down this rabbit hole, and that code will end up
on the master branch eventually.  But the current approach isn't going
to deliver the hoped-for RSA speed.  Sorry for raising false hopes.


More information about the Tech mailing list