[Cryptech Tech] Happier RSA timing numbers

Rob Austein sra at hactrn.net
Fri May 18 16:05:39 UTC 2018


On Fri, 18 May 2018 02:24:12 -0400, Joachim Strömbergson wrote:
...
> If you could do more analysis on where the bottleneck in the
> hal_aes_keyunwrap() really is (AES processing, transfer of data, cipher
> mode SW processing) we can see what we should change, add.

I've attached the log from a profiling run.  This one is about a month
old; I can generate a new one at some point if needed, but I don't
think anything has changed in the AES keywrap code since then.  You
can generate these yourself these days if you have an ST-LINK: Paul
merged the profiling code into the master branch in sw/stm32.

Anyway: this is gprof output, which is relatively self-explanatory,
but in case you haven't encountered it before, you will probably want
to start with the first dozen or so entries in the call graph (which
is not the first section).  To my mind, the most interesting entries
in the call graph are:

[5] which shows that we're spending nearly twice as much time fetching
    from the keystore as we are doing RSA signature per se;

[6] which shows that the keystore fetch is almost entirely AES
    keyunwrap and waiting for another pseudothread to release the
    keystore (ie, more AES keyunwrap, just for another client);

[7] which shows that AES keyunwrap in turn is mostly about the AES
    core;

[12] and [13] which show some weird stuff I hadn't previously noticed
    about RSA blinding factors (more on this below).
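The AES-core dominance in [7] is what you'd expect from the shape of
the keywrap algorithm itself.  Assuming hal_aes_keyunwrap() follows
the usual RFC 3394-style schedule (a hypothetical model, not the
libhal code), the core gets hit once per 64-bit block per round:

```python
def keyunwrap_aes_calls(wrapped_len_bytes: int) -> int:
    """Rough count of AES core invocations for an RFC 3394-style unwrap.

    The wrapped blob is split into 64-bit blocks; the unwrap loop runs
    6 rounds over the n data blocks, one AES-decrypt each, so the core
    is invoked 6*n times per keystore fetch.
    """
    n = wrapped_len_bytes // 8 - 1   # one 64-bit block is the integrity check value
    return 6 * n
```

So even a modest wrapped blob means hundreds of round trips to the
core per fetch, each paying the FMC transfer cost both ways.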

I think that [7] answers your question to the extent that the current
profiling code is going to be able to do so: out of the 258 total
seconds we spent fetching from the keystore, we spent about 20 on FMC
write, 18 on FMC read, and 98 twiddling our thumbs waiting for the AES
core.

You will notice that the numbers don't quite add up.  Welcome to
profiling, particularly with threading, even non-preemptive threading.
Even at a philosophical level I don't know how to account for time
spent waiting for a lock when the holder is contending for the same
scarce resource we want: the lock itself is cheap, but having all four
clients in effect waiting for a single core makes the numbers come out
funny.

I don't recall how many AES cores were in the bitstream for this
particular test.  I seem to recall having tried bumping the number of
AES cores without noticeable effect, but would have to re-run the
experiment if we wanted any real details.  In theory this could be
caused by holding the lock for too wide a scope, so I'd want to check
whether we're actually holding the keystore lock while unwrapping (I
ran out of time while composing this message, so no time to read the
code right now to answer that question, sorry).
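If it turns out we are holding the lock across the unwrap, the fix
would be to narrow the scope: copy the wrapped blob out under the
lock, release, then unwrap.  A toy sketch of the idea (hypothetical
names, standing in for the real keystore and hal_aes_keyunwrap()):

```python
import threading

keystore_lock = threading.Lock()

def unwrap(wrapped: bytes) -> bytes:
    # stand-in for hal_aes_keyunwrap(); the real code talks to the AES core
    return wrapped[::-1]

def fetch_key(keystore: dict, name: str) -> bytes:
    """Narrow-scope fetch: hold the keystore lock only while copying
    the wrapped blob out, so other pseudothreads aren't serialized
    behind our (slow) AES keyunwrap."""
    with keystore_lock:
        wrapped = bytes(keystore[name])   # cheap: just a memory copy
    return unwrap(wrapped)                # expensive: AES core, lock released
```

That wouldn't make the unwrap itself any faster, but it would turn the
"waiting for another pseudothread to release the keystore" time in [6]
into ordinary waiting-for-the-core time, which extra cores could then
actually absorb.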

The blinding factor stuff in [12] and [13] is unexpected: if I'm
reading this correctly, it says that modular inversion during blinding
factor creation is more expensive than signature.  I'm not sure I
believe this, but this does suggest another optimization to pursue.
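If the inversion really does dominate blinding factor creation, one
standard dodge is to amortize it: pay for one modexp and one inversion
up front, then refresh the pair by squaring, which keeps it consistent
without any further inversions.  A sketch with toy integers (not
libhal's bignum code; `pow(r, -1, n)` needs Python 3.8+):

```python
import secrets
from math import gcd

def new_blinding_pair(e: int, n: int) -> tuple[int, int]:
    """Fresh RSA blinding pair: costs one modexp plus one modular
    inversion, the expensive step flagged in [12]/[13]."""
    while True:
        r = secrets.randbelow(n - 2) + 2
        if gcd(r, n) == 1:
            break
    return pow(r, e, n), pow(r, -1, n)    # (r^e mod n, r^-1 mod n)

def refresh(rf: int, rf_inv: int, n: int) -> tuple[int, int]:
    """Cheap refresh by squaring: (r^2)^e = (r^e)^2 and (r^2)^-1 =
    (r^-1)^2, so the pair stays consistent with no new inversion."""
    return (rf * rf) % n, (rf_inv * rf_inv) % n
```

A blinded signature is then `s = (pow(m * rf % n, d, n) * rf_inv) % n`,
which equals `pow(m, d, n)` for both the fresh and the refreshed pair.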


-------------- next part --------------
A non-text attachment was scrubbed...
Name: profiler.out
Type: application/octet-stream
Size: 165587 bytes
Desc: not available
URL: <https://lists.cryptech.is/archives/tech/attachments/20180518/139c1473/attachment-0001.obj>

