[Cryptech Core] rsa timing tests
Paul Selkirk
paul at psgd.org
Sat Jan 12 23:36:42 UTC 2019
Warning, this is a little long and a lot wonky.
After gathering all of this profiling data, I still wasn't
satisfied. Sampling the instruction pointer once every millisecond
doesn't give very granular results. So I decided to directly time each
of the component functions involved in RSA signing.
I compared the same three models as in the profiling tests - master
(baseline shipping code), keywrap (using Joachim's keywrap core), and
60mhz (fmc_clk_60mhz bitstream and firmware) - plus a combination of
the 60mhz bitstream and the keywrap core.
Additionally, I ran the same tests in a cli-test firmware, which
doesn't have a tasker, to measure the effects of hal_task_yield.
tl;dr: Fetching and unwrapping the key is a lot faster than profiling
indicated, which means that modexp dominates signing time.
fmc_clk_60mhz and the keywrap core combine for a nice speedup.
Having a "CRT core" is probably the next big goal.
The actual test methodology was very simple - call each function a
number of times (generally 1000), and measure the elapsed time
(reported here in milliseconds/call). Joachim pointed me at the DWT
(Data Watchpoint and Trace) unit on the ARM, and that may be useful in
the future, but this was sufficient for first-level testing.
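For concreteness, here's a minimal sketch of the sort of timing loop I
mean (not the actual test code, which also has to do per-function
setup); I'm assuming the STM32 HAL's HAL_GetTick() millisecond tick
counter, and func/n are placeholders:

    #include <stdint.h>

    extern uint32_t HAL_GetTick(void);  /* STM32 HAL 1 ms tick counter */

    /* Call func() n times, return the average milliseconds per call. */
    static float time_calls(void (*func)(void), const uint32_t n)
    {
        const uint32_t t0 = HAL_GetTick();

        for (uint32_t i = 0; i < n; ++i)
            func();

        return (float)(HAL_GetTick() - t0) / (float)n;
    }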
The general flow of a test is to load the key into the keystore,
generate some signatures, and delete the key. There's an optional CLI
argument to keep the key in the keystore between tests, which is handy
later.
A warning that the test code is pretty ugly, because I'm testing
low-level functions, and need to do a lot of setup that would normally
be done by the calling function. The test code is committed on a
branch, which should never be merged to master.
Anyway, on to the tests...
First, I used libhal/tests/parallel-signatures.py to measure signing
time from the user's perspective, then compared it to calling
hal_rpc_pkey_sign on the device, to determine how much time was spent
just talking to the device.
                                     master  keywrap    60mhz  60+wrap
parallel-signatures                 176.247  149.324  146.291  135.126
RPC, mux, libhal.py, test overhead   33.454   23.837   23.580   23.900
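(The second row is the difference between the host-side measurement
and the on-device hal_rpc_pkey_sign time reported below; for master,
176.247 - 142.793 = 33.454.)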
The first time a key is used, blinding factors are generated for it.
The blinding factors are cached separately from the key, so the same
factors are used even if the key is deleted and reloaded into the
keystore.
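For reference, the standard blinding construction (my gloss; I haven't
re-checked libhal's exact variant) caches a pair of factors derived
from a random r:

    vi = r^e mod n                          (applied before signing)
    vf = r^(-1) mod n                       (applied after)

    sig = ((m * vi)^d mod n) * vf mod n  =  m^d mod n

Generating a fresh pair is real bignum work, which is why the first
use of a key is so much slower; the cached pair is typically refreshed
by squaring both factors mod n, which I assume is the "blinding factor
permutation" that shows up in the profiling quote below.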
To measure blinding time, I generated one signature, deleted and
reloaded the key, and generated another signature. The difference is
entirely due to the blinding factors. To avoid having to constantly
reboot the device, I added a function and CLI command to clear the
blinding cache.
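Spelled out, with t1 the first signature after clearing the blinding
cache and t2 the signature after the delete/reload (which redoes the
precalc but reuses the cached blinding factors):

    t1 = blinding + precalc + sign
    t2 =            precalc + sign

    blinding = t1 - t2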
There are also some fields that are calculated the first time a key is
used, then stored with the key to speed up subsequent uses. Since they
are stored with the key, they are re-calculated if the key is deleted
and reloaded. This means the first signature takes substantially longer
than the second.
To measure precalc time, I first determined the time to generate one
signature (1 precalc + 1 sign), then the time to generate 100
(1 precalc + 100 sign), then applied a little algebra to tease out the
signing and precalc times.
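In equation form, with t(1) and t(100) the two measured times:

    t(1)   = precalc +   1 * sign
    t(100) = precalc + 100 * sign

    sign    = (t(100) - t(1)) / 99
    precalc = t(1) - sign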
             master  keywrap    60mhz  60+wrap
blinding     820.00   812.48   701.30   696.04
precalc     116.703  107.272   98.292   92.051
signing     142.797  125.448  122.708  111.229
This is where that optional CLI argument to keep the key in the
keystore comes in handy: rather than continually loading and deleting
the key, I could more directly measure signing time without the
precalc.
Happily, timing hal_rpc_pkey_sign in this way matches the calculations
above pretty closely.
                      master  keywrap    60mhz  60+wrap
hal_rpc_pkey_sign    142.793  125.440  122.711  111.226
Then I started drilling down into the functions called by
hal_rpc_pkey_sign - fetching the key, and generating the signature.
                                 master  keywrap    60mhz  60+wrap
hal_ks_fetch                     21.501    4.265   14.172    3.266
  hal_mkm_get_kek                 0.717    0.717    0.599    0.599
  hal_aes_keyunwrap              20.655    3.541   13.450    2.661
pkey_local_sign_rsa             121.120           107.855
  hal_rsa_private_key_from_der    0.627             0.627
  hal_rsa_decrypt               120.935           107.698
    rsa_crt                     117.010           103.712
      modexp2                    90.440            77.217
        hal_modexp2              79.931            66.707
(Note that the signing numbers are missing from the second and fourth
columns because they are the same as the first and third - these
models differ only in the way keys are unwrapped.)
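To read the bottom half of that table: assuming the textbook CRT
decomposition (which is what the function names suggest), rsa_crt
computes

    m1 = c^(d mod (p-1)) mod p
    m2 = c^(d mod (q-1)) mod q
    m  = m2 + q * (qInv * (m1 - m2) mod p)

The two half-size modexps are what modexp2 sets up and hal_modexp2
hands to the FPGA core; the surrounding subtract/multiply/reduce steps
are the libtfm-on-ARM arithmetic that a "CRT core" could absorb.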
Finally, I ran all the same tests in the cli-test project. Cli-test
doesn't have the hsm RPC interface, and doesn't use the tasker. So,
for the most part, the differences between this table and the previous
one are due to calling hal_task_yield in hal_io_wait. In the hsm
tests, only one RPC task was active, and the CLI task was in a sleep
state while the test ran, so task_yield amounted to walking a
5-element circular list, determining that no other task was runnable,
and returning to the test.
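For scale, here's a minimal sketch (made-up names, not the actual
tasker source) of what that degenerate yield amounts to:

    enum task_state { RUNNABLE, SLEEPING, WAITING };

    struct task {
        struct task *next;          /* circular list of tasks */
        enum task_state state;
        /* saved context, stack pointer, etc. omitted */
    };

    static struct task *cur;        /* currently running task */

    static void yield_sketch(void)
    {
        for (struct task *t = cur->next; t != cur; t = t->next)
            if (t->state == RUNNABLE)
                return;             /* would context-switch to t here */
        /* nothing else runnable: return straight to the caller */
    }

Cheap per call, but it sits inside hal_io_wait's polling loop, so it
runs constantly during FPGA I/O.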
cli-test:                        master  keywrap    60mhz  60+wrap
blinding                         742.88   735.30   622.88   619.24
precalc                         111.661  103.469   92.843   88.242
signing                         122.179  106.431   98.438   91.879
hal_rpc_pkey_sign               122.183  106.433   98.444   91.868
hal_ks_fetch                     19.920    4.172    9.631    3.057
  hal_mkm_get_kek                 0.703    0.703    0.580    0.580
  hal_aes_keyunwrap              19.186    3.437    9.021    2.447
pkey_local_sign_rsa             102.181            88.732
  hal_rsa_private_key_from_der    0.332             0.332
  hal_rsa_decrypt               101.842            88.394
    rsa_crt                      98.930            85.480
      modexp2                    88.114            74.698
        hal_modexp2              79.812            66.393
Initial comparison to profiling results:
> First, recall that signing is roughly evenly divided between fetching
> the key and generating the signature.
Surprisingly, not even close.
> So hal_ks_fetch is dominated by hal_aes_keyunwrap, which is almost
> entirely communication with the AES core. It's also by far the biggest
> contributor to time spent in hal_io_wait (and hal_io_wait is the
> biggest contributor to hal_io_read).
That's correct as far as it goes. hal_ks_fetch is dominated by
hal_aes_keyunwrap. And hal_mkm_get_kek is, in fact, pretty zippy,
because it's pure I/O, with no computation. But I think it's
hal_io_wait and task_yield that make key unwrapping appear to take so
much more time in profiling.
> Adding all this up, pkey_local_sign_rsa is 44.86s, of which 5.33s is
> ASN.1 decoding, only 0.68s is hal_modexp2, and a whopping 38.38s is
> libtfm (the rest of CRT, and blinding factor permutation; featuring
> fp_mulmod, unpack_fp, fp_sqrmod, fp_read_unsigned_bin, fp_mul,
> fp_unsigned_bin_size, in order of contribution).
Yeah, no, that's way off. Way off. Modexp dominates signing. Libtfm
does take some time, but not as much as it appeared: roughly 27 ms in
rsa_crt for the non-modexp portion of CRT (117.010 - 90.440), and
roughly 10 ms in modexp2 to unpack bignums into byte arrays and zero
them afterwards (90.440 - 79.931).
Some of the new numbers don't quite add up either, like
hal_rsa_private_key_from_der + hal_rsa_decrypt > pkey_local_sign_rsa
(0.627 + 120.935 > 121.120). I don't know if it's a test-and-branch
issue or what.
Speaking of hal_rsa_private_key_from_der, there's a significant
difference between the hsm and cli-test runs. This function is purely
software, with no FPGA interaction, so there's no hal_io_wait or
hal_task_yield happening here.
The best explanation I can come up with is that, in the cli-test
project, the stack is in the on-chip SRAM, while in the hsm project,
the cli stack is in slower off-chip SDRAM (because libtfm uses a lot
of stack, and the main SRAM is full of other stuff). Every other
function tested is mostly about FPGA interaction, so memory speed
is a negligible factor.
Anyway, that's what I've been working on. If you made it this far,
thanks for reading. If anything's unclear, just holler; I've been
soaking in this stuff for long enough that I may have forgotten to
mention things that are obvious to me.
paul