[Cryptech Core] rsa timing tests
Paul Selkirk
paul at psgd.org
Sat Jan 12 23:36:42 UTC 2019
Warning, this is a little long and a lot wonky.
After gathering all of this profiling data, I still wasn't
satisfied. Sampling the instruction pointer once every millisecond
doesn't give very granular results. So I decided to directly time each
of the component functions involved in RSA signing.
I compared the same three models as in the profiling tests - master
(baseline shipping code), keywrap (using Joachim's keywrap core), and
60mhz (fmc_clk_60mhz bitstream and firmware) - plus a combination of
the 60mhz bitstream and the keywrap core.
Additionally, I ran the same tests in a cli-test firmware, which
doesn't have a tasker, to measure the effects of hal_task_yield.
tl;dr: Fetching and unwrapping the key is a lot faster than profiling
indicated, which means that modexp dominates signing time.
fmc_clk_60mhz and the keywrap core combine for a nice speedup.
Having a "CRT core" is probably the next big goal.
The actual test methodology was very simple - call each function a
number of times (generally 1000), and measure the elapsed time
(reported here in milliseconds/call). Joachim pointed me at the DWT
(Data Watchpoint and Trace) unit on the ARM, and that may be useful in
the future, but this was sufficient for first-level testing.
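For concreteness, here's a minimal sketch of the sort of timing loop I
mean (not the actual test code, which also has to do per-function
setup); I'm assuming the STM32 HAL's HAL_GetTick() millisecond tick
counter, and func/n are placeholders:

    #include <stdint.h>

    extern uint32_t HAL_GetTick(void);  /* STM32 HAL 1 ms tick counter */

    /* Call func() n times, return the average milliseconds per call. */
    static float time_calls(void (*func)(void), const uint32_t n)
    {
        const uint32_t t0 = HAL_GetTick();

        for (uint32_t i = 0; i < n; ++i)
            func();

        return (float)(HAL_GetTick() - t0) / (float)n;
    }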
The general flow of a test is to load the key into the keystore,
generate some signatures, and delete the key. There's an optional CLI
argument to keep the key in the keystore between tests, which is handy
later.
A warning that the test code is pretty ugly, because I'm testing
low-level functions, and need to do a lot of setup that would normally
be done by the calling function. The test code is committed on a
branch, which should never be merged to master.
Anyway, on to the tests...
First, I used libhal/tests/parallel-signatures.py to measure signing
time from the user's perspective, then compared it to calling
hal_rpc_pkey_sign on the device, to determine how much time was spent
just talking to the device.
                                     master  keywrap    60mhz  60+wrap
parallel-signatures                 176.247  149.324  146.291  135.126
RPC, mux, libhal.py, test overhead   33.454   23.837   23.580   23.900
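(The second row is the difference between the host-side measurement
and the on-device hal_rpc_pkey_sign time reported below; for master,
176.247 - 142.793 = 33.454.)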
The first time a key is used, blinding factors are generated for it.
The blinding factors are cached separately from the key, so the same
factors are used even if the key is deleted and reloaded into the
keystore.
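For reference, the standard blinding construction (my gloss; I haven't
re-checked libhal's exact variant) caches a pair of factors derived
from a random r:

    vi = r^e mod n                          (applied before signing)
    vf = r^(-1) mod n                       (applied after)

    sig = ((m * vi)^d mod n) * vf mod n  =  m^d mod n

Generating a fresh pair is real bignum work, which is why the first
use of a key is so much slower; the cached pair is typically refreshed
by squaring both factors mod n, which I assume is the "blinding factor
permutation" that shows up in the profiling quote below.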
To measure blinding time, I generated one signature, deleted and
reloaded the key, and generated another signature. The difference is
entirely due to the blinding factors. To avoid having to constantly
reboot the device, I added a function and CLI command to clear the
blinding cache.
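Spelled out, with t1 the first signature after clearing the blinding
cache and t2 the signature after the delete/reload (which redoes the
precalc but reuses the cached blinding factors):

    t1 = blinding + precalc + sign
    t2 =            precalc + sign

    blinding = t1 - t2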
There are also some fields that are calculated the first time a key is
used, then stored with the key to speed up subsequent uses. Since they
are stored with the key, they are re-calculated if the key is deleted
and reloaded. This means the first signature takes substantially longer
than the second.
To measure precalc time, I first determined the time to generate one
signature (1 precalc + 1 sign), then the time to generate 100
(1 precalc + 100 sign), then applied a little algebra to tease out the
signing and precalc times.
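In equation form, with t(1) and t(100) the two measured times:

    t(1)   = precalc +   1 * sign
    t(100) = precalc + 100 * sign

    sign    = (t(100) - t(1)) / 99
    precalc = t(1) - sign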
             master  keywrap    60mhz  60+wrap
blinding     820.00   812.48   701.30   696.04
precalc     116.703  107.272   98.292   92.051
signing     142.797  125.448  122.708  111.229
This is where that optional CLI argument to keep the key in the
keystore comes in handy: rather than continually loading and deleting
the key, I could more directly measure signing time without the
precalc.
Happily, timing hal_rpc_pkey_sign in this way matches the calculations
above pretty closely.
                      master  keywrap    60mhz  60+wrap
hal_rpc_pkey_sign    142.793  125.440  122.711  111.226
Then I started drilling down into the functions called by
hal_rpc_pkey_sign - fetching the key, and generating the signature.
                                 master  keywrap    60mhz  60+wrap
hal_ks_fetch                     21.501    4.265   14.172    3.266
  hal_mkm_get_kek                 0.717    0.717    0.599    0.599
  hal_aes_keyunwrap              20.655    3.541   13.450    2.661
pkey_local_sign_rsa             121.120           107.855
  hal_rsa_private_key_from_der    0.627             0.627
  hal_rsa_decrypt               120.935           107.698
    rsa_crt                     117.010           103.712
      modexp2                    90.440            77.217
        hal_modexp2              79.931            66.707
(Note that the signing numbers are missing from the second and fourth
columns because they are the same as the first and third - these
models differ only in the way keys are unwrapped.)
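To read the bottom half of that table: assuming the textbook CRT
decomposition (which is what the function names suggest), rsa_crt
computes

    m1 = c^(d mod (p-1)) mod p
    m2 = c^(d mod (q-1)) mod q
    m  = m2 + q * (qInv * (m1 - m2) mod p)

The two half-size modexps are what modexp2 sets up and hal_modexp2
hands to the FPGA core; the surrounding subtract/multiply/reduce steps
are the libtfm-on-ARM arithmetic that a "CRT core" could absorb.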
Finally, I ran all the same tests in the cli-test project. Cli-test
doesn't have the hsm RPC interface, and doesn't use the tasker. So,
for the most part, the differences between this table and the previous
one are due to calling hal_task_yield in hal_io_wait. In the hsm
tests, only one RPC task was active, and the CLI task was in a sleep
state while the test ran, so task_yield amounted to walking a
5-element circular list, determining that no other task was runnable,
and returning to the test.
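For scale, here's a minimal sketch (made-up names, not the actual
tasker source) of what that degenerate yield amounts to:

    enum task_state { RUNNABLE, SLEEPING, WAITING };

    struct task {
        struct task *next;          /* circular list of tasks */
        enum task_state state;
        /* saved context, stack pointer, etc. omitted */
    };

    static struct task *cur;        /* currently running task */

    static void yield_sketch(void)
    {
        for (struct task *t = cur->next; t != cur; t = t->next)
            if (t->state == RUNNABLE)
                return;             /* would context-switch to t here */
        /* nothing else runnable: return straight to the caller */
    }

Cheap per call, but it sits inside hal_io_wait's polling loop, so it
runs constantly during FPGA I/O.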
cli-test:                        master  keywrap    60mhz  60+wrap
blinding                         742.88   735.30   622.88   619.24
precalc                         111.661  103.469   92.843   88.242
signing                         122.179  106.431   98.438   91.879
hal_rpc_pkey_sign               122.183  106.433   98.444   91.868
hal_ks_fetch                     19.920    4.172    9.631    3.057
  hal_mkm_get_kek                 0.703    0.703    0.580    0.580
  hal_aes_keyunwrap              19.186    3.437    9.021    2.447
pkey_local_sign_rsa             102.181            88.732
  hal_rsa_private_key_from_der    0.332             0.332
  hal_rsa_decrypt               101.842            88.394
    rsa_crt                      98.930            85.480
      modexp2                    88.114            74.698
        hal_modexp2              79.812            66.393
Initial comparison to profiling results:
> First, recall that signing is roughly evenly divided between fetching
> the key and generating the signature.
Surprisingly, not even close.
> So hal_ks_fetch is dominated by hal_aes_keyunwrap, which is almost
> entirely communication with the AES core. It's also by far the biggest
> contributor to time spent in hal_io_wait (and hal_io_wait is the
> biggest contributor to hal_io_read).
That's correct as far as it goes. hal_ks_fetch is dominated by
hal_aes_keyunwrap. And hal_mkm_get_kek is, in fact, pretty zippy,
because it's pure I/O, with no computation. But I think it's
hal_io_wait and task_yield that make key unwrapping appear to take so
much more time in profiling.
> Adding all this up, pkey_local_sign_rsa is 44.86s, of which 5.33s is
> ASN.1 decoding, only 0.68s is hal_modexp2, and a whopping 38.38s is
> libtfm (the rest of CRT, and blinding factor permutation; featuring
> fp_mulmod, unpack_fp, fp_sqrmod, fp_read_unsigned_bin, fp_mul,
> fp_unsigned_bin_size, in order of contribution).
Yeah, no, that's way off. Way off. Modexp dominates signing. Libtfm
does take some time, but not as much as it appeared: roughly 27 ms in
rsa_crt for the non-modexp portion of CRT (117.010 - 90.440), and
roughly 10 ms in modexp2 to unpack bignums into byte arrays and zero
them afterwards (90.440 - 79.931).
Some of the new numbers don't quite add up either, like
hal_rsa_private_key_from_der + hal_rsa_decrypt > pkey_local_sign_rsa
(0.627 + 120.935 > 121.120). I don't know if it's a test-and-branch
issue or what.
Speaking of hal_rsa_private_key_from_der, there's a significant
difference between the hsm and cli-test runs. This function is purely
software, with no FPGA interaction, so there's no hal_io_wait or
hal_task_yield happening here.
The best explanation I can come up with is that, in the cli-test
project, the stack is in the on-chip SRAM, while in the hsm project,
the cli stack is in slower off-chip SDRAM (because libtfm uses a lot
of stack, and the main SRAM is full of other stuff). Every other
function tested is mostly about FPGA interaction, so memory speed
is a negligible factor.
Anyway, that's what I've been working on. If you made it this far,
thanks for reading. If anything's unclear, just holler; I've been
soaking in this stuff for long enough that I may have forgotten to
mention things that are obvious to me.
paul