[Cryptech Tech] ModExpNG timing results

Paul Selkirk paul at psgd.org
Wed Feb 19 02:26:12 UTC 2020


Warning, this is a little long and a lot wonky.

I did some timing tests to compare the RSA signing performance of
Pavel's "old" ModExpA7 core (in the dual-core configuration we use for
CRT) versus his "new" ModExpNG core.

First I built a bitstream with both cores, at 60MHz bus and core speed.
Then I hacked rsa.c and modexp.c to support both cores, and added a
function hal_modexp_use_modexpng() and CLI command `rsa modexpng on/off`
to switch between them. Finally I added a bunch of instrumentation,
using the ARM's cycle-counting facility, to get fine-grained execution
timing.
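For anyone who wants to reproduce this kind of measurement, here is a minimal sketch of cycle-count timing on a Cortex-M4. The register addresses are the architectural DWT/DEMCR ones; the function names are illustrative only and are not the ones used on the branch, and the CPU clock is passed in as a parameter rather than assumed.

```c
#include <assert.h>
#include <stdint.h>

/* Architectural ARMv7-M debug registers (only valid on the target). */
#define DEMCR       (*(volatile uint32_t *)(uintptr_t)0xE000EDFCu)
#define DWT_CTRL    (*(volatile uint32_t *)(uintptr_t)0xE0001000u)
#define DWT_CYCCNT  (*(volatile uint32_t *)(uintptr_t)0xE0001004u)

#define DEMCR_TRCENA        (1u << 24)  /* enable the DWT unit */
#define DWT_CTRL_CYCCNTENA  (1u << 0)   /* enable the cycle counter */

/* Hypothetical helper: turn on the free-running 32-bit cycle counter. */
static inline void cyccnt_enable(void)
{
    DEMCR |= DEMCR_TRCENA;
    DWT_CYCCNT = 0;
    DWT_CTRL |= DWT_CTRL_CYCCNTENA;
}

/* Hypothetical helper: sample the counter. */
static inline uint32_t cyccnt_read(void)
{
    return DWT_CYCCNT;
}

/* Unsigned subtraction gives the right answer across one wraparound. */
static inline uint32_t cycles_elapsed(uint32_t start, uint32_t end)
{
    return end - start;
}

/* Convert a cycle count to milliseconds for a given CPU clock in Hz. */
static inline double cycles_to_ms(uint32_t cycles, uint32_t cpu_hz)
{
    assert(cpu_hz != 0);
    return (double)cycles * 1000.0 / (double)cpu_hz;
}
```

On the target, the pattern would be to call cyccnt_enable() once, then bracket the code of interest with two cyccnt_read() calls and pass the difference to cycles_to_ms(). Since the counter is 32 bits wide, a single interval must be shorter than 2^32 cycles.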

This is all committed on branch 'modexpng' in the following repos:
  core/platform/common
  core/platform/alpha
  sw/libhal
  sw/stm32

The following tables are side-by-side comparisons of the two cores,
signing the same message with the same 2048-bit key. All times are in
milliseconds.

The first table is for bulk signing, i.e. the key has been used before,
so the blinding factors have been calculated, as well as the modulus
coefficient and Montgomery factor that the core uses to speed up its
calculations.

The first line is the result of running
libhal/tests/parallel-signatures.py against the board. (For this, I took
the median of 1000 signatures, since the mean would include the first
signature, which would include the blinding factor and precalc times.)
The third line is the result of calling hal_rpc_pkey_sign from the CLI,
and is the mean of 1000 or so runs, after priming it once or twice.
The second line is the difference between the two, i.e. the overhead of
the serial RPC mechanism.
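To illustrate why the median is the safer statistic when one sample includes the one-time setup cost, here is a small sketch. The function names are illustrative, and any sample values fed to it are up to the caller; nothing here is taken from parallel-signatures.py.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* qsort comparator for doubles, ascending. */
static int cmp_double(const void *a, const void *b)
{
    double x = *(const double *)a;
    double y = *(const double *)b;
    return (x > y) - (x < y);
}

/* Arithmetic mean of n samples (n must be nonzero). */
double sample_mean(const double *v, size_t n)
{
    assert(n > 0);
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += v[i];
    return sum / (double)n;
}

/* Median of n samples; sorts a private copy so the input is untouched. */
double sample_median(const double *v, size_t n)
{
    assert(n > 0);
    double *tmp = malloc(n * sizeof *tmp);
    assert(tmp != NULL);
    memcpy(tmp, v, n * sizeof *tmp);
    qsort(tmp, n, sizeof *tmp, cmp_double);
    double m = (n % 2) ? tmp[n / 2]
                       : (tmp[n / 2 - 1] + tmp[n / 2]) / 2.0;
    free(tmp);
    return m;
}
```

With one slow first sample followed by many fast ones, the mean is pulled toward the outlier while the median stays at the steady-state value.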

Note that this version does blinding factor mutation in software for
both cores. The modexpng core does this for free, but it would require
some code changes to read the mutated factors out from the core, so this
is an area where modexpng could end up even faster.

Note also that message blinding/unblinding is done in software for
modexpa7, but in hardware for modexpng.
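As a reminder of what blinding, unblinding and mutation involve arithmetically, here is a toy sketch with a tiny textbook key (n = 61 * 53). It is not the libhal code: the struct and function names are made up, the convention bf = r^e, ubf = r^-1 is one of several equivalent ones, and mutation is assumed here to be squaring both factors.

```c
#include <assert.h>
#include <stdint.h>

/* Toy modular multiply; only safe while n fits in 32 bits. */
static uint64_t bl_mulmod(uint64_t a, uint64_t b, uint64_t n)
{
    return (a * b) % n;
}

/* Square-and-multiply modular exponentiation. */
static uint64_t bl_powmod(uint64_t base, uint64_t exp, uint64_t n)
{
    uint64_t r = 1 % n;
    base %= n;
    while (exp) {
        if (exp & 1)
            r = bl_mulmod(r, base, n);
        base = bl_mulmod(base, base, n);
        exp >>= 1;
    }
    return r;
}

/* Modular inverse by extended Euclid; requires gcd(a, n) == 1. */
static uint64_t bl_invmod(uint64_t a, uint64_t n)
{
    int64_t t = 0, newt = 1;
    int64_t r = (int64_t)n, newr = (int64_t)(a % n);
    while (newr != 0) {
        int64_t q = r / newr;
        int64_t tmp = t - q * newt; t = newt; newt = tmp;
        tmp = r - q * newr; r = newr; newr = tmp;
    }
    assert(r == 1);
    return (uint64_t)(t < 0 ? t + (int64_t)n : t);
}

/* Hypothetical toy key: modulus, exponents, and the blinding pair. */
struct toy_key { uint64_t n, e, d, bf, ubf; };

/* Derive the blinding pair from a random r coprime to n. */
void toy_blinding_init(struct toy_key *k, uint64_t r)
{
    k->bf  = bl_powmod(r, k->e, k->n);   /* r^e mod n  */
    k->ubf = bl_invmod(r, k->n);         /* r^-1 mod n */
}

/* Blind, exponentiate, unblind, then mutate the pair by squaring. */
uint64_t toy_sign_blinded(struct toy_key *k, uint64_t m)
{
    uint64_t blinded = bl_mulmod(m, k->bf, k->n);
    uint64_t s = bl_powmod(blinded, k->d, k->n);  /* m^d * r mod n */
    s = bl_mulmod(s, k->ubf, k->n);               /* m^d mod n     */
    k->bf  = bl_mulmod(k->bf,  k->bf,  k->n);
    k->ubf = bl_mulmod(k->ubf, k->ubf, k->n);
    return s;
}
```

The point of the sketch is that blinding and unblinding are each one modular multiplication and mutation is two modular squarings per signature, which is the work the tables attribute to the "blind/unblind message" and "blinding factor mutation" rows.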

Finally, note that there is a measurable penalty for unpacking libtfm
fp_ints (little-endian structs) out to big-endian bytestrings, so that
they can be fed to little-endian modexp cores. We could win some cycles
by copying little-endian to little-endian, if we are willing to tie
libhal to a specific bignum model and a specific core model, which is
actually what we're doing in the driver code already.
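To make the trade-off concrete, here is a sketch of the two ways of getting a bignum out of a little-endian digit array. The toy_int struct is a stand-in I made up for illustration; it is not libtfm's fp_int, and neither function is the unpack_fp from the branch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TOY_DIGITS 4   /* 128-bit toy bignum */

/* Hypothetical bignum: 32-bit digits, least significant digit first. */
struct toy_int {
    uint32_t dp[TOY_DIGITS];
    int used;              /* number of digits in use */
};

/* Unpack to a fixed-length big-endian bytestring, zero-padded on the
 * left. This touches every byte individually. */
void toy_to_big_endian(const struct toy_int *a, uint8_t *out, size_t len)
{
    memset(out, 0, len);
    for (size_t i = 0; i < len && i / 4 < (size_t)a->used; i++) {
        uint32_t digit = a->dp[i / 4];
        out[len - 1 - i] = (uint8_t)(digit >> (8 * (i % 4)));
    }
}

/* Copy digits straight into a little-endian word buffer, zero-padded
 * at the top. This is a single memcpy plus a memset. */
void toy_to_little_endian_words(const struct toy_int *a,
                                uint32_t *out, size_t words)
{
    size_t n = (size_t)a->used < words ? (size_t)a->used : words;
    memcpy(out, a->dp, n * sizeof *out);
    memset(out + n, 0, (words - n) * sizeof *out);
}
```

The second form only works if the digit width and word order of the bignum library match what the core expects, which is the coupling described above.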

                                        modexpa7        modexpng
parallel-signatures                     149.597         109.626
RPC overhead                             24.034          23.919
hal_rpc_pkey_sign                       125.563          85.707
  hal_ks_fetch                           17.303          17.303
    hal_mkm_get_kek                       0.627           0.627
    hal_aes_keyunwrap                    16.596          16.596
  pkey_local_sign_rsa                   108.177          68.323
    hal_rsa_private_key_from_der          0.687           0.687
    hal_rsa_decrypt                     107.474          67.628
      rsa_crt                           103.724          63.878
        blinding factor mutation         11.460          11.460
        blind/unblind message            12.039
        modexp2/ng                       77.428          52.398
          unpack_fp                      10.422          26.769
          hal_modexp2/ng                 66.920          25.550

The second table is all about the first signature, where we calculate
the blinding factors, the modulus coefficient, and the Montgomery
factor. The blinding factors require doing a modexp, which requires
calculating the factors for N, while the signing requires calculating
the factors for P and Q.
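For readers who don't have CRT signing in their heads, here is a toy sketch of why signing works modulo P and Q rather than N. It uses the tiny textbook key p = 61, q = 53 and Garner's recombination; the names are illustrative and this is not the rsa_crt code from libhal.

```c
#include <assert.h>
#include <stdint.h>

/* Square-and-multiply modular exponentiation (toy sizes only). */
static uint64_t crt_powmod(uint64_t base, uint64_t exp, uint64_t n)
{
    uint64_t r = 1 % n;
    base %= n;
    while (exp) {
        if (exp & 1)
            r = (r * base) % n;
        base = (base * base) % n;
        exp >>= 1;
    }
    return r;
}

/* Hypothetical toy CRT key: primes, reduced exponents, and q^-1 mod p. */
struct toy_crt_key {
    uint64_t p, q;
    uint64_t dP;    /* d mod (p - 1) */
    uint64_t dQ;    /* d mod (q - 1) */
    uint64_t qInv;  /* q^-1 mod p    */
};

/* Two half-size exponentiations, then Garner's recombination. */
uint64_t toy_rsa_crt(const struct toy_crt_key *k, uint64_t m)
{
    uint64_t sp = crt_powmod(m, k->dP, k->p);   /* modexp mod P */
    uint64_t sq = crt_powmod(m, k->dQ, k->q);   /* modexp mod Q */
    /* h = qInv * (sp - sq) mod p, kept non-negative. */
    uint64_t diff = (sp + k->p - (sq % k->p)) % k->p;
    uint64_t h = (k->qInv * diff) % k->p;
    return sq + h * k->q;
}
```

The two exponentiations are independent of each other, and each one needs its own modulus-dependent precalculation, which is why the first signature pays for both P and Q.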

For the modexpa7 core, the "precalc" is done in hardware; for the
modexpng core, it's done in software.

                                        modexpa7        modexpng
  pkey_local_sign_rsa                   910.379         620.545
    hal_rsa_decrypt                     875.589         585.810
      rsa_crt                           871.767         582.056
        create_blinding_factors         706.332         378.762
          modexp                        585.381         257.834
            precalc N                                   246.373
            hal_modexp/ng               577.831           3.941
              precalc N                 570.390
              (rest of modexp)            7.441
        modexp2/ng                      150.675         203.276
          unpack_fp                      10.422          26.769
          precalc P/Q                                   150.878
          hal_modexp2/ng                140.167          25.549
            precalc P/Q                  73.118

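In conclusion, ModExpNG is significantly faster: about 32% less time
for a bulk hal_rpc_pkey_sign (85.7 ms versus 125.6 ms), and about 32%
less for the first signature through pkey_local_sign_rsa (620.5 ms
versus 910.4 ms). It's worth switching over completely to it.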

Also, thanks to Joachim for pointing me at the Cortex-M4's DWT function.
It really made it a lot easier to measure execution time, with a higher
degree of confidence.

				paul

