[Cryptech Tech] ModExpNG timing results
Paul Selkirk
paul at psgd.org
Wed Feb 19 02:26:12 UTC 2020
Warning, this is a little long and a lot wonky.
I did some timing tests to compare the RSA signing performance of
Pavel's "old" ModExpA7 core (in the dual-core configuration we use for
CRT) versus his "new" ModExpNG core.
First I built a bitstream with both cores, at 60MHz bus and core speed.
Then I hacked rsa.c and modexp.c to support both cores, and added a
function hal_modexp_use_modexpng() and CLI command `rsa modexpng on/off`
to switch between them. Finally I added a bunch of instrumentation,
using the ARM's cycle-counting facility, to get fine-grained execution
timing.
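For readers who have not used it, here is a minimal sketch of cycle-count timing with the Cortex-M4 DWT unit. It is not the instrumentation committed on the branch: the helper names are hypothetical, and the clock rate is a parameter because the right value depends on how the MCU is configured. Only the register addresses and bit positions come from the ARMv7-M architecture.

```c
#include <stdint.h>

/* ARMv7-M debug registers (addresses from the architecture manual). */
#define DEMCR       (*(volatile uint32_t *)0xE000EDFCu)
#define DWT_CTRL    (*(volatile uint32_t *)0xE0001000u)
#define DWT_CYCCNT  (*(volatile uint32_t *)0xE0001004u)

#if defined(__arm__)
/* Enable the free-running cycle counter. Only meaningful on the target. */
static inline void cyccnt_enable(void)
{
    DEMCR |= (1u << 24);    /* TRCENA: power up the DWT unit */
    DWT_CYCCNT = 0;
    DWT_CTRL |= 1u;         /* CYCCNTENA: start counting */
}

static inline uint32_t cyccnt_read(void)
{
    return DWT_CYCCNT;
}
#endif

/* Elapsed cycles between two readings. Unsigned subtraction gives the
 * right answer across a single 32-bit wraparound of the counter. */
static inline uint32_t cycles_elapsed(uint32_t start, uint32_t end)
{
    return end - start;
}

/* Convert a cycle count to milliseconds for a given CPU clock in Hz. */
static inline double cycles_to_ms(uint32_t cycles, uint32_t cpu_hz)
{
    return (double)cycles * 1000.0 / (double)cpu_hz;
}
```

On the target, one would read the counter before and after the call being measured and pass the difference to cycles_to_ms(). The counter is only 32 bits wide, so at a hypothetical 180 MHz clock it wraps roughly every 24 seconds; intervals longer than that need extra bookkeeping.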
This is all committed on branch 'modexpng' in the following repos:
    core/platform/common
    core/platform/alpha
    sw/libhal
    sw/stm32
The following tables are side-by-side comparisons of the two cores,
signing the same message with the same 2048-bit key. All times are in
milliseconds.
The first table is for bulk signing, i.e. the key has been used before,
so the blinding factors have been calculated, as well as the modulus
coefficient and Montgomery factor that the core uses to speed up its
calculations.
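To give a feel for what per-modulus precomputation of this kind involves, here is a word-sized sketch of the two constants that textbook Montgomery multiplication needs: R^2 mod N and -N^-1 mod R, with R = 2^32. This is an assumption about the general technique, not a description of what either core computes internally. All function names are hypothetical, and the modulus is a single 32-bit word rather than 1024 or 2048 bits.

```c
#include <stdint.h>

/* -n^-1 mod 2^32 for odd n, by Newton iteration.
 * Each step doubles the number of correct low bits. */
static uint32_t mont_neg_inv32(uint32_t n)
{
    uint32_t x = n;                 /* n*n == 1 mod 8, so 3 bits are correct */
    for (int i = 0; i < 4; i++)     /* 3 -> 6 -> 12 -> 24 -> 48 bits */
        x *= 2u - n * x;
    return (uint32_t)(0u - x);
}

/* R^2 mod n with R = 2^32: start from 1 and double 64 times mod n. */
static uint32_t mont_r2_mod_n(uint32_t n)
{
    uint64_t x = 1u % n;
    for (int i = 0; i < 64; i++) {
        x <<= 1;
        if (x >= n)
            x -= n;
    }
    return (uint32_t)x;
}

/* Montgomery product a*b*R^-1 mod n, for a, b < n and odd n. */
static uint32_t mont_mul(uint32_t a, uint32_t b, uint32_t n, uint32_t n0inv)
{
    uint64_t t  = (uint64_t)a * b;
    uint32_t m  = (uint32_t)t * n0inv;      /* makes t + m*n divisible by R */
    uint64_t mn = (uint64_t)m * n;
    /* (t + mn) >> 32 without overflowing 64 bits: the low words sum to
     * either 0 or 2^32, and the carry is 1 exactly when low(t) != 0. */
    uint64_t u  = (t >> 32) + (mn >> 32) + ((uint32_t)t != 0);
    if (u >= n)
        u -= n;
    return (uint32_t)u;
}
```

Multiplying a value by R^2 with mont_mul() moves it into Montgomery form, and multiplying by 1 moves it back out. Both constants depend only on the modulus, which is why they can be computed once per key and reused for every later signature.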
The first line is the result of running
libhal/tests/parallel-signatures.py against the board. (For this, I took
the median of 1000 signatures, since the mean would include the first
signature, which would include the blinding factor and precalc times.)
The third line is the result of calling hal_rpc_pkey_sign from the CLI,
and is the mean of 1000 or so runs, after priming it once or twice.
Therefore, the second line is the overhead of the serial RPC mechanism.
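The reason for preferring the median can be shown with a small sketch. The helper names and the sample values in the usage note are made up for illustration; they are not measurements from these tests.

```c
#include <stdlib.h>

/* qsort comparison function for doubles. */
static int cmp_double(const void *a, const void *b)
{
    double x = *(const double *)a, y = *(const double *)b;
    return (x > y) - (x < y);
}

/* Sorts the samples in place and returns the median. n must be > 0. */
static double median_ms(double *samples, size_t n)
{
    qsort(samples, n, sizeof *samples, cmp_double);
    if (n % 2)
        return samples[n / 2];
    return (samples[n / 2 - 1] + samples[n / 2]) / 2.0;
}

/* Arithmetic mean. n must be > 0. */
static double mean_ms(const double *samples, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += samples[i];
    return sum / (double)n;
}
```

With made-up samples of 900, 110, 109, 111 and 110 ms, the single slow first run pulls the mean up to 268 ms, while the median stays at 110 ms.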
Note that this version does blinding factor mutation in software for
both cores. The modexpng core does this for free, but would require some
code changes to read out from the core, so this is an area where
modexpng could end up even faster.
Note also that message blinding/unblinding is done in software for
modexpa7, but in hardware for modexpng.
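For readers unfamiliar with the terms, here is a toy sketch of RSA blinding with word-sized numbers. It assumes the usual construction (blinding factor r^e, unblinding factor r^-1) and assumes mutation by squaring both factors, which is one common approach; it is not copied from rsa.c, and every name in it is hypothetical.

```c
#include <stdint.h>

/* Toy arithmetic: n must be below 2^32 so that a*b fits in 64 bits. */
static uint64_t modmul(uint64_t a, uint64_t b, uint64_t n)
{
    return (a * b) % n;
}

static uint64_t modpow(uint64_t b, uint64_t e, uint64_t n)
{
    uint64_t r = 1 % n;
    b %= n;
    while (e) {
        if (e & 1)
            r = modmul(r, b, n);
        b = modmul(b, b, n);
        e >>= 1;
    }
    return r;
}

/* Modular inverse by extended Euclid; assumes gcd(a, n) == 1. */
static uint64_t modinv(uint64_t a, uint64_t n)
{
    int64_t t = 0, newt = 1;
    int64_t r = (int64_t)n, newr = (int64_t)(a % n);
    while (newr != 0) {
        int64_t q = r / newr, tmp;
        tmp = t - q * newt; t = newt; newt = tmp;
        tmp = r - q * newr; r = newr; newr = tmp;
    }
    return (uint64_t)(t < 0 ? t + (int64_t)n : t);
}

struct blinding { uint64_t bf, ubf; };  /* bf = r^e mod n, ubf = r^-1 mod n */

/* The expensive step: creating the pair needs a full modexp. */
static struct blinding blinding_create(uint64_t r, uint64_t e, uint64_t n)
{
    struct blinding b = { modpow(r, e, n), modinv(r, n) };
    return b;
}

/* The cheap step: (r^e)^2 = (r^2)^e and (r^-1)^2 = (r^2)^-1,
 * so squaring both halves keeps them a matched pair. */
static void blinding_mutate(struct blinding *b, uint64_t n)
{
    b->bf  = modmul(b->bf,  b->bf,  n);
    b->ubf = modmul(b->ubf, b->ubf, n);
}

static uint64_t sign_blinded(uint64_t m, uint64_t d, uint64_t n,
                             const struct blinding *b)
{
    uint64_t blinded = modmul(m, b->bf, n);   /* m * r^e           */
    uint64_t s = modpow(blinded, d, n);       /* (m * r^e)^d = m^d * r */
    return modmul(s, b->ubf, n);              /* m^d               */
}
```

The blind and unblind steps are the two modmul() calls around the exponentiation in sign_blinded(), and the mutation is the pair of squarings done between signatures. Creating the factors in the first place costs a modexp, which is why the first signature with a key is so much slower than later ones.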
Finally note that there is a measurable penalty for unpacking libtfm
fp_ints (little-endian structs) out to big-endian bytestrings, so that
they can be fed to little-endian modexp cores. We could win some cycles
by copying little-endian to little-endian, if we are willing to tie
libhal to a specific bignum model and a specific core model, which is
actually what we're doing in the driver code already.
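The two options can be sketched as follows. The struct is a hypothetical stand-in for a bignum with 32-bit digits stored least significant first; it is not the libtfm fp_int definition, and neither function is taken from libhal.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical bignum: 32-bit digits, least significant digit first. */
struct toy_bignum {
    uint32_t dp[4];
    size_t used;        /* number of digits in use */
};

/* Unpack to a big-endian byte string, left-padded with zeros to len bytes.
 * Returns 0 on success, -1 if len is too small for the digits in use. */
static int toy_unpack_be(const struct toy_bignum *a, uint8_t *out, size_t len)
{
    if (len < a->used * 4)
        return -1;
    memset(out, 0, len);
    for (size_t i = 0; i < a->used; i++) {
        uint32_t w = a->dp[i];
        uint8_t *p = out + len - 4 * (i + 1);   /* digit 0 lands at the end */
        p[0] = (uint8_t)(w >> 24);
        p[1] = (uint8_t)(w >> 16);
        p[2] = (uint8_t)(w >> 8);
        p[3] = (uint8_t)w;
    }
    return 0;
}

/* The shortcut: copy digits straight into a little-endian word buffer,
 * zero-filling the high words, with no byte shuffling at all. */
static void toy_copy_le_words(const struct toy_bignum *a, uint32_t *out,
                              size_t nwords)
{
    for (size_t i = 0; i < nwords; i++)
        out[i] = (i < a->used) ? a->dp[i] : 0;
}
```

The first function touches every byte and produces a format that then has to be converted again; the second is a plain word copy, but only works if the bignum digit size and the core's word order are both known to the caller.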
                                         modexpa7   modexpng
parallel-signatures                       149.597    109.626
  RPC overhead                             24.034     23.919
  hal_rpc_pkey_sign                       125.563     85.707
    hal_ks_fetch                           17.303     17.303
      hal_mkm_get_kek                       0.627      0.627
      hal_aes_keyunwrap                    16.596     16.596
    pkey_local_sign_rsa                   108.177     68.323
      hal_rsa_private_key_from_der          0.687      0.687
      hal_rsa_decrypt                     107.474     67.628
        rsa_crt                           103.724     63.878
          blinding factor mutation         11.460     11.460
          blind/unblind message            12.039
          modexp2/ng                       77.428     52.398
            unpack_fp                      10.422     26.769
            hal_modexp2/ng                 66.920     25.550
The second table is all about the first signature, where we calculate
the blinding factors, the modulus coefficient, and the Montgomery
factor. The blinding factors require doing a modexp, which requires
calculating the factors for N, while the signing requires calculating
the factors for P and Q.
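Signing works modulo P and Q rather than N because of the CRT shortcut, which replaces one full-size exponentiation with two half-size ones. A toy sketch with word-sized numbers, using the standard Garner recombination; the names are hypothetical and this is not the rsa_crt code from libhal:

```c
#include <stdint.h>

/* Toy arithmetic: moduli must be below 2^32 so products fit in 64 bits. */
static uint64_t modpow(uint64_t b, uint64_t e, uint64_t n)
{
    uint64_t r = 1 % n;
    b %= n;
    while (e) {
        if (e & 1)
            r = (r * b) % n;
        b = (b * b) % n;
        e >>= 1;
    }
    return r;
}

/* Modular inverse by extended Euclid; assumes gcd(a, n) == 1. */
static uint64_t modinv(uint64_t a, uint64_t n)
{
    int64_t t = 0, newt = 1;
    int64_t r = (int64_t)n, newr = (int64_t)(a % n);
    while (newr != 0) {
        int64_t q = r / newr, tmp;
        tmp = t - q * newt; t = newt; newt = tmp;
        tmp = r - q * newr; r = newr; newr = tmp;
    }
    return (uint64_t)(t < 0 ? t + (int64_t)n : t);
}

struct toy_crt_key { uint64_t p, q, dP, dQ, qInv; };

static struct toy_crt_key toy_crt_key_init(uint64_t p, uint64_t q, uint64_t d)
{
    struct toy_crt_key k = { p, q, d % (p - 1), d % (q - 1), modinv(q % p, p) };
    return k;
}

/* s = m^d mod (p*q), computed from two smaller exponentiations. */
static uint64_t toy_rsa_crt(uint64_t m, const struct toy_crt_key *k)
{
    uint64_t s1 = modpow(m, k->dP, k->p);       /* m^dP mod p */
    uint64_t s2 = modpow(m, k->dQ, k->q);       /* m^dQ mod q */
    uint64_t diff = (s1 + k->p - s2 % k->p) % k->p;
    uint64_t h = (k->qInv * diff) % k->p;       /* qInv * (s1 - s2) mod p */
    return s2 + h * k->q;
}
```

The two modpow() calls are independent, so they can run side by side, and each one works with its own modulus, so each needs its own precomputed constants.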
For the modexpa7 core, the "precalc" is done in hardware; for the
modexpng core, it's done in software.
                                         modexpa7   modexpng
pkey_local_sign_rsa                       910.379    620.545
  hal_rsa_decrypt                         875.589    585.810
    rsa_crt                               871.767    582.056
      create_blinding_factors             706.332    378.762
        modexp                            585.381    257.834
          precalc N                                  246.373
          hal_modexp/ng                   577.831      3.941
            precalc N                     570.390
            (rest of modexp)                7.441
      modexp2/ng                          150.675    203.276
        unpack_fp                          10.422     26.769
        precalc P/Q                                  150.878
        hal_modexp2/ng                    140.167     25.549
          precalc P/Q                      73.118
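In conclusion, ModExpNG is significantly faster (about 27% less time per
bulk signature as measured by parallel-signatures, and about 32% less
for a first signature in pkey_local_sign_rsa), and it's worth switching
over completely to it.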
Also, thanks to Joachim for pointing me at the Cortex-M4's DWT function.
It really made it a lot easier to measure execution time, with a higher
degree of confidence.
paul