[Cryptech Core] further improvements in RSA signing performance
Paul Selkirk
paul at psgd.org
Mon Mar 16 18:08:57 UTC 2020
Hey guys, how would you like 100+ sig/sec?
I saw Pavel's README comments about NUM_MULTS (default 8), and tried
it with values of 16 and 32. It turned out that a number of functions
assumed a value of 8, but I think I got them all eventually. At any
rate, it now works for me like this:
    NUM_MULTS =  8, 4 cores:  80.6 sig/sec
    NUM_MULTS = 16, 3 cores:  99.5 sig/sec
    NUM_MULTS = 32, 3 cores: 116.3 sig/sec
(libhal/tests/parallel-signatures.py, 4 signers, 100K signatures)
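For a rough feel of what these rates mean per signer, here is a small sketch that converts sig/sec into time per signature and time per 100K run. It assumes each of the 4 signers keeps exactly one request in flight at a time, which the message does not state; the function names are made up for illustration.

```python
# Reported throughput from the message (sig/sec), keyed by NUM_MULTS.
throughput = {8: 80.6, 16: 99.5, 32: 116.3}
SIGNERS = 4            # concurrent signers in the test run
TOTAL_SIGS = 100_000   # signatures per run


def per_signer_latency_ms(sig_per_sec, signers=SIGNERS):
    """Mean time one signer spends per signature, assuming each signer
    keeps exactly one request in flight (an assumption, not stated)."""
    return signers / sig_per_sec * 1000.0


def run_time_s(sig_per_sec, total=TOTAL_SIGS):
    """Wall-clock seconds to finish the whole run at a steady rate."""
    return total / sig_per_sec


for n, rate in throughput.items():
    print(f"NUM_MULTS={n:2d}: "
          f"{per_signer_latency_ms(rate):6.2f} ms/sig per signer, "
          f"{run_time_s(rate) / 60:5.1f} min per run, "
          f"{rate / throughput[8]:.2f}x vs NUM_MULTS=8")
```

Under that assumption the per-signer figures come out close to the parallel-signatures row of the ms/sig breakdown further down. Keep in mind the NUM_MULTS=8 run used 4 cores and the other two used 3, so the ratios are not a like-for-like comparison of NUM_MULTS alone.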
To measure core utilization, I temporarily changed the core allocation
strategy from LRU to first-available, and added allocation counters.
Below is core utilization for NUM_MULTS=8 with 4 cores:
    3300: modexpng 0.21   46154   41.6%
    4300: modexpng 0.21   44961   40.5%
    5300: modexpng 0.21   19279   17.4%
    6300: modexpng 0.21     617    0.5%
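To illustrate why first-available allocation exposes utilization (LRU would spread the work evenly across all cores whether or not they are needed), here is a minimal sketch. It is not the libhal allocator; the class, method names, and the use of the leading numbers as core labels are all hypothetical.

```python
from collections import Counter


def shares(counts):
    """Percentage of all allocations that went to each core."""
    total = sum(counts.values())
    return {name: (100.0 * c / total if total else 0.0)
            for name, c in counts.items()}


class FirstAvailableAllocator:
    """Hypothetical stand-in for the core allocator: always hand out the
    first free core in a fixed order, and count how often each is picked."""

    def __init__(self, core_names):
        self.cores = list(core_names)
        self.busy = set()
        self.alloc_count = Counter({name: 0 for name in self.cores})

    def alloc(self):
        for name in self.cores:          # scan in fixed order
            if name not in self.busy:
                self.busy.add(name)
                self.alloc_count[name] += 1
                return name
        return None                      # all cores busy; caller must wait

    def free(self, name):
        self.busy.discard(name)

    def utilization(self):
        return shares(self.alloc_count)


# Allocation counts as reported in the table above.
reported = {"3300": 46154, "4300": 44961, "5300": 19279, "6300": 617}
for name, pct in shares(reported).items():
    print(f"{name}: {pct:.2f}%")
```

With this policy a later core is only touched when every earlier core is busy at the same moment, so the counts show how much concurrency the workload actually demands. Recomputing the shares from the reported counts reproduces the first three percentages; the last one works out to roughly 0.56%.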
Since the 4th core is essentially unused (and given the difficulty of
synthesizing a bitstream with a lot of complicated cores), I built
NUM_MULTS=16 and =32 with only 3 modexpng cores.
Here is the current breakdown of what goes into signing (in ms/sig),
for different values of NUM_MULTS:
                                            8       16       32
parallel-signatures                    48.502   41.368   34.507
  RPC overhead                         28.146   28.476   25.077
  hal_rpc_pkey_sign                    20.356   12.892    9.430
    hal_ks_fetch                        1.520
      hal_mkm_get_kek                   0.415
      hal_aes_keyunwrap                 1.021
    pkey_local_sign_rsa                18.790   11.316    7.854
      hal_rsa_private_key_from_der      0.557
      hal_rsa_decrypt                  18.223   10.749    7.287
        rsa_crt                        18.103   10.629    7.167
          modexpng                     18.089   10.615    7.154
            unpack_fp                   0.325
            hal_modexpng               17.643   10.169    6.707
              write key comps           0.633
              calculation              17.010    9.535    6.074
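A quick sanity check on these numbers, as a sketch: it confirms that RPC overhead plus hal_rpc_pkey_sign adds up to the parallel-signatures total in each column, then shows how the share of each changes with NUM_MULTS. The dictionary names are just for illustration; the figures are copied from the breakdown.

```python
# ms/sig figures copied from the breakdown, keyed by NUM_MULTS.
total       = {8: 48.502, 16: 41.368, 32: 34.507}   # parallel-signatures
rpc         = {8: 28.146, 16: 28.476, 32: 25.077}   # RPC overhead
pkey_sign   = {8: 20.356, 16: 12.892, 32: 9.430}    # hal_rpc_pkey_sign
calculation = {8: 17.010, 16: 9.535,  32: 6.074}    # modexpng calculation

for n in (8, 16, 32):
    # The two second-level rows should add up to the top-level row.
    assert abs(rpc[n] + pkey_sign[n] - total[n]) < 0.002
    print(f"NUM_MULTS={n:2d}: "
          f"RPC overhead {100 * rpc[n] / total[n]:4.1f}% of total, "
          f"calculation {100 * calculation[n] / total[n]:4.1f}%, "
          f"calculation speedup vs 8: {calculation[8] / calculation[n]:.2f}x")
```

The core calculation gets roughly 2.8x faster going from 8 to 32, but the end-to-end time improves much less because RPC overhead stays in the 25-28 ms range and ends up as the larger part of each signature.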
It may be possible to squeeze out a little more performance with one or
more of the following:
- Integrating hal_mkm_get_kek into the keywrap core. Mainlining the
keywrap core is my next project, then merging the integrate_mkmif
branch. This wouldn't save the full 0.415ms, because the keywrap core
would still need to fetch the KEK, but it would save some FMC I/O, and
make the whole thing more secure.
- Store key components and blinding factors as bignum arrays rather
than fp_int. Once the key is generated and the precalc factors are
calculated, we're not doing any more bignum math in software. This
would save 0.325ms.
- If the same key is being re-used, don't bother to re-write the key
components to the modexpng core, just the message and the blinding
factors. This could save 0.633ms, but would add state and complexity
to the signing tasks.
Note that all of these gain us fractions of milliseconds.
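To make the "state and complexity" of the third item concrete, here is a toy sketch of remembering which key a core last had loaded. It only counts writes and does not talk to hardware; the class, its fields, and the key_id notion are hypothetical and are not libhal names.

```python
class ModexpCoreSketch:
    """Hypothetical model of one modexpng core; it only counts I/O."""

    def __init__(self):
        self.loaded_key_id = None   # the extra state the message warns about
        self.key_writes = 0
        self.msg_writes = 0

    def sign(self, key_id, message, blinding):
        if self.loaded_key_id != key_id:
            self.key_writes += 1          # stands in for "write key comps"
            self.loaded_key_id = key_id
        self.msg_writes += 1              # message and blinding always go out
        return ("signature-placeholder", key_id, message)

    def invalidate(self):
        """Forget the loaded key, e.g. when the key is deleted or the
        core is handed to a different task."""
        self.loaded_key_id = None


core = ModexpCoreSketch()
for msg in (b"a", b"b", b"c"):
    core.sign("key-1", msg, blinding=None)
core.sign("key-2", b"d", blinding=None)
print(core.key_writes, core.msg_writes)
```

Every path that changes or removes a key, or moves a core between tasks, would have to call something like invalidate(), which is where the added complexity comes from. For scale, the three items together cap out at 0.415 + 0.325 + 0.633 = 1.373 ms, about 4% of the 34.507 ms per signature measured at NUM_MULTS=32.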
Clocking the cores at 180MHz should (in theory) cut about 3ms per signature.
paul