[Cryptech Core] further improvements in RSA signing performance

Paul Selkirk paul at psgd.org
Mon Mar 16 18:08:57 UTC 2020


Hey guys, how would you like 100+ sig/sec?

I saw Pavel's README comments about NUM_MULTS (default 8), and tried
it with values of 16 and 32. It turned out that a number of functions
assumed a value of 8, but I think I got them all eventually. At any
rate, it now works for me like this:

NUM_MULTS =  8, 4 cores:  80.6 sig/sec
NUM_MULTS = 16, 3 cores:  99.5 sig/sec
NUM_MULTS = 32, 3 cores: 116.3 sig/sec

(libhal/tests/parallel-signatures.py, 4 signers, 100K signatures)

To measure core utilization, I temporarily changed the core allocation
strategy from LRU to first-available, and added allocation counters.
Below is core utilization (allocation count and share of all
allocations) for NUM_MULTS=8 with 4 cores:

3300: modexpng 0.21 46154  41.6%
4300: modexpng 0.21 44961  40.5%
5300: modexpng 0.21 19279  17.4%
6300: modexpng 0.21 617     0.5%

Since the 4th core is essentially unused (and given the difficulty of
synthesizing a bitstream with a lot of complicated cores), I built
NUM_MULTS=16 and =32 with only 3 modexpng cores.
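
For illustration only, here is a minimal Python sketch of first-available
allocation with per-core counters. It is a toy model, not the libhal
allocator: the function name, the (start, duration) job format, and the
rule that a job waits for the earliest-free core when all are busy are
assumptions made for this example.

```python
from collections import Counter


def count_allocations(num_cores, jobs):
    """Replay (start, duration) jobs and count allocations per core.

    Hypothetical model: each job takes the lowest-numbered core that is
    idle at its start time; if every core is busy, it waits for the core
    that frees up first.
    """
    free_at = [0.0] * num_cores      # time at which each core becomes idle
    counts = Counter()
    for start, duration in sorted(jobs):
        idle = [i for i, t in enumerate(free_at) if t <= start]
        if idle:
            core, begin = idle[0], start          # first available core
        else:
            core = min(range(num_cores), key=free_at.__getitem__)
            begin = free_at[core]                 # wait for earliest-free core
        free_at[core] = begin + duration
        counts[core] += 1
    return counts


if __name__ == "__main__":
    # Made-up workload: three overlapping jobs, then one after a gap.
    demo = count_allocations(4, [(0, 10), (1, 10), (2, 10), (20, 10)])
    total = sum(demo.values())
    for core in range(4):
        print(f"core {core}: {demo[core]} allocations "
              f"({100.0 * demo[core] / total:.1f}%)")
```

With this strategy the highest-numbered core is only allocated when all
lower-numbered cores are busy at once, which is why its counter works as a
measure of how often the extra core is actually needed.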

Here is the current breakdown of what goes into signing (in ms/sig),
for different values of NUM_MULTS:

NUM_MULTS                              8        16        32
parallel-signatures                 48.502    41.368    34.507
RPC overhead                        28.146    28.476    25.077
hal_rpc_pkey_sign                   20.356    12.892     9.430
  hal_ks_fetch                       1.520
    hal_mkm_get_kek                  0.415
    hal_aes_keyunwrap                1.021
  pkey_local_sign_rsa               18.790    11.316     7.854
    hal_rsa_private_key_from_der     0.557
    hal_rsa_decrypt                 18.223    10.749     7.287
      rsa_crt                       18.103    10.629     7.167
        modexpng                    18.089    10.615     7.154
          unpack_fp                  0.325
          hal_modexpng              17.643    10.169     6.707
            write key comps          0.633
            calculation             17.010     9.535     6.074
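
As a rough cross-check of the table, the following Python sketch computes
what share of each signature is spent in RPC overhead versus the modexpng
calculation. The figures are copied from the table; the dictionary and
function names are only for this example.

```python
# ms/sig figures copied from the table above, keyed by NUM_MULTS
totals = {8: 48.502, 16: 41.368, 32: 34.507}        # parallel-signatures
rpc_overhead = {8: 28.146, 16: 28.476, 32: 25.077}  # RPC overhead
calculation = {8: 17.010, 16: 9.535, 32: 6.074}     # calculation


def share(part, whole):
    """Return part as a percentage of whole."""
    return 100.0 * part / whole


for n in sorted(totals):
    print(f"NUM_MULTS={n:2d}: "
          f"RPC overhead {share(rpc_overhead[n], totals[n]):.1f}%, "
          f"calculation {share(calculation[n], totals[n]):.1f}%")

# Speedup of the calculation step alone, going from 8 to 32 multipliers
speedup = calculation[8] / calculation[32]
print(f"calculation speedup 8 -> 32: {speedup:.2f}x")
```

By these numbers, RPC overhead goes from roughly 58% of the time per
signature at NUM_MULTS=8 to roughly 73% at NUM_MULTS=32, while the
calculation itself gets about 2.8 times faster.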

It may be possible to squeeze out a little more performance with one
or more of the following:

- Integrate hal_mkm_get_kek into the keywrap core. Mainlining the
keywrap core is my next project, then merging the integrate_mkmif
branch. This wouldn't save the full 0.415ms, because the keywrap core
would still need to fetch the KEK, but it would save some FMC I/O, and
make the whole thing more secure.

- Store key components and blinding factors as bignum arrays rather
than fp_int. Once the key is generated and the precalc factors are
calculated, we're not doing any more bignum math in software. This
would save 0.325ms.

- If the same key is being re-used, don't bother to re-write the key
components to the modexpng core, just the message and the blinding
factors. This could save 0.633ms, but would add state and complexity
to the signing tasks.

Note that each of these gains us only a fraction of a millisecond.
By comparison, clocking the cores at 180MHz should (in theory) cut
about 3ms.
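
To put the three software-side savings in proportion, this small Python
sketch adds them up and compares the sum against hal_rpc_pkey_sign for
each NUM_MULTS. It assumes the best case, where each saving is realized
in full, even though the hal_mkm_get_kek figure is only an upper bound.
The variable names are only for this example.

```python
# Potential savings in ms/sig, taken from the list above
savings_ms = {
    "hal_mkm_get_kek (upper bound)": 0.415,
    "unpack_fp": 0.325,
    "write key comps": 0.633,
}

# hal_rpc_pkey_sign in ms/sig, from the breakdown table, keyed by NUM_MULTS
sign_ms = {8: 20.356, 16: 12.892, 32: 9.430}

best_case = sum(savings_ms.values())
print(f"best-case combined saving: {best_case:.3f} ms/sig")

for n in sorted(sign_ms):
    pct = 100.0 * best_case / sign_ms[n]
    print(f"NUM_MULTS={n:2d}: {pct:.1f}% of hal_rpc_pkey_sign")
```

Even in the best case the three changes together come to under 1.4ms
per signature.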

				paul

