[Cryptech Core] further improvements in RSA signing performance
Paul Selkirk
paul at psgd.org
Tue Mar 10 22:51:56 UTC 2020
Recall that, when we last left things with modexpng + keywrap cores, we
had sigs/sec in this range, with 1-4 simultaneous signers.
1: 13.358
2: 22.188
3: 25.696
4: 25.688
While I was digging into this, I noticed that libtfm was being built
with BITS=8192. This is the largest size of number that we expect to
be handling, but (for reasons of bignum math overflow handling)
storage is twice that, or 16Kbit (2KB) per number.
However, the modexp cores have been configured to handle a maximum key
size of 4096 bits. (This is a limit we could (but don't) check for in
the software driver.) Also, an 8192-bit key, with all the extra
fields, would have a DER size of about 8824 bytes, which exceeds the
keystore block size of 8192 bytes. So there's really no point in
setting BITS above 4096.
When I rebuilt libtfm with the smaller bignum size, I immediately (and
surprisingly) saw 16-36% improvement in signing speeds:
1: 15.562
2: 28.384
3: 35.140
4: 35.105
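For anyone who wants to check the percentages, here is a small Python sketch that recomputes the per-signer gain from the two tables above. The list and function names are made up for illustration; the numbers are copied from this message.

```python
# sigs/sec for 1-4 simultaneous signers, copied from the two tables above
before = [13.358, 22.188, 25.696, 25.688]  # libtfm built with BITS=8192
after = [15.562, 28.384, 35.140, 35.105]   # libtfm rebuilt with the smaller bignum size


def improvement_pct(old, new):
    """Relative speedup of new over old, in percent."""
    return (new / old - 1.0) * 100.0


gains = [improvement_pct(b, a) for b, a in zip(before, after)]
for signers, gain in enumerate(gains, start=1):
    print(f"{signers}: {gain:.1f}%")
```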
At this point, I was still using a minimally-resourced bitstream, with
only 1 modexpng core. So I built a bitstream with 4 modexpng cores,
but (surprisingly), that didn't perform significantly better than 1
core.
Digging further, I noticed that hal_rsa_private_key_from_der, which
(mostly) copies byte strings into fp_ints, had much better performance
than unpack_fp, which copies fp_ints into byte strings. Lo and behold,
fp_to_unsigned_bin is almost pathologically pessimized. Copying code
from fp_read_unsigned_bin, I saw a further improvement of 44-105% in
signing speeds:
1: 22.426
2: 43.913
3: 60.711
4: 72.244
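To illustrate the kind of difference involved, here is a Python sketch of two ways to turn a bignum into a big-endian byte string. This is not the libtfm code. It assumes a bignum stored as a list of 32-bit digits, least significant first, and both function names are hypothetical. The first version peels off one byte and then shifts the whole number, so its cost grows with the square of the size. The second walks the digits once.

```python
DIGIT_BITS = 32  # assumed digit width for this sketch


def to_bytes_by_shifting(digits):
    """Slow: take the low byte, shift the whole number right by 8, repeat."""
    t = list(digits)  # working copy of the whole number
    out = bytearray()
    while any(t):
        out.append(t[0] & 0xFF)
        carry = 0
        # Every output byte costs a pass over every digit.
        for i in reversed(range(len(t))):
            cur = t[i]
            t[i] = (cur >> 8) | (carry << (DIGIT_BITS - 8))
            carry = cur & 0xFF
    out.reverse()  # bytes were produced least significant first
    return bytes(out)


def to_bytes_direct(digits):
    """Fast: walk the digits once, most significant first, and emit their bytes."""
    out = bytearray()
    for d in reversed(digits):
        out += d.to_bytes(DIGIT_BITS // 8, "big")
    return bytes(out.lstrip(b"\x00"))  # drop leading zero bytes


value = [0x89ABCDEF, 0x01234567]  # represents 0x0123456789ABCDEF
print(to_bytes_direct(value).hex())
print(to_bytes_by_shifting(value) == to_bytes_direct(value))
```

Both functions produce the same bytes. Only the amount of work differs, and that matters more as the numbers get larger.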
Note that these tests are for 1000 signatures. Bumping it to 10000
signatures, I can drive the performance over 80 sig/sec, because it
amortizes the initial precalc across more signatures:
1: 22.565
2: 45.577
3: 64.916
4: 80.128
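As a rough way to see the amortization, the sketch below fits a simple model, total time = fixed cost + N * per-signature cost, to the two 4-signer results above. The linear model is an assumption, and the "fixed cost" it produces lumps together the precalc and anything else that happens once per run. The function name is made up for illustration.

```python
def fit_fixed_and_per_sig(n1, rate1, n2, rate2):
    """Solve total_time = fixed + n * per_sig from two (count, sigs/sec) runs."""
    t1 = n1 / rate1  # total seconds for the first run
    t2 = n2 / rate2  # total seconds for the second run
    per_sig = (t2 - t1) / (n2 - n1)
    fixed = t1 - n1 * per_sig
    return fixed, per_sig


# 4 simultaneous signers: 1000 signatures vs 10000 signatures
fixed, per_sig = fit_fixed_and_per_sig(1000, 72.244, 10000, 80.128)
print(f"fixed cost ~{fixed:.2f} s, steady-state ~{1.0 / per_sig:.1f} sig/sec")
```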
GUYS! This is legit 10x the last releng version!
Anyway, this is the current breakdown of what goes into signing (in
ms/sig).
parallel-signatures                       43.869
  RPC overhead                            23.483
  hal_rpc_pkey_sign                       20.386
    hal_ks_fetch                           1.529
      hal_mkm_get_kek                      0.419
      hal_aes_keyunwrap                    1.026
    pkey_local_sign_rsa                   18.811
      hal_rsa_private_key_from_der         0.557
      hal_rsa_decrypt                     18.243
        rsa_crt                           18.124
          modexpng                        18.110
            unpack_fp                      0.325
            hal_modexpng                  17.663
An important thing to notice is that the RPC overhead
(marshalling/unmarshalling, serial communications, dispatch to an RPC
task, etc.) is now more than the signing time itself (23.483 vs 20.386
ms/sig).
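To put the split in proportion, here is a short Python sketch that turns a few of the ms/sig figures from the breakdown into shares of the total. The variable names are made up for illustration; the numbers are copied from this message.

```python
# ms/sig, copied from the breakdown above
total = 43.869   # parallel-signatures
rpc = 23.483     # RPC overhead
sign = 20.386    # hal_rpc_pkey_sign
modexp = 17.663  # hal_modexpng

rows = (
    ("RPC overhead", rpc),
    ("hal_rpc_pkey_sign", sign),
    ("hal_modexpng", modexp),
)
for name, ms in rows:
    print(f"{name:20s} {ms:7.3f} ms  {100.0 * ms / total:5.1f}% of total")
```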
And this is still with the modexpng core at 90MHz. Just think what
180MHz would buy us.
paul