[Cryptech Tech] Happier RSA timing numbers
Rob Austein
sra at hactrn.net
Wed May 23 05:14:23 UTC 2018
Summary: 10.3 sig/sec throughput for 2048-bit RSA with eight Modexp
cores and four AES cores (Joachim's most recent version).
AES keyunwrap still dominates, but less of it is thumb
twiddling waiting for the AES core.
There are some relatively minor improvements we might be able to make
to the FMC I/O code (remove some vestigial stuff which dates back to
the bridge board and preemptive tasking, move byteswapping to the FPGA
where we can do it with wires), but none of it's likely to be radical.
Might be worth doing anyway, since the ARM is underpowered and some of
the vestigial stuff chews up ARM CPU time to no useful purpose.
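To illustrate why byteswapping is cheap "with wires": reversing the bytes of a 32-bit word is a fixed permutation of bit positions, so in hardware it needs no logic at all. The sketch below is Python for illustration only, not firmware code. The names `bswap32`, `bswap32_wires` and `WIRING` are hypothetical, and it assumes the swap in question is a plain 32-bit byte reversal.

```python
def bswap32(word: int) -> int:
    """Reverse the byte order of a 32-bit word, as software on the CPU would."""
    return int.from_bytes(word.to_bytes(4, "little"), "big")


# The same operation as a fixed wiring table: output bit i is connected
# straight to input bit WIRING[i]. Output byte k comes from input byte 3 - k,
# and the bit position within the byte is unchanged.
WIRING = [8 * (3 - i // 8) + i % 8 for i in range(32)]


def bswap32_wires(word: int) -> int:
    """Byte swap expressed purely as a permutation of bit positions."""
    return sum(((word >> WIRING[i]) & 1) << i for i in range(32))


if __name__ == "__main__":
    w = 0x11223344
    print(hex(bswap32(w)), hex(bswap32_wires(w)))
```

Both functions give the same result for any 32-bit input, and applying either one twice returns the original word.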
The following call graph excerpt is from a test version of the
firmware with the "vestigial stuff" removed from the FMC code (it seems
to work; at least, it passed all unit tests as well as a signature run
with eight clients and validation enabled):
index % time    self children  called              name
-----------------------------------------------
                0.00    51.25  3000/3000               hal_rpc_pkey_sign [4]
[5]     88.2    0.00    51.25  3000                pkey_local_sign [5]
                0.00    46.04  3000/3000               hal_ks_fetch [7]
                0.00     5.21  3000/3000               pkey_local_sign_rsa [19]
                0.00     0.00  6000/6806031            memset [44]
                0.00     0.00  3000/33907              hal_critical_section_start [122]
                0.00     0.00  3000/33907              hal_critical_section_end [192]
-----------------------------------------------
                0.00     0.28  37122/6000552           hal_aes_keywrap [57]
                0.33    45.48  5963430/6000552         hal_aes_keyunwrap [8]
[6]     79.3    0.33    45.76  6000552             do_block [6]
                0.76    17.30  6000552/6003854         hal_io_wait [12]
                1.96    12.44  18001656/19712064       hal_io_write [14]
                1.05    12.25  12001104/21468882       hal_io_read [10]
-----------------------------------------------
                0.00    46.04  3000/3000               pkey_local_sign [5]
[7]     79.2    0.00    46.04  3000                hal_ks_fetch [7]
                0.00    45.82  3000/3000               hal_aes_keyunwrap [8]
                0.00     0.20  3000/3072               hal_ks_lock [65]
                0.00     0.02  3000/3024               hal_mkm_get_kek [85]
                0.00     0.00  3000/6036               hal_ks_index_find [126]
                0.00     0.00  3000/369530             memcpy [83]
                0.00     0.00  3000/3024               ks_volatile_test_owner [205]
                0.00     0.00  3000/6036               hal_ks_block_read_cached [198]
                0.00     0.00  3000/6084               hal_ks_cache_mark_used [195]
                0.00     0.00  3000/3072               hal_ks_unlock [202]
-----------------------------------------------
                0.00    45.82  3000/3000               hal_ks_fetch [7]
[8]     78.9    0.00    45.82  3000                hal_aes_keyunwrap [8]
                0.33    45.48  5963430/6000552         do_block [6]
                0.00     0.01  3000/3024               load_kek [98]
                0.00     0.00  3000/9042               hal_core_free [90]
                0.00     0.00  3000/3042               hal_core_alloc [110]
                0.00     0.00  3000/6073               memmove [156]
-----------------------------------------------
                3.12     5.02  31734336/98674308       fmc_write_32 [16]
                6.59    10.59  66939972/98674308       fmc_read_32 [11]
[9]     43.6    9.71    15.61  98674308            _fmc_nwait_idle [9]
               15.61     0.00  98674308/98674308       HAL_GPIO_ReadPin [15]
-----------------------------------------------
Be warned that this call graph is not strictly comparable to the
previous ones: the profiling script had been testing not just multiple
key sizes but also multiple numbers of clients. I dropped the latter
in favor of always trying just what should be the optimal number of
clients (now that theory and reality seem to match there --
deliberately running too many clients is useful as a torture test, but
may not give a very accurate picture of where the time is going for
the case we really care about).
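For anyone who wants to sanity-check the excerpt, the main ratios can be worked out directly from the call graph numbers. This is a minimal sketch in Python, not part of the firmware or the profiling script. The variable names are made up for illustration, and the only inputs are figures copied from the call graph.

```python
# Times (seconds) and call counts copied from the call graph excerpt.
SIGN_TOTAL = 51.25            # pkey_local_sign: self + children
KS_FETCH = 46.04              # hal_ks_fetch, called from pkey_local_sign
SIGN_RSA = 5.21               # pkey_local_sign_rsa, called from pkey_local_sign

UNWRAP_CALLS = 3000           # hal_aes_keyunwrap invocations
DO_BLOCK_FROM_UNWRAP = 5963430
DO_BLOCK_CALLS = 6000552      # all do_block invocations
IO_WRITES = 18001656          # hal_io_write calls made by do_block
IO_READS = 12001104           # hal_io_read calls made by do_block
IO_WAITS = 6000552            # hal_io_wait calls made by do_block

NWAIT_SELF = 9.71             # _fmc_nwait_idle self time
NWAIT_CHILDREN = 15.61        # all of it spent in HAL_GPIO_ReadPin

# Share of pkey_local_sign spent fetching the key vs. doing the RSA work.
fetch_share = 100.0 * KS_FETCH / SIGN_TOTAL
rsa_share = 100.0 * SIGN_RSA / SIGN_TOTAL

# AES block operations per key unwrap, and FMC transactions per block.
blocks_per_unwrap = DO_BLOCK_FROM_UNWRAP / UNWRAP_CALLS
writes_per_block = IO_WRITES / DO_BLOCK_CALLS
reads_per_block = IO_READS / DO_BLOCK_CALLS
waits_per_block = IO_WAITS / DO_BLOCK_CALLS

# Fraction of _fmc_nwait_idle time that is really HAL_GPIO_ReadPin.
gpio_share = 100.0 * NWAIT_CHILDREN / (NWAIT_SELF + NWAIT_CHILDREN)

if __name__ == "__main__":
    print(f"key fetch share of sign: {fetch_share:.1f}%")
    print(f"RSA share of sign:       {rsa_share:.1f}%")
    print(f"do_block per unwrap:     {blocks_per_unwrap:.2f}")
    print(f"per do_block: {writes_per_block:g} writes, "
          f"{reads_per_block:g} reads, {waits_per_block:g} wait")
    print(f"GPIO polling share of _fmc_nwait_idle: {gpio_share:.1f}%")
```

On these numbers, fetching and unwrapping the key accounts for roughly 90% of the time under pkey_local_sign. Each do_block call makes three hal_io_write calls, two hal_io_read calls and one hal_io_wait call.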
Full profiling report available if folks want to look at it. Be
warned that we're already past the point where enabling profiling has
a significant effect on throughput (current penalty is roughly 2x).
I'll send a copy of the proposed change to the FMC code to Paul and
Pavel for review before pushing it to master.