[Cryptech Tech] Happier RSA timing numbers

Rob Austein sra at hactrn.net
Wed May 23 05:14:23 UTC 2018


Summary: 10.3 sig/sec throughput for 2048-bit RSA with eight Modexp
	 cores and four AES cores (Joachim's most recent version).
	 AES keyunwrap still dominates, but less of it is thumb
	 twiddling waiting for the AES core.

There are some relatively minor improvements we might be able to make
to the FMC I/O code (remove some vestigial stuff which dates back to
the bridge board and preemptive tasking, move byteswapping to the FPGA
where we can do it with wires), but none of it's likely to be radical.
Might be worth doing anyway, since the ARM is underpowered and some of
the vestigial stuff chews up ARM CPU time to no useful purpose.
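
For anyone not familiar with the byteswapping mentioned above, here is
a minimal sketch of what a per-word 32-bit byte swap costs in software.
It is written in Python for illustration only and is not the firmware's
C code; the name bswap32 is hypothetical.  In the FPGA the same
permutation is just a matter of which wire goes where, so it takes no
CPU time at all.

```python
def bswap32(word: int) -> int:
    """Reverse the byte order of a 32-bit word (hypothetical helper)."""
    word &= 0xFFFFFFFF                      # keep only the low 32 bits
    return (((word & 0x000000FF) << 24) |   # byte 0 -> byte 3
            ((word & 0x0000FF00) << 8) |    # byte 1 -> byte 2
            ((word & 0x00FF0000) >> 8) |    # byte 2 -> byte 1
            ((word & 0xFF000000) >> 24))    # byte 3 -> byte 0


if __name__ == "__main__":
    print(hex(bswap32(0x11223344)))
```

Done in software, this is several masks and shifts on every word that
crosses the FMC bus, which is why it is a candidate for moving out of
the ARM.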

The following call graph excerpt is from a test version of the
firmware with the "vestigial stuff" removed from the FMC code (it
seems to work: it passed all unit tests as well as a signature run
with eight clients and validation enabled):

index % time    self  children    called     name
-----------------------------------------------
                0.00   51.25    3000/3000        hal_rpc_pkey_sign [4]
[5]     88.2    0.00   51.25    3000         pkey_local_sign [5]
                0.00   46.04    3000/3000        hal_ks_fetch [7]
                0.00    5.21    3000/3000        pkey_local_sign_rsa [19]
                0.00    0.00    6000/6806031     memset [44]
                0.00    0.00    3000/33907       hal_critical_section_start [122]
                0.00    0.00    3000/33907       hal_critical_section_end [192]
-----------------------------------------------
                0.00    0.28   37122/6000552     hal_aes_keywrap [57]
                0.33   45.48 5963430/6000552     hal_aes_keyunwrap [8]
[6]     79.3    0.33   45.76 6000552         do_block [6]
                0.76   17.30 6000552/6003854     hal_io_wait [12]
                1.96   12.44 18001656/19712064     hal_io_write [14]
                1.05   12.25 12001104/21468882     hal_io_read [10]
-----------------------------------------------
                0.00   46.04    3000/3000        pkey_local_sign [5]
[7]     79.2    0.00   46.04    3000         hal_ks_fetch [7]
                0.00   45.82    3000/3000        hal_aes_keyunwrap [8]
                0.00    0.20    3000/3072        hal_ks_lock [65]
                0.00    0.02    3000/3024        hal_mkm_get_kek [85]
                0.00    0.00    3000/6036        hal_ks_index_find [126]
                0.00    0.00    3000/369530      memcpy [83]
                0.00    0.00    3000/3024        ks_volatile_test_owner [205]
                0.00    0.00    3000/6036        hal_ks_block_read_cached [198]
                0.00    0.00    3000/6084        hal_ks_cache_mark_used [195]
                0.00    0.00    3000/3072        hal_ks_unlock [202]
-----------------------------------------------
                0.00   45.82    3000/3000        hal_ks_fetch [7]
[8]     78.9    0.00   45.82    3000         hal_aes_keyunwrap [8]
                0.33   45.48 5963430/6000552     do_block [6]
                0.00    0.01    3000/3024        load_kek [98]
                0.00    0.00    3000/9042        hal_core_free [90]
                0.00    0.00    3000/3042        hal_core_alloc [110]
                0.00    0.00    3000/6073        memmove [156]
-----------------------------------------------
                3.12    5.02 31734336/98674308     fmc_write_32 [16]
                6.59   10.59 66939972/98674308     fmc_read_32 [11]
[9]     43.6    9.71   15.61 98674308         _fmc_nwait_idle [9]
               15.61    0.00 98674308/98674308     HAL_GPIO_ReadPin [15]
-----------------------------------------------
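
To make the excerpt easier to read, here is a small sketch that turns
a few of the figures above into ratios.  The numbers are copied by hand
from the call graph; the variable names are mine, and the times are
profiled seconds, so they include the profiling overhead.

```python
# Figures copied by hand from the gprof excerpt above.
# Times are profiled seconds (self + children); counts are calls.
SIGNATURES = 3000              # calls to pkey_local_sign [5]

sign_time = 0.00 + 51.25       # pkey_local_sign [5]
fetch_time = 0.00 + 46.04      # hal_ks_fetch [7], called from [5]
rsa_time = 0.00 + 5.21         # pkey_local_sign_rsa [19], called from [5]

unwrap_blocks = 5963430        # do_block calls made by hal_aes_keyunwrap
block_calls = 6000552          # all do_block [6] calls
block_time = 0.33 + 45.76      # do_block [6]
wait_time = 0.76 + 17.30       # hal_io_wait, on behalf of do_block
write_calls = 18001656         # hal_io_write calls made by do_block
read_calls = 12001104          # hal_io_read calls made by do_block


def share(part: float, whole: float) -> float:
    """Return part as a percentage of whole."""
    return 100.0 * part / whole


fetch_share = share(fetch_time, sign_time)
rsa_share = share(rsa_time, sign_time)
wait_share = share(wait_time, block_time)
blocks_per_unwrap = unwrap_blocks / SIGNATURES
writes_per_block = write_calls / block_calls
reads_per_block = read_calls / block_calls

if __name__ == "__main__":
    print(f"key fetch share of pkey_local_sign: {fetch_share:.1f}%")
    print(f"RSA share of pkey_local_sign:       {rsa_share:.1f}%")
    print(f"hal_io_wait share of do_block:      {wait_share:.1f}%")
    print(f"do_block calls per keyunwrap (avg): {blocks_per_unwrap:.1f}")
    print(f"hal_io_write calls per do_block:    {writes_per_block:.1f}")
    print(f"hal_io_read calls per do_block:     {reads_per_block:.1f}")
```

Read this way, key fetch (almost entirely AES keyunwrap) accounts for
roughly 90% of the time under pkey_local_sign and the RSA operation
for roughly 10%.  Each keyunwrap averages a little under 2000 do_block
calls, and each do_block issues three hal_io_write calls and two
hal_io_read calls in addition to the wait.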

Be warned that this call graph is not strictly comparable to the
previous ones: the profiling script had been testing not just multiple
key sizes but also multiple numbers of clients.  I dropped the latter
in favor of always trying just what should be the optimal number of
clients, now that theory and reality seem to match there.
Deliberately running too many clients is useful as a torture test, but
may not give a very accurate picture of where the time is going for
the case we really care about.

Full profiling report available if folks want to look at it.  Be
warned that we're already past the point where enabling profiling has
a significant effect on throughput (current penalty is roughly 2x).

I'll send a copy of the proposed change to the FMC code to Paul and
Pavel for review before pushing it to master.

