[Cryptech Tech] Happier RSA timing numbers

Joachim Strömbergson joachim.strombergson at assured.se
Wed May 23 08:32:02 UTC 2018


-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA256

Aloha!

To expand on the "parallel cores" point below.

After the modifications I did yesterday, the AES core now performs one
round/cycle. Improving that considerably would mean trying to do more
than one round/cycle. Given the worst delay path in the core, this
would have a bad effect on the achievable clock frequency, which would
quite probably end up below 100 MHz. So improving single AES core
performance a lot must come from increasing the clock frequency it is
running at, not from changes to the design.

Pipelining would allow a single core to process more blocks/cycle. But
since you are using parallel cores, this basically gives the same results.

A question: Are the parallel cores loaded with the same keys? If so,
what happens if you increase the number of AES cores?

Since we are down to one round/cycle, the cycles needed to signal ready,
capture next/init etc are actually responsible for a decent percentage
of the total processing time. I can look at the design and see if there
is another cycle or two that could be shaved off.
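
A rough illustration of why those handshake cycles matter: AES uses 10,
12 or 14 rounds for 128-, 192- and 256-bit keys, so at one round/cycle a
block needs on the order of that many compute cycles. The sketch below
assumes exactly one cycle per round and nothing else in the datapath,
which may not match the real core. The overhead cycle counts are
placeholders rather than measurements, and the helper name is
hypothetical.

```python
# AES round counts per key size (fixed by the AES standard).
AES_ROUNDS = {128: 10, 192: 12, 256: 14}


def overhead_fraction(key_bits, overhead_cycles):
    """Share of per-block cycles spent outside the rounds themselves.

    Assumes one cycle per round; overhead_cycles is a hypothetical
    input covering ready signalling, next/init capture and similar.
    """
    rounds = AES_ROUNDS[key_bits]
    return overhead_cycles / (rounds + overhead_cycles)


if __name__ == "__main__":
    for overhead in (4, 3, 2):  # placeholder values, not measured
        share = overhead_fraction(256, overhead)
        print(f"AES-256, {overhead} overhead cycles: {share:.1%} of total")
```

Under these assumptions each cycle shaved off is worth a few percent of
the per-block time, which is why it is worth a look but not a big win.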

After this the big thing I can do is the streaming interface I've been
talking about.

The API would be something like:

1. Write the number of ECB blocks to be processed with a given key. This
will reset internal counters etc.

2. Write all block words to the same address in sequence.

3. Wait for ready to be set.

4. Read out the resulting block words.
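
The four steps above can be sketched as a small software model. This is
only meant to show the calling sequence. Every name in it (the class,
its methods, the stand-in transform) is hypothetical and does not
reflect the real core's register map, and the transform is deliberately
not AES.

```python
WORDS_PER_BLOCK = 4  # one 128-bit block = four 32-bit words


def placeholder_transform(block):
    """Stand-in for the AES datapath: just inverts each 32-bit word."""
    return [w ^ 0xFFFFFFFF for w in block]


class StreamingCoreModel:
    """Toy model of the proposed streaming interface (not the real core)."""

    def __init__(self, process_block):
        self._process_block = process_block
        self._expected_words = 0
        self._in_words = []
        self._out_words = []
        self.ready = False

    def write_block_count(self, num_blocks):
        # Step 1: announce the number of blocks; resets internal state.
        if num_blocks < 1:
            raise ValueError("need at least one block")
        self._expected_words = num_blocks * WORDS_PER_BLOCK
        self._in_words = []
        self._out_words = []
        self.ready = False

    def write_word(self, word):
        # Step 2: all words go to the same address, in sequence.
        if len(self._in_words) >= self._expected_words:
            raise RuntimeError("more words written than announced")
        self._in_words.append(word & 0xFFFFFFFF)
        if len(self._in_words) == self._expected_words:
            self._process_all()

    def _process_all(self):
        for i in range(0, len(self._in_words), WORDS_PER_BLOCK):
            block = self._in_words[i:i + WORDS_PER_BLOCK]
            self._out_words.extend(self._process_block(block))
        # Simple variant: ready is set only after the last block.
        self.ready = True

    def read_word(self):
        # Step 4: read the result words back out in order.
        if not self.ready:
            raise RuntimeError("read before ready was set")
        return self._out_words.pop(0)


def stream_blocks(core, words):
    """Drive the model through steps 1 to 4 for a whole batch."""
    if len(words) % WORDS_PER_BLOCK != 0:
        raise ValueError("input must be a whole number of blocks")
    core.write_block_count(len(words) // WORDS_PER_BLOCK)
    for word in words:
        core.write_word(word)
    while not core.ready:  # Step 3: a single wait for the whole batch.
        pass
    return [core.read_word() for _ in words]
```

For example, `stream_blocks(StreamingCoreModel(placeholder_transform),
list(range(8)))` pushes two blocks through with a single wait.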


This should improve on the block-by-block write-wait-read loop. It could
be improved even further by setting ready not when all blocks have been
processed, but when the first block has been processed. This, however,
risks starvation and leads to a more complicated solution. I would
suggest we do the simple version first and see if it helps.
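
To put numbers on the current loop, the call counts in the do_block
entry of Rob's profile (quoted below) can be turned into per-block
figures. The counts are copied from that excerpt. The second helper
models the simple streaming variant as one wait per batch of
independent blocks, which is an assumption about a design that does not
exist yet, and both function names are hypothetical.

```python
# Call counts copied from the do_block [6] entry of the quoted profile.
DO_BLOCK_CALLS = 6000552
CALLS_FROM_DO_BLOCK = {
    "hal_io_write": 18001656,
    "hal_io_wait": 6000552,
    "hal_io_read": 12001104,
}


def calls_per_block():
    """Average number of each I/O call made per do_block invocation."""
    return {
        name: count / DO_BLOCK_CALLS
        for name, count in CALLS_FROM_DO_BLOCK.items()
    }


def waits_needed(num_blocks, streaming):
    """Waits for a batch: one per block today, one per batch if streaming.

    The streaming figure is an assumption about the simple variant.
    """
    if num_blocks < 1:
        raise ValueError("need at least one block")
    return 1 if streaming else num_blocks


if __name__ == "__main__":
    for name, per_block in calls_per_block().items():
        print(f"{name}: {per_block:g} calls per block")
```

This only counts calls. It says nothing about how long each wait takes,
so it is an argument for trying the simple version, not a prediction of
the gain.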

What do you think, Rob?

Regards,
JoachimS


Joachim Strömbergson wrote:
> Aloha!
> 
> Just to get some clarification - what was the number of 2048-bit
> sigs/s before the AES updates?
> 
> It is a bit of a shame not to have numbers that can be compared
> directly, to be able to confirm the improvements.
> 
> Since you are running with parallel AES cores, that way of improving
> things is already used. The next thing should be to double sys_clk
> to 100 MHz. That should drop the wait time.
> 
> After that, removing per block wait by implementing streaming/big
> data is probably the next improvement.
> 
> Regards, JoachimS
> 
>> On 23 May 2018, at 07:14, Rob Austein <sra at hactrn.net> wrote:
>> 
>> Summary: 10.3 sig/sec throughput for 2048-bit RSA with eight
>> Modexp cores and four AES cores (Joachim's most recent version). 
>> AES keyunwrap still dominates, but less of it is thumb twiddling
>> waiting for the AES core.
>> 
>> There are some relatively minor improvements we might be able to
>> make to the FMC I/O code (remove some vestigial stuff which dates
>> back to the bridge board and preemptive tasking, move byteswapping
>> to the FPGA where we can do it with wires), but none of it's likely
>> to be radical. Might be worth doing anyway, since the ARM is
>> underpowered and some of the vestigial stuff chews up ARM CPU time
>> to no useful purpose.
>> 
>> The following call graph excerpt was from a test version of the 
>> firmware with the "vestigial stuff" removed from the FMC code
>> (seems to work, at least, it passed all unit tests as well as a
>> signature run with eight clients and validation enabled):
>> 
>> index % time    self  children    called     name
>> -----------------------------------------------
>>                 0.00   51.25    3000/3000        hal_rpc_pkey_sign [4]
>> [5]     88.2    0.00   51.25    3000         pkey_local_sign [5]
>>                 0.00   46.04    3000/3000        hal_ks_fetch [7]
>>                 0.00    5.21    3000/3000        pkey_local_sign_rsa [19]
>>                 0.00    0.00    6000/6806031     memset [44]
>>                 0.00    0.00    3000/33907       hal_critical_section_start [122]
>>                 0.00    0.00    3000/33907       hal_critical_section_end [192]
>> -----------------------------------------------
>>                 0.00    0.28   37122/6000552     hal_aes_keywrap [57]
>>                 0.33   45.48 5963430/6000552     hal_aes_keyunwrap [8]
>> [6]     79.3    0.33   45.76 6000552         do_block [6]
>>                 0.76   17.30 6000552/6003854     hal_io_wait [12]
>>                 1.96   12.44 18001656/19712064     hal_io_write [14]
>>                 1.05   12.25 12001104/21468882     hal_io_read [10]
>> -----------------------------------------------
>>                 0.00   46.04    3000/3000        pkey_local_sign [5]
>> [7]     79.2    0.00   46.04    3000         hal_ks_fetch [7]
>>                 0.00   45.82    3000/3000        hal_aes_keyunwrap [8]
>>                 0.00    0.20    3000/3072        hal_ks_lock [65]
>>                 0.00    0.02    3000/3024        hal_mkm_get_kek [85]
>>                 0.00    0.00    3000/6036        hal_ks_index_find [126]
>>                 0.00    0.00    3000/369530      memcpy [83]
>>                 0.00    0.00    3000/3024        ks_volatile_test_owner [205]
>>                 0.00    0.00    3000/6036        hal_ks_block_read_cached [198]
>>                 0.00    0.00    3000/6084        hal_ks_cache_mark_used [195]
>>                 0.00    0.00    3000/3072        hal_ks_unlock [202]
>> -----------------------------------------------
>>                 0.00   45.82    3000/3000        hal_ks_fetch [7]
>> [8]     78.9    0.00   45.82    3000         hal_aes_keyunwrap [8]
>>                 0.33   45.48 5963430/6000552     do_block [6]
>>                 0.00    0.01    3000/3024        load_kek [98]
>>                 0.00    0.00    3000/9042        hal_core_free [90]
>>                 0.00    0.00    3000/3042        hal_core_alloc [110]
>>                 0.00    0.00    3000/6073        memmove [156]
>> -----------------------------------------------
>>                 3.12    5.02 31734336/98674308     fmc_write_32 [16]
>>                 6.59   10.59 66939972/98674308     fmc_read_32 [11]
>> [9]     43.6    9.71   15.61 98674308         _fmc_nwait_idle [9]
>>                15.61    0.00 98674308/98674308     HAL_GPIO_ReadPin [15]
>> -----------------------------------------------
>> 
>> Be warned that this call graph is not strictly comparable to the 
>> previous ones: the profiling script had been testing not just
>> multiple key sizes but also multiple numbers of clients; I dropped
>> the latter in favor of always trying just what should be the
>> optimal number of clients (now that theory and reality seem to
>> match there -- deliberately running too many clients is useful as a
>> torture test, but may not give a very accurate picture of where the
>> time is going for the case we really care about).
>> 
>> Full profiling report available if folks want to look at it.  Be 
>> warned that we're already past the point where enabling profiling
>> has a significant effect on throughput (current penalty is roughly
>> 2x).
>> 
>> I'll send a copy of the proposed change to the FMC code to Paul
>> and Pavel for review before pushing it to master. 
>> _______________________________________________ Tech mailing list 
>> Tech at cryptech.is https://lists.cryptech.is/listinfo/tech


- -- 
Med vänlig hälsning, Yours

Joachim Strömbergson - Assured AB
========================================================================

-----BEGIN PGP SIGNATURE-----
Comment: GPGTools - http://gpgtools.org
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iQIcBAEBCAAGBQJbBScCAAoJEF3cfFQkIuyNK3EQAJj05CGhOYs6iK2LGsCTFxqo
dTZmxOqTE7UbU9BbrDXFkIwCIoOB1l+3Y6S9o0mTA9PsIZcrQmM8hlLnypqS62/j
bb/k+lYs5FAUjAUfTpfSx2o2L9TdMRyaLjAfQVYmERDXlYna+fc1juqDc6k/ff90
eu5nCZm1BOHOz4R8b4cmEbelYvLN3317qZMWG9dadIc+KhEYgtCN9PIio75vGvF/
Mj6oq1PxXcM/m/omrSfiue5QP9I4VsMCmVX+irAbcZZfzxu7Ae+mCz/yhtaMNx67
Kl+ajxrfy5XpuUkzKN/+rgcBAD5pz9o2ZFySvYuFqXuO9L+f19RejdCJnankGaLI
nczPQeF89wvO1CzaxIpAdME8wcqp+DubOjEzoQ8btf4JBI/I+KE8rG252ZVasLUB
7AY65Krs5fmL9dd8glI1BRq65ylczth/j+pyfdHgkrHtpiqPCyZAa+BVxt0uTEuG
xtcw0026rSxuauRpxSayoP6mzLtdnIH3ZP8cAEtHsz7gGX9pj2wfPiJvew+/txVF
g9CAutA1I2hvhdHfNpmvUHbw0k202U5wubGCmRbLXtjQvaF/yk7dQMeuGlMmvlUA
goEV37YoUk9bynkVCJzP6H9ppyMnnh6LwYrlZPjVtPtjqX1WPMrqWTr3TkA/x0YS
DOAb6DU9kYHb55PBBCn1
=z6Sq
-----END PGP SIGNATURE-----

