[Cryptech Core] Discussion: Status for performance improvements and benchmarking

Joachim Strömbergson joachim at assured.se
Tue Oct 23 11:50:15 UTC 2018

Previous message (by thread): [Cryptech Core] Updates after audit
Next message (by thread): [Cryptech Core] Proposal for new FPGA architecture
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

Aloha!

At the last f2f we discussed possible improvements and optimizations
that we can do in order to increase the performance of the Cryptech HSM
on the current Alpha board [0].

However, at least I don't have a firm grip on where we stand in terms of
knowing what performance we actually have today, and where our hot-spots
really are.

Looking back at the work we have done, we have optimized the AES core
and the keywrapping operation (mainly removing excessive transfers
between the MCU and the FPGA), and made the FPGA run on the FMC clock,
thus removing latency due to clock domain crossing. We have also worked
on running the FPGA with a 90 MHz clock. This has not yet been reached,
but is something we should complete.


Since we have worked on optimizing away cycles in the MCU-FPGA access I
decided to look at what it takes to write an array of words from the MCU
memory to the FPGA.

The function hal_io_write() accepts an array to be written to an address
range in the FPGA. Internally the function performs one call to
fmc_write_32() per array element. For each call, fmc_write_32() performs
a complete address calculation including bitmasking. The word is then
written to the calculated address. And then there is a call to
_fmc_nwait_idle(). _fmc_nwait_idle() in turn calls HAL_GPIO_ReadPin() to
actually read a given pin.

All in all, for each 32-bit word in the array there are (at least) three
calls and a complete address calculation. And back again up the call
stack. I haven't looked at the ASM generated for this sequence. But I
wouldn't be surprised if there are tens of instructions for each word.
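
To make the per-word cost concrete, here is a minimal host-side model of
the call chain described above. It is a sketch, not the real libhal or
FMC driver code: every sim_* name, the address mask and the array that
stands in for the FPGA are assumptions made for illustration. It only
counts calls, not instructions or cycles.

```c
#include <stddef.h>
#include <stdint.h>

/* Host-side model only: a plain array stands in for the FPGA address space. */
#define SIM_FPGA_WORDS 64u
#define SIM_ADDR_MASK  0x000000FFu  /* hypothetical mask, not the real FMC one */

static uint32_t sim_fpga[SIM_FPGA_WORDS];
static unsigned sim_call_count;     /* calls made below sim_hal_io_write() */

/* Stand-in for HAL_GPIO_ReadPin(): the modeled pin always reports idle. */
static int sim_gpio_read_pin(void)
{
    sim_call_count++;
    return 1;
}

/* Stand-in for _fmc_nwait_idle(): poll the pin until it reports idle. */
static void sim_fmc_nwait_idle(void)
{
    sim_call_count++;
    while (!sim_gpio_read_pin()) {
        /* busy-wait */
    }
}

/* Stand-in for fmc_write_32(): full address calculation, write, wait. */
static void sim_fmc_write_32(uint32_t addr, uint32_t data)
{
    sim_call_count++;
    uint32_t index = (addr & SIM_ADDR_MASK) / 4u;  /* byte address -> word index */
    sim_fpga[index] = data;
    sim_fmc_nwait_idle();
}

/* Stand-in for hal_io_write(): one sim_fmc_write_32() call per word. */
static void sim_hal_io_write(uint32_t addr, const uint32_t *buf, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        sim_fmc_write_32(addr + 4u * (uint32_t)i, buf[i]);
    }
}
```

In this model, writing len words costs 3 * len calls below the top-level
function, plus one mask-and-divide address calculation per word.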

I don't know if word reads and writes are in any way dominating the time
for performing, for example, a signing operation. But if they are, there
seems to be some low hanging fruit to pick here. (Read is a little bit
more complex due to a workaround for an FMC bug.)


But where are our bottlenecks really? Which are the dominating parts of
a signing operation (for example)? What do we need to do to really find
out where we should focus? Let us please discuss this.
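
One possible way to start answering this is to accumulate elapsed ticks
per phase of an operation and look at the shares. The sketch below is
only an illustration: the phase names are hypothetical, and the tick
source is left to the caller (on the MCU a cycle counter could be used,
which is an assumption and not something we have in place today).

```c
#include <stdint.h>

/* Hypothetical phases of one operation; the split is an assumption. */
enum prof_phase {
    PROF_IO_WRITE,    /* MCU -> FPGA word writes */
    PROF_IO_READ,     /* FPGA -> MCU word reads */
    PROF_CORE_WAIT,   /* waiting for an FPGA core to finish */
    PROF_MCU_OTHER,   /* everything else on the MCU */
    PROF_PHASE_COUNT
};

static uint64_t prof_ticks[PROF_PHASE_COUNT];

/* Clear all accumulated tick counts. */
static void prof_reset(void)
{
    for (int i = 0; i < PROF_PHASE_COUNT; i++) {
        prof_ticks[i] = 0;
    }
}

/* Add one measured interval (caller-supplied tick values) to a phase. */
static void prof_add(enum prof_phase phase, uint64_t start, uint64_t end)
{
    prof_ticks[phase] += end - start;
}

/* Share of the total accumulated time spent in a phase, in whole percent. */
static unsigned prof_share_percent(enum prof_phase phase)
{
    uint64_t total = 0;
    for (int i = 0; i < PROF_PHASE_COUNT; i++) {
        total += prof_ticks[i];
    }
    if (total == 0) {
        return 0;
    }
    return (unsigned)((100u * prof_ticks[phase]) / total);
}
```

Wrapping the word read/write calls and the core wait loops with
prof_add() during a signing operation would show whether MCU-FPGA
transfers dominate or not.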


When we know where we should focus we can start improving things again.
For MCU - FPGA communication we could for example use the DMA in the MCU
and provide write buffers in the FPGA to ensure that writes of blocks up
to N words will always work. This would remove a lot of the calls and
the per-word address calculations performed.
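
As a rough host-side sketch of what a block write could look like from
the software side: the address is calculated once, the words are moved
in one burst, and the idle check is done once per block. The memcpy()
only stands in for a DMA transfer, and the buffer size N, the mask and
all sim_* names are assumptions for illustration, not an existing API.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SIM_WBUF_WORDS 16u          /* hypothetical N: FPGA write buffer depth */
#define SIM_FPGA_WORDS 64u          /* size of the modeled FPGA address space */
#define SIM_ADDR_MASK  0x000000FFu  /* hypothetical mask, not the real FMC one */

static uint32_t sim_fpga[SIM_FPGA_WORDS];
static unsigned sim_idle_checks;    /* number of idle polls performed */

/* Stand-in for reading the wait pin: the modeled FPGA is always idle. */
static int sim_fpga_idle(void)
{
    sim_idle_checks++;
    return 1;
}

/*
 * Write a block of up to SIM_WBUF_WORDS words.
 * Returns 0 on success, -1 if the block does not fit in the write
 * buffer or would run past the modeled address space.
 */
static int sim_block_write(uint32_t addr, const uint32_t *buf, size_t len)
{
    if (len > SIM_WBUF_WORDS) {
        return -1;
    }

    size_t index = (addr & SIM_ADDR_MASK) / 4u;  /* address calculated once */
    if (index + len > SIM_FPGA_WORDS) {
        return -1;
    }

    /* Stands in for a single DMA burst into the FPGA write buffer. */
    memcpy(&sim_fpga[index], buf, len * sizeof buf[0]);

    /* One wait per block instead of one per word. */
    while (!sim_fpga_idle()) {
        /* busy-wait */
    }
    return 0;
}
```

Compared to the per-word path, a block of len words here costs one
address calculation and one idle check, independent of len.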


[0] We also discussed possible minor and major board changes that could
improve the performance. For example, using FPGAs of the same type and
model but with a better speed grade to allow higher core clock speeds.
Since the core is clocked in sync with the FMC clock, this might however
only ensure that we more easily meet the clock speed target when
instantiating more cores to increase parallel operations.
-- 
Med vänlig hälsning, Yours

Joachim Strömbergson
========================================================================
                               Assured AB
========================================================================
