[Cryptech Tech] Alpha Platform Upgrade

Pavel Shatov meisterpaul1 at yandex.ru
Wed Feb 12 12:04:30 UTC 2020


Hi,

Yesterday I pushed a number of commits spread across several
repositories. The changes include the following:

1. The FMC bus arbiter was upgraded. Code from the `fmc_clk' branch was
merged into master; this removes the old clock domain crossing logic and
makes the I/O bus synchronous. Due to a limitation in the STM32
processor, the I/O can't run at the same 90 MHz clock as the cores, so
it runs at 90 / 2 = 45 MHz. Strictly speaking, the new FMC arbiter
still does 45/90 MHz clock domain crossing, but since the two clocks are
synchronous, it's much simpler and has lower latency. Currently a
complete FMC bus transaction takes 9 clock cycles (1 cycle for the
synchronization stage + 4 cycles of FMC controller latency + 3 cycles of
core selector latency + 1 cycle for the clock domain crossing logic), so
the throughput is 32 bits * 45 MHz / 9 = 160 megabits/second.
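
As a rough illustration, the latency and throughput arithmetic above can
be reproduced with a few lines of Python. The numbers come from this
message; all names are made up for the example, and the cycle counts are
assumed to be in 45 MHz I/O clock cycles, since that is what the formula
above divides by.

```python
# Figures taken from the message; all identifiers are illustrative only.
BUS_WIDTH_BITS = 32
IO_CLOCK_MHZ = 45

# Per-stage latency of one FMC transaction, in clock cycles.
STAGE_CYCLES = {
    "synchronization stage": 1,
    "FMC controller": 4,
    "core selector": 3,
    "clock domain crossing": 1,
}


def transaction_cycles(stages):
    """Total number of cycles one bus transaction takes."""
    return sum(stages.values())


def throughput_mbit_per_s(bus_width_bits, clock_mhz, cycles):
    """One bus word is transferred every `cycles` clock cycles."""
    return bus_width_bits * clock_mhz / cycles


total_cycles = transaction_cycles(STAGE_CYCLES)
throughput = throughput_mbit_per_s(BUS_WIDTH_BITS, IO_CLOCK_MHZ, total_cycles)
print(f"{total_cycles} cycles per transaction, {throughput:.0f} Mbit/s")
```

Changing any stage latency in the table shows directly how much
throughput a shorter pipeline would buy.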

2. The Alpha platform modules were upgraded. The cores are now clocked
at 90 MHz instead of 60 MHz. The new clock manager module also has an
extra dedicated high-speed output port for cores that support higher
clock speeds. Currently this frequency is 45 * 4 = 180 MHz and is only
used by the new ModExpNG core (see below).

3. The core selector was upgraded. As we were adding new cores to the
design and trying to increase their clock speed, timing problems started
to arise when building the bitstream. The two primary reasons for this
are the global asynchronous reset net and the high fanout of the
readback data multiplexer. The first problem was solved by adding a
special parameterized module that replicates the reset signal (see the
commit message for a detailed description of how the module works). The
second problem was alleviated by relaxing the setup constraint for the
selector's output multiplexer. In short, since the I/O runs at only 45
MHz, it doesn't make sense to select the output data value at 90 MHz.
Multi-cycle constraints were introduced that give data two 90 MHz clock
cycles instead of one to propagate through the multiplexer. Again, the
corresponding commit message explains exactly how this works.
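
To put rough numbers on the relaxed constraint, here is a small Python
sketch of the setup budget with and without the multi-cycle path. It is
only a back-of-the-envelope model: it ignores clock uncertainty, skew
and hold requirements, and the function names are invented for this
example, they are not part of any tool.

```python
# Clock frequencies from the message; everything else is illustrative.
CORE_CLOCK_MHZ = 90
IO_CLOCK_MHZ = 45


def period_ns(freq_mhz):
    """Clock period in nanoseconds for a frequency given in MHz."""
    return 1000.0 / freq_mhz


def setup_budget_ns(freq_mhz, multicycle=1):
    """Idealized time available for data to propagate between registers.

    With a multi-cycle constraint of N, the path gets N clock periods
    instead of one (uncertainty and skew are ignored here).
    """
    return multicycle * period_ns(freq_mhz)


single_cycle = setup_budget_ns(CORE_CLOCK_MHZ)
two_cycles = setup_budget_ns(CORE_CLOCK_MHZ, multicycle=2)

print(f"default budget at 90 MHz: {single_cycle:.2f} ns")
print(f"with 2-cycle constraint:  {two_cycles:.2f} ns")
print(f"one 45 MHz I/O period:    {period_ns(IO_CLOCK_MHZ):.2f} ns")
```

The two-cycle budget equals one 45 MHz I/O period, which matches the
reasoning above: the readback data only has to be stable once per I/O
clock cycle.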

4. The ModExpNG core was upgraded. The core now supports a high-speed
clock of up to 180 MHz. The latest performance measurements are as
follows:

Exponentiation time in milliseconds:

                 w/ CRT  |  non-CRT
               ----------+----------
1024-bit key:   1.37 ms |   8.28 ms
2048-bit key:   8.46 ms |  61.10 ms
4096-bit key:  61.72 ms | 475.08 ms

Speed in exponentiations per second:

                   w/ CRT   |   non-CRT
               -------------+-------------
1024-bit key:  731.4 exp/s | 120.7 exp/s
2048-bit key:  118.2 exp/s |  16.4 exp/s
4096-bit key:   16.2 exp/s |   2.1 exp/s
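
For reference, the relation between the two tables and the speedup from
CRT can be checked with a short Python snippet. The times are copied
from the first table; the names are invented for this example. Rates
recomputed from the rounded times can differ slightly from the second
table, presumably because that table was derived from unrounded
measurements (this is an assumption).

```python
# Exponentiation times in milliseconds, copied from the first table.
TIMES_MS = {
    1024: {"crt": 1.37, "non_crt": 8.28},
    2048: {"crt": 8.46, "non_crt": 61.10},
    4096: {"crt": 61.72, "non_crt": 475.08},
}


def exps_per_second(time_ms):
    """Convert a per-operation time in ms into operations per second."""
    return 1000.0 / time_ms


def crt_speedup(row):
    """How many times faster the CRT mode is than the non-CRT mode."""
    return row["non_crt"] / row["crt"]


for bits, row in TIMES_MS.items():
    print(
        f"{bits}-bit key: "
        f"{exps_per_second(row['crt']):6.1f} exp/s w/ CRT, "
        f"{exps_per_second(row['non_crt']):6.1f} exp/s non-CRT, "
        f"CRT speedup x{crt_speedup(row):.2f}"
    )
```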


The next step would be to upgrade the HAL layer of the STM32 firmware.
The crucial change is that the newer ModExp core has built-in support
for blinding and also does the final part of the Chinese Remainder
Theorem based signing ("Garner's formula") itself, so the STM32 no
longer needs to do any modular math at all when signing. I'm not very
familiar with how the HAL layer works, so I'd rather pass this to
someone who understands it better to make the changes.
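
For readers less familiar with the recombination step, here is a
textbook Python sketch of CRT-based RSA signing with Garner's formula.
It only illustrates the math that no longer has to run on the STM32; it
is not the core's implementation or the HAL API, it omits blinding, and
the function name and toy key are made up for the example (requires
Python 3.8+ for the modular inverse via pow()).

```python
def rsa_sign_crt(m, p, q, d):
    """Compute m^d mod (p*q) using CRT and Garner's formula (textbook form)."""
    dp = d % (p - 1)          # exponent for the "p" half
    dq = d % (q - 1)          # exponent for the "q" half
    q_inv = pow(q, -1, p)     # q^-1 mod p, precomputed in a real key

    sp = pow(m, dp, p)        # two half-size exponentiations
    sq = pow(m, dq, q)

    # Garner's formula: recombine the two halves into the full result.
    h = (q_inv * (sp - sq)) % p
    return sq + h * q


# Toy parameters, far too small for real use.
p, q = 61, 53
n = p * q
e = 17
d = pow(e, -1, (p - 1) * (q - 1))
m = 65

s = rsa_sign_crt(m, p, q, d)
assert s == pow(m, d, n)      # matches the plain (non-CRT) exponentiation
assert pow(s, e, n) == m      # and verifies with the public exponent
print("CRT signature matches non-CRT result:", s == pow(m, d, n))
```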


-- 
With best regards,
Pavel Shatov

