[Cryptech Tech] Alpha Platform Upgrade
Pavel Shatov
meisterpaul1 at yandex.ru
Wed Feb 12 12:04:30 UTC 2020
Hi,
Yesterday I pushed a number of commits spread across several
repositories. The changes include the following pieces:
1. FMC bus arbiter was upgraded. Code from the `fmc_clk' branch was
merged into master. This removes the old clock domain crossing logic and
makes the I/O bus synchronous. Due to a limitation in the STM32
processor the I/O can't run at the same 90 MHz clock as the cores, so
the I/O runs at 90 / 2 = 45 MHz. Strictly speaking, the new FMC arbiter
still does 45/90 MHz clock domain crossing, but since the two clocks are
synchronous, it's much simpler and has lower latency. Currently the
total FMC bus transaction duration is 9 clock cycles (1 cycle
synchronization stage + 4 cycles FMC controller latency + 3 cycles core
selector latency + 1 cycle clock domain crossing logic), so the
throughput is 32 bits * 45 MHz / 9 = 160 megabits/second.
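The cycle budget and throughput figure above can be re-derived with a few lines of Python. This is only a sanity-check sketch: the dictionary keys and function name are illustrative, and the cycles are assumed to be counted at the 45 MHz I/O clock, as the throughput formula implies.

```python
# Latency components of one FMC bus transaction, in I/O clock cycles.
# The numbers come from the message; the key names are illustrative only.
FMC_LATENCY_CYCLES = {
    "synchronization_stage": 1,
    "fmc_controller": 4,
    "core_selector": 3,
    "clock_domain_crossing": 1,
}


def fmc_throughput_mbps(bus_width_bits=32, io_clock_mhz=45.0, cycles=None):
    """Return throughput in megabits/second for one word per transaction."""
    if cycles is None:
        cycles = sum(FMC_LATENCY_CYCLES.values())
    return bus_width_bits * io_clock_mhz / cycles


total_cycles = sum(FMC_LATENCY_CYCLES.values())
throughput = fmc_throughput_mbps()
# Duration of one transaction in nanoseconds at the 45 MHz I/O clock.
transaction_ns = total_cycles * 1000.0 / 45.0

print(f"{total_cycles} cycles, {transaction_ns:.0f} ns, {throughput:.0f} Mbit/s")
```

Changing any entry in the dictionary shows directly how much each stage costs in throughput.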
2. Alpha platform modules were upgraded. The cores are now clocked by a
90 MHz clock instead of 60 MHz. The new clock manager module also has an
extra dedicated high-speed output port for cores that support higher
clock speeds. Currently this frequency is 45 * 4 = 180 MHz and is only
used by the new ModExpNG core (see below).
3. Core selector was upgraded. As we were adding new cores to the design
and trying to increase their clock speed, timing problems started to
arise when building the bitstream. The two primary reasons for this are
the global asynchronous reset net and the high fanout of the readback
data multiplexor. The first problem was solved by adding a special
parametrized module that replicates the reset signal (see the commit
message for a detailed description of how the module works). The second
problem was alleviated by relaxing the setup constraint for the
selector's output multiplexor. In short, since the I/O runs at only 45
MHz, it doesn't make sense to select the output data value at 90 MHz.
Multi-cycle constraints were introduced that give data two 90 MHz clock
cycles instead of one to propagate through the multiplexor. Again, the
corresponding commit message explains how exactly this works.
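To put numbers on the relaxed constraint, the sketch below compares the single-cycle and two-cycle timing budgets. It only illustrates the arithmetic. The actual constraints live in the build files and are described in the commit message, and the helper name here is made up for the example.

```python
from fractions import Fraction


def period_ns(freq_mhz):
    """Clock period in nanoseconds, kept exact with Fraction."""
    return Fraction(1000, freq_mhz)


core_period = period_ns(90)   # one 90 MHz core clock cycle
io_period = period_ns(45)     # one 45 MHz I/O clock cycle

single_cycle_budget = core_period      # default setup requirement
multicycle_budget = 2 * core_period    # relaxed two-cycle requirement

print(f"single-cycle budget: {float(single_cycle_budget):.2f} ns")
print(f"two-cycle budget:    {float(multicycle_budget):.2f} ns")
print(f"I/O clock period:    {float(io_period):.2f} ns")
```

The two-cycle budget equals exactly one I/O clock period, which is why giving the multiplexor path the extra cycle costs nothing on the 45 MHz side.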
4. ModExpNG core was upgraded. The core now supports a high-speed clock
of up to 180 MHz. The latest performance measurements are as follows:
Exponentiation time in milliseconds:
               w/ CRT   |  non-CRT
              ---------+----------
1024-bit key:  1.37 ms |   8.28 ms
2048-bit key:  8.46 ms |  61.10 ms
4096-bit key: 61.72 ms | 475.08 ms
Speed in exponentiations per second:
                w/ CRT    |   non-CRT
              ------------+------------
1024-bit key: 731.4 exp/s | 120.7 exp/s
2048-bit key: 118.2 exp/s |  16.4 exp/s
4096-bit key:  16.2 exp/s |   2.1 exp/s
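The two tables are related by rate = 1000 / time_ms, and dividing the non-CRT time by the CRT time gives the speedup from using CRT. A small sketch of that conversion follows. The function and variable names are illustrative. Rates recomputed from the rounded millisecond values can differ slightly from the reported exp/s figures, which were presumably derived from unrounded measurements.

```python
# Exponentiation times in milliseconds, copied from the first table.
EXP_TIME_MS = {
    1024: {"crt": 1.37, "non_crt": 8.28},
    2048: {"crt": 8.46, "non_crt": 61.10},
    4096: {"crt": 61.72, "non_crt": 475.08},
}


def rate_per_second(time_ms):
    """Convert a per-operation time in ms to operations per second."""
    return 1000.0 / time_ms


def crt_speedup(key_bits):
    """How many times faster the CRT mode is for a given key size."""
    t = EXP_TIME_MS[key_bits]
    return t["non_crt"] / t["crt"]


for bits, t in EXP_TIME_MS.items():
    print(
        f"{bits}-bit key: "
        f"{rate_per_second(t['crt']):.1f} exp/s w/ CRT, "
        f"{rate_per_second(t['non_crt']):.1f} exp/s non-CRT, "
        f"CRT speedup x{crt_speedup(bits):.1f}"
    )
```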
The next step would be to upgrade the HAL layer of the STM32 firmware.
The crucial change is that the newer ModExp core has built-in support
for blinding and also does the final part of the Chinese Remainder
Theorem based signing ("Garner's formula") itself, so the STM32 now has
no need to do any modular math operations at all when signing. I'm not
very familiar with how the HAL layer works, so I'd like to hand this over
to someone with a better understanding to make the changes.
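For whoever picks up the HAL work, here is a minimal Python sketch of the math that moves off the STM32: textbook RSA signing with CRT, Garner's recombination, and base blinding. It is not the ModExpNG algorithm or its register interface. The function names, the toy key, and the way the blinding factor is chosen are all assumptions made for illustration, and message padding is left out entirely.

```python
import math
import secrets


def rsa_sign_crt(m, p, q, d):
    """Textbook RSA-CRT signing with Garner's recombination."""
    dp = d % (p - 1)           # exponent for the mod-p half
    dq = d % (q - 1)           # exponent for the mod-q half
    q_inv = pow(q, -1, p)      # q^-1 mod p (needs Python 3.8+)
    sp = pow(m % p, dp, p)     # half-size exponentiation mod p
    sq = pow(m % q, dq, q)     # half-size exponentiation mod q
    h = (q_inv * (sp - sq)) % p
    return sq + h * q          # Garner's formula: s = sq + q * h


def rsa_sign_blinded(m, p, q, d, e):
    """Same signature, computed on a blinded message (textbook blinding)."""
    n = p * q
    while True:
        r = secrets.randbelow(n - 2) + 2   # random r in [2, n - 1]
        if math.gcd(r, n) == 1:
            break
    blinded = (m * pow(r, e, n)) % n       # m * r^e mod n
    s_blinded = rsa_sign_crt(blinded, p, q, d)
    return (s_blinded * pow(r, -1, n)) % n  # strip the factor r


# Toy key for demonstration only; real keys are 1024 bits and up.
p, q, e, d = 61, 53, 17, 2753
n = p * q
message = 65

signature = rsa_sign_crt(message, p, q, d)
blinded_signature = rsa_sign_blinded(message, p, q, d, e)
print(signature == pow(message, d, n), pow(signature, e, n) == message)
```

With the new core, all of the steps above would happen in the FPGA, so the firmware side reduces to loading operands and reading back the result.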
--
With best regards,
Pavel Shatov