[Cryptech Tech] ModExpNG Release

Pavel Shatov meisterpaul1 at yandex.ru
Thu Oct 24 11:28:10 UTC 2019


Hi,

I've pushed the ModExpNG ("next-generation") IP core, which I've been 
working on since December, to the Git repository under 
/user/shatov/modexpng. Honestly, this has taken substantially longer 
than I initially anticipated, for primarily two reasons. The first one 
is already ten months old and is now learning how to stand on his feet 
without support. The second one is the interruption of funding for the 
project.

The new core is mostly compatible with the ModExpA7 core we currently 
use, but there are a few differences:

1) The top-level wrapper module has an extra port called 'clk_core' 
that clocks the DSP slices inside the core. This allows the math logic 
to be clocked at a frequency different from the system (FMC bus) clock.

We currently use the 180/3 = 60 MHz clock for both I/O and cores. In an 
effort to increase signing performance, we decided some time ago to go 
for 180/4 = 45 MHz I/O and 45*2 = 90 MHz cores, but never actually made 
the change; the modified version of the FMC arbiter has been sitting in 
my Git stash. With this new port we will be able to go for a different 
scheme: 45 MHz I/O, 45*2 = 90 MHz for all the other cores, and 45*N MHz 
for ModExpNG. The upper limit on N seems to be 8 according to the 
datasheet, i.e. 45*8 = 360 MHz, but this has not been tested yet.

2) The core needs the same set of extra pre-computed numbers as 
ModExpA7, but it can no longer pre-compute them by itself. The 
pre-computation algorithm is totally different from the exponentiation 
procedure, so it made sense to leave it out and concentrate on 
optimizing just the exponentiation. I believe the pre-computation has to 
be done in the FPGA, because it involves the smaller secret moduli P and 
Q and so must at the very least be constant-time. We can borrow the 
pre-computation modules from ModExpA7 (which was already written to 
operate in constant time) and turn them into "helper" math modules; that 
basically amounts to writing two wrappers. Another option, since there 
have been talks about using a RISC-V core, is to do the pre-computations 
on a soft processor.
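
For illustration only, here is roughly the kind of modulus-dependent 
constants a Montgomery-style exponentiator pre-computes. This is plain 
Python, not constant-time, and the names and word size are just my 
assumptions for the example; the exact set of numbers ModExpNG expects 
is not restated in this message:

    # Sketch of typical Montgomery pre-computed constants. NOT
    # constant-time -- the real pre-computation must be, since it touches
    # the secret moduli P and Q. Requires Python 3.8+ for pow(x, -1, m).
    def montgomery_constants(n, word_bits=32):
        """Return (r2, n_prime) for an odd modulus n."""
        k = ((n.bit_length() + word_bits - 1) // word_bits) * word_bits
        r = 1 << k                       # R = 2^k, size rounded up to words
        r2 = (r * r) % n                 # "Montgomery factor" R^2 mod n
        n_prime = (-pow(n, -1, r)) % r   # n' such that n * n' == -1 (mod R)
        return r2, n_prime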

3) The core is hardwired to always do blinding. There are several 
reasons for this. First, Rob and I did some research, and the consensus 
seems to be something like "there is no solid proof that constant-time 
modular exponentiation, even when hardened against power analysis and 
other potential attacks, is safe without blinding, so we'd better leave 
blinding enabled just in case". Second, the overhead from blinding is 
too small (<1%) to justify the extra complexity of a dedicated control 
bit and all the logic that would be needed to make the internal 
operations conditional. If one still absolutely doesn't want to do 
blinding for some reason, the pair (1, 1) can be used as the blinding 
factors.
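
To show what that (1, 1) pair corresponds to, here is a minimal sketch 
of classic RSA base blinding in plain Python (3.8+ for the modular 
inverse via pow). This is just the math, not the core's register 
interface, and passing (1, 1) obviously turns both the blinding and the 
unblinding step into no-ops:

    import math, secrets

    def make_blinding_pair(e, n):
        # Pick a random r coprime to n; the pair is (r^e mod n, r^-1 mod n).
        while True:
            r = secrets.randbelow(n - 2) + 2
            if math.gcd(r, n) == 1:
                return pow(r, e, n), pow(r, -1, n)

    def blinded_sign(m, d, e, n):
        b, b_inv = make_blinding_pair(e, n)
        m_blind = (m * b) % n            # blind: m * r^e
        s_blind = pow(m_blind, d, n)     # exponentiate the blinded value
        return (s_blind * b_inv) % n     # (m * r^e)^d = m^d * r, so strip the r

    # Sanity check with textbook toy numbers:
    p, q, e = 61, 53, 17
    n, d = p * q, pow(e, -1, (p - 1) * (q - 1))
    assert blinded_sign(65, d, e, n) == pow(65, d, n)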

I've also written some demo C code for the STM32 to demonstrate how to 
use the new core. The code uses randomized test vectors generated by the 
Python math model in /user/shatov/modexpng_fpga_model.
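
For what it's worth, a comparable vector can also be generated by hand, 
independently of the model; the snippet below is not the model's actual 
interface and assumes the pycryptodome package for key generation:

    import secrets
    from Crypto.PublicKey import RSA     # pycryptodome

    key = RSA.generate(2048)
    m = secrets.randbelow(key.n)         # random message representative
    s = pow(m, key.d, key.n)             # reference signature, straight modexp

    vector = {"n": key.n, "e": key.e, "d": key.d,
              "p": key.p, "q": key.q, "m": m, "s": s}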

I also measured the actual core performance in hardware. The Alpha 
board has a set of spare "IRQ" lines going from the STM32 to the FPGA, 
so I temporarily modified the demo driver to initially set the IRQ_0 pin 
low, drive it high just before pulsing the "next" control bit, and drive 
it low again after the "valid" bit goes high. On the FPGA side I added a 
counter that counts the number of ticks the IRQ_0 line stays high. Okay, 
yes, maybe it would have been easier to just use a scope and blink an 
LED, but I was too lazy to dust it off, sorry.
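
Converting the counter reading into the times below is trivial; I'm 
assuming here that the counter runs off the 60 MHz system clock:

    CLK_HZ = 60_000_000                  # current FPGA system clock

    def ticks_to_ms(ticks):
        return ticks * 1000.0 / CLK_HZ

    # e.g. roughly 1.5 million ticks correspond to the ~25 ms measured
    # below for a 2048-bit exponentiation with CRT
    print(ticks_to_ms(1_500_000))        # 25.0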

Below are the times it takes to do an exponentiation. Of course the 
actual signing speed will be somewhat lower, because other steps, such 
as hashing the message, writing the hash into the core, etc., are not 
taken into account:
                w/ CRT | no CRT
               --------+--------
1024-bit key:  3.9 ms |   24 ms
2048-bit key:   25 ms |  182 ms
4096-bit key:  184 ms | 1424 ms

The same thing in exponentiations per second:

                   w/ CRT |   no CRT
               -----------+----------
1024-bit key:  260 exp/s |  41 exp/s
2048-bit key:   40 exp/s | 5.4 exp/s
4096-bit key:  5.4 exp/s | 0.7 exp/s

The numbers above are for the current 60 MHz clock and a single core 
instance. Based on the current resource usage (2% FFs, 4% LUTs, 4% DSPs, 
11% BRAMs), in theory we can cram about eight cores into the bitstream 
and increase the clock frequency six-fold, so the upper theoretical 
limit on signing speed is 48x the numbers above. I don't know yet how 
close we can get to that. I will try to do some test implementations in 
the coming days to have more specific numbers for the "restart" 
follow-up call.

As a side note, the "classic" boost from CRT on a conventional 
processor is 4x. Since the computational complexity of multiplication is 
O(n^2), halving the moduli makes each multiplication 4x cheaper. The 
exponent also becomes half as long, which is another 2x, but one has to 
do two exponentiations instead of one, so that 2x cancels out. A 
programmable logic implementation, on the other hand, can take advantage 
of that additional 2x, because it can do the two smaller exponentiations 
simultaneously. It can be clearly seen from the timings above that the 
non-CRT/CRT ratio approaches 8 as the key length increases.
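
For completeness, here is the CRT split in plain Python (illustrative 
only; the recombination at the end is cheap compared to the two 
half-size exponentiations, which the FPGA runs side by side):

    def sign_crt(m, p, q, d):
        dp, dq = d % (p - 1), d % (q - 1)    # half-length exponents
        sp = pow(m % p, dp, p)               # exponentiation mod p
        sq = pow(m % q, dq, q)               # exponentiation mod q (parallel in the FPGA)
        h = (pow(q, -1, p) * (sp - sq)) % p  # Garner's recombination
        return sq + q * h

    # Each half works on n/2-bit operands with an n/2-bit exponent, i.e.
    # ~1/8 of the full-size work; done one after the other that gives the
    # classic 4x, done in parallel it gives the ~8x seen above.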

The core should already be usable in its current state, but there are a 
few things that should be done in the future:

1) Determine how fast the DSP slices inside the core can be clocked. 
This is basically a matter of changing the clock period constraint, 
running an implementation, and seeing whether timing is met.

2) Try synthesizing a multi-core bitstream. The core takes advantage of 
the dedicated cascade paths between DSP slices. The problem is that 
Xilinx tools do not always take that into account and sometimes end up 
with an unroutable design, so manual floorplanning might be necessary.

3) Write at least some documentation.

4) The core is currently configured with NUM_MULTS = 8. This sets the 
number of parallel DSP slices doing the multiplication. NUM_MULTS can 
also be set to 16, 32 or 64. Doubling NUM_MULTS will synthesize a core 
that is twice as fast, but also twice as large. Basically we can trade 
resources for speed by adjusting NUM_MULTS. Note that for a multi-core 
signer the total performance is not affected by this setting: with 
NUM_MULTS = 16 we can only fit four larger cores into the device instead 
of eight smaller ones.

So far I haven't tried anything above 8. I suspect NUM_MULTS = 64 is 
too extreme and won't work as-is, as it would create a column of 64 DSP 
slices spanning ~2/3 of the device. All the DSP slices in a column share 
a common input, so there would be a huge 1:64 operand broadcast tree 
that most probably won't meet timing without additional pipelining.

Basically the question is finding the right balance between many 
smaller cores and a few larger cores. Maybe the current setting of 
NUM_MULTS = 8 is just fine and we won't want to change it.
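
As a back-of-the-envelope illustration of that trade-off, assuming 
resources and speed both scale linearly with NUM_MULTS (which is an 
approximation, not a measured result):

    # Per-core speed scales with NUM_MULTS, the number of cores that fit
    # scales inversely, so total throughput stays flat and only the
    # latency of a single signature changes.
    for num_mults in (8, 16, 32, 64):
        scale = num_mults // 8
        per_core_speed = scale                   # relative to NUM_MULTS = 8
        cores_that_fit = 8 // scale              # "about eight" at NUM_MULTS = 8
        print(num_mults, per_core_speed, cores_that_fit,
              per_core_speed * cores_that_fit)   # total stays the same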


-- 
With best regards,
Pavel Shatov

