[Cryptech Tech] ModExpNG Release
Pavel Shatov
meisterpaul1 at yandex.ru
Thu Oct 24 11:28:10 UTC 2019
Hi,
I've pushed the ModExpNG ("next-generation") IP core, that I've been
working on since December into the Git repository under
/user/shatov/modexpng. Honestly, this has taken substantially longer,
than I initially anticipated. I guess there are primarily two reasons
for this. The first one is already ten months old and is now learning
how to stand on his feet without support. The second one is the
interruption of funding for the project.
The new core is mostly compatible with the ModExpA7 core we currently
use, but there are a few differences:
1) The top-level wrapper module has an extra port called 'clk_core',
that clocks DPS slices inside of the core. This allows the math logic to
be clocked by a frequency different from the system (FMC bus) clock.
We currently use the 180/3 = 60 MHz clock for both I/O and cores. As an
effort to increase signing performance, some time ago we decided to go
for 180/4 = 45 MHz I/O and 45*2 = 90 MHz cores, but didn't actually do
the change. The modified version of FMC arbiter has been sitting in my
Git stash. With this new port we will be able to go for a different
scheme with 45 MHz I/O, 45*2 = 90 MHz for all the other cores and 45*N
MHz for ModExpNG. The upper limit on N seems to be 8, according to the
datasheet, i.e. 45*8 = 360 MHz, but this has not been tested yet.
2) The core needs the same set of extra pre-computed numbers as the
ModExpA7, but it can't pre-compute them by itself anymore. The
pre-computation algorithm is totally different from the exponentiation
procedure, so it made sense to leave that out and concentrate on
optimization of just the exponentiation. I believe the pre-computation
has to be done in the FPGA, because it involves the smaller secret
moduli P and Q, so it must be constant-time at least. We can borrow the
pre-computation modules from ModExpA7 (which was already written to
operate in constant-time) and turn them into "helper" math modules, that
is basically about just writing two wrappers. Another option is, since
there have been talks about using a RISC-V core, we might try to do the
pre-computations on a soft processor.
3) The core is hardwired to always do blinding. There are several
reasons for this. First, Rob and I did some research, and the consensus
seems to be something like "there are no solid proofs, that constant
time modular exponentiation, even when hardened against power analysis
and other potential attacks, is safe without blinding, so we'd better
leave blinding enabled just in case". Second, the overhead from blinding
is too small (<1%) to justify the extra complexity of adding a dedicated
control bit and all that logic, that would be necessary to make internal
operations conditional. If one still absolutely doesn't want to do
blinding for some reason, a pair of (1, 1) can be used as the blinding
factors.
I've also written some demo C code for the STM32 to demonstrate how to
use the new core. The code uses randomized test vectors generated by the
Python math model in /user/shatov/modexpng_fpga_model.
I also did measure the actual core performance in hardware. The Alpha
board has a set of spare "IRQ" lines going from the STM32 to the FPGA,
so I temporarily modified the demo driver to initially set the IRQ_0 pin
low, then drive it high just before pulsing the "next" control bit and
then again driving it low after the "valid" bit goes high. On the FPGA
side I added a counter, that counts the number of ticks the IRQ_0 line
stays high. Okay, yes, maybe it would have been easier to just use a
scope and blink an LED, but I was too lazy to dust it off, sorry.
Below are the times it takes to do an exponentiation. Of course actual
singing speed will be somewhat lower, because other steps, such as
hashing the message, writing the hash into the core, etc are not taken
into account:
w/ CRT | no CRT
--------+--------
1024-bit key: 3,9 ms | 24 ms
2048-bit key: 25 ms | 182 ms
4096-bit key: 184 ms | 1424 ms
The same thing in exponentiations per second:
w/ CRT | no CRT
-----------+----------
1024-bit key: 260 exp/s | 41 exp/s
2048-bit key: 40 exp/s | 5,4 exp/s
4096-bit key: 5,4 exp/s | 0,7 exp/s
The numbers above are for the current 60 MHz clock and single core
instance. Based on current resource usage (2% FFs, 4% LUTs, 4% DSPs, 11%
BRAMs), in theory we can crunch about eight cores into the bitstream and
increase the clock frequency six times, so the upper theoretical limit
on singing speed is 48x the numbers above. I don't know yet, how close
we can get to that. I will try to do some test implementations in the
following days to have more specific numbers for the "restart" follow-up
call.
As a side note, the "classic" boost from CRT on a conventional processor
is four times. Since computational complexity of multiplication is
O(n**2), given twice smaller moduli, it takes 4x fewer operations. The
exponent also becomes two times smaller, that's another 2x, but one has
to do two exponentiations instead of one, so that 2x doesn't count.
Programmable logic implementation on the other hand can take advantage
of that additional 2x, because it can do the two easier exponentiations
simultaneously. It can be clearly seen from the timings above, that the
non-CRT/CRT ratio approaches 8 as the key length increases.
The core should already be usable in its current state, but there are a
few things that should be done in the future:
1) Determine how fast the DSP slices inside of the core can be clocked.
This is basically changing the clock period constraint, then running an
implementation and seeing whether timing is met.
2) Try synthesizing a multi-core bitstream. The core takes advantage of
the dedicated cascade paths between DSP slices. The problem is that
Xilinx tools do not always take that into account and sometimes end up
with an unroutable design, so manual floorplanning might be necessary.
3) Write at least some documentation.
4) The core is currently configured with NUM_MULTS = 8. This sets the
number of parallel DSP slices doing the multiplication. NUM_MULTS can
also be set to 16, 32 or 64. Doubling NUM_MULTS will synthesize a core,
that is twice faster, but is also twice larger. Basically we can trade
resources for speed by adjusting NUM_MULTS. Note, that for multi-core
signer the total performance is not affected by this setting, with
NUM_MULTS = 16 we can only fit four larger cores into the device instead
of eight smaller ones.
So far I haven't tried anything above 8. I suspect NUM_MULTS = 64 is too
extreme won't work as-is, as it will create a column of 64 DSP slices
that spans ~2/3 of the device. All the DSP slices in a column share a
common input, so there will be a huge 1:64 operand broadcast tree that
most probably won't meet timing without additional pipelining.
Basically the question is to find the right balance between many smaller
cores or few larger cores. Maybe the current setting of NUM_MULTS = 8 is
just fine and we won't want to change it.
--
With best regards,
Pavel Shatov
More information about the Tech
mailing list