[Cryptech Tech] RSA Key Format

Pavel Shatov meisterpaul1 at yandex.ru
Wed Mar 6 18:21:35 UTC 2019


Rob, I have certain questions about the way we store RSA keys. The 
background for those who have not been following is that I've been 
working on a faster ModExp core with full CRT support and integrated 
blinding. Those modifications will change the distribution of 
computations done inside the STM32 and on the FPGA, so as Rob pointed 
out, it makes sense to re-evaluate where and when specific things are 
carried out.

The "standard" RSA key (at least in the OpenSSL sense, as far as I know) 
consists of the following quantities: modulus, public exponent, secret 
exponent, two smaller primes, two shorter exponents and CRT helper 
coefficient. Our ModExp currently needs two additional quantities 
besides the aforementioned "standard" set of numbers.

The first one is the "Montgomery factor", it only depends on the modulus 
and is relatively easy to compute. Given modulus N which is L bits long, 
the factor is 2 ** (2*L) mod N.

The second quantity is the modulus-dependent reduction coefficient. 
Montgomery modular reduction works by adding a certain number of 
multiples of the modulus to the intermediate result to make some of the 
least significant bits zero. Then the product is reduced by simply 
shifting it to the right. The problem is to determine how many multiples 
to add, and that's where the coefficient comes into play.

Software implementations typically compute only the lowest 32 bits of 
this coefficient on the fly and then work by repeatedly clearing the 
lower 32 bits of the intermediate product. This is not viable in 
hardware, because both hardware multipliers and adders have latency. As 
soon as you're finished computing the multiple of modulus, you need to 
add it to the intermediate result. While you're adding, the multipliers 
will be idle, once you've added, you start multiplying and the adders 
are stalled. Hardware needs entire coefficient to keep the math pipeline 
busy at all times.

We currently have two decicated pieces of Verilog that do the two 
precomputations. I'm trying to understand whether it would be viable to 
offload the computation to the STM32 and get rid of those Verilog 
modules to simplify the core.

I think, that it may even be possible to not store the Montgomery factor 
at all and just precompute it on the fly when a key is loaded.

The reduction coefficient is computed according to the extended 
Euclidean algorithm and, as far as I remember, takes about the same time 
as the exponentiation itself, so it still makes sense to precompute it 
and store along with other key components.

There's also a math trick that allows you to get ~10% speed increase at 
the expense of precomputing one more word of the reduction coefficient 
than there are words in the modulus. ModExp internally operates on 
16-bit words, because that's what the math slices in the FPGA can 
handle. So to take advantage of the trick we need to store 1040-bit 
quantity for 1024-bit keys, 2064-bit quantity for 2048-bit keys, etc. I 
wonder how inconvenient that might be?

With best regards,
Pavel Shatov

More information about the Tech mailing list