# [Cryptech Tech] RSA Key Format

Pavel Shatov meisterpaul1 at yandex.ru
Wed Mar 6 18:21:35 UTC 2019

```Hi,

Rob, I have certain questions about the way we store RSA keys. The
background for those who have not been following is that I've been
working on a faster ModExp core with full CRT support and integrated
blinding. Those modifications will change the distribution of
computations done inside the STM32 and on the FPGA, so as Rob pointed
out, it makes sense to re-evaluate where and when specific things are
carried out.

The "standard" RSA key (at least in the OpenSSL sense, as far as I know)
consists of the following quantities: modulus, public exponent, secret
exponent, two smaller primes, two shorter exponents and CRT helper
coefficient. Our ModExp currently needs two additional quantities
besides the aforementioned "standard" set of numbers.

The first one is the "Montgomery factor", it only depends on the modulus
and is relatively easy to compute. Given modulus N which is L bits long,
the factor is 2 ** (2*L) mod N.

The second quantity is the modulus-dependent reduction coefficient.
Montgomery modular reduction works by adding a certain number of
multiples of the modulus to the intermediate result to make some of the
least significant bits zero. Then the product is reduced by simply
shifting it to the right. The problem is to determine how many multiples
to add, and that's where the coefficient comes into play.

Software implementations typically compute only the lowest 32 bits of
this coefficient on the fly and then work by repeatedly clearing the
lower 32 bits of the intermediate product. This is not viable in
hardware, because both hardware multipliers and adders have latency. As
soon as you're finished computing the multiple of modulus, you need to
will be idle, once you've added, you start multiplying and the adders
are stalled. Hardware needs entire coefficient to keep the math pipeline
busy at all times.

We currently have two decicated pieces of Verilog that do the two
precomputations. I'm trying to understand whether it would be viable to
offload the computation to the STM32 and get rid of those Verilog
modules to simplify the core.

I think, that it may even be possible to not store the Montgomery factor
at all and just precompute it on the fly when a key is loaded.

The reduction coefficient is computed according to the extended
Euclidean algorithm and, as far as I remember, takes about the same time
as the exponentiation itself, so it still makes sense to precompute it
and store along with other key components.

There's also a math trick that allows you to get ~10% speed increase at
the expense of precomputing one more word of the reduction coefficient
than there are words in the modulus. ModExp internally operates on
16-bit words, because that's what the math slices in the FPGA can
handle. So to take advantage of the trick we need to store 1040-bit
quantity for 1024-bit keys, 2064-bit quantity for 2048-bit keys, etc. I
wonder how inconvenient that might be?

--
With best regards,
Pavel Shatov
```