[Cryptech Tech] RSA Key Format

Rob Austein sra at hactrn.net
Thu Mar 7 06:15:47 UTC 2019

On Wed, 06 Mar 2019 13:21:35 -0500, Pavel Shatov wrote:
> Rob, I have certain questions about the way we store RSA keys. The
> background for those who have not been following is that I've been
> working on a faster ModExp core with full CRT support and integrated
> blinding. Those modifications will change the distribution of
> computations done inside the STM32 and on the FPGA, so as Rob pointed
> out, it makes sense to re-evaluate where and when specific things are
> carried out.
> The "standard" RSA key (at least in the OpenSSL sense, as far as I
> know) consists of the following quantities: modulus, public exponent,
> secret exponent, two smaller primes, two shorter exponents and CRT
> helper coefficient.

OpenSSL supports more than one format, but the important ones in this
context are PKCS #1.5 and PKCS #8.  PKCS #8 is a generic private key
format consisting of an OID and some inner ASN.1 content; in the case
of RSA keys, that inner content is PKCS #1.5.

Most of the Cryptech code dealing with private keys uses PKCS #8,
because the OID labels the key type without need of additional magic.

> Our ModExp currently needs two additional
> quantities besides the aforementioned "standard" set of numbers.
> The first one is the "Montgomery factor", it only depends on the
> modulus and is relatively easy to compute. Given modulus N which is L
> bits long, the factor is 2 ** (2*L) mod N.
> The second quantity is the modulus-dependent reduction
> coefficient. Montgomery modular reduction works by adding a certain
> number of multiples of the modulus to the intermediate result to make
> some of the least significant bits zero. Then the product is reduced
> by simply shifting it to the right. The problem is to determine how
> many multiples to add, and that's where the coefficient comes into
> play.

We currently handle these via a private extension to PKCS #1.5, which
we strip off in the (rare) cases where we export private keys.  The
code for this is in git.cryptech.is/sw/libhal/rsa.c, but may be a bit
obscure if you haven't memorized RFCS 2313 and 5208.  Basically, the
inner (extended PKCS #1.5) looks like:

   RSAPrivateKey ::= SEQUENCE {
     version		INTEGER,
     modulus		INTEGER, -- n
     publicExponent	INTEGER, -- e
     privateExponent	INTEGER, -- d
     prime1		INTEGER, -- p
     prime2		INTEGER, -- q
     exponent1		INTEGER, -- d mod (p-1)
     exponent2		INTEGER, -- d mod (q-1)
     coefficient	INTEGER, -- (inverse of q) mod p

where the *C values are reduction coefficients and the *F values are
"Montgomery factors".  We store these on first use for n, p, and q.
They're OPTIONAL so that we don't need to waste space on ones we
aren't using, and they're OCTET STRINGs rather than INTEGERs because
they're fixed length byte strings retrieved from the ModExp core.

There are of course other formats we could use, but we had to support
ASN.1 in any case, and using formats that other tools understand makes
debugging a bit simpler. :)

> Software implementations typically compute only the lowest 32 bits of
> this coefficient on the fly and then work by repeatedly clearing the
> lower 32 bits of the intermediate product. This is not viable in
> hardware, because both hardware multipliers and adders have
> latency. As soon as you're finished computing the multiple of modulus,
> you need to add it to the intermediate result. While you're adding,
> the multipliers will be idle, once you've added, you start multiplying
> and the adders are stalled. Hardware needs entire coefficient to keep
> the math pipeline busy at all times.
> We currently have two decicated pieces of Verilog that do the two
> precomputations. I'm trying to understand whether it would be viable
> to offload the computation to the STM32 and get rid of those Verilog
> modules to simplify the core.

What, you don't want to do it on a RISC-V?

Trying this in software is might be worth the experiment.  Is this
dependent only on the key's public components?  Constant time is nice,
but is not so critical when the input values are already public.

> I think, that it may even be possible to not store the Montgomery
> factor at all and just precompute it on the fly when a key is loaded.
> The reduction coefficient is computed according to the extended
> Euclidean algorithm and, as far as I remember, takes about the same
> time as the exponentiation itself, so it still makes sense to
> precompute it and store along with other key components.

I'm fine with halving the storage requirements for the extra data :)

> There's also a math trick that allows you to get ~10% speed increase
> at the expense of precomputing one more word of the reduction
> coefficient than there are words in the modulus. ModExp internally
> operates on 16-bit words, because that's what the math slices in the
> FPGA can handle. So to take advantage of the trick we need to store
> 1040-bit quantity for 1024-bit keys, 2064-bit quantity for 2048-bit
> keys, etc. I wonder how inconvenient that might be?

So you want to trade in your old extra storage for new?  OK.
Plenty of PRIVATE tag values left if we need them.

The other thing we discussed briefly a few days ago was short-lived
storage for RSA blinding factors, so that we can amortize the cost of
generating them over multiple signature operations with some kind of
relatively cheap mutation after each signature.  We currently do this
using a bit of memory hanging off the side of the in-memory pkey
object.  I don't think it makes sense to stuff blinding factors into
the keystore, since that would require a flash write after each
signature and this isn't really the sort of thing we want to preserve
for very long in any case.  So I think the current tradeoff there is
about right, but figured I should put this out there while we're
discussing the rest of it in case others have better ideas.

More information about the Tech mailing list