# [Cryptech Tech] RSA Key Format

Pavel Shatov meisterpaul1 at yandex.ru
Tue Mar 12 11:17:02 UTC 2019

```07.03.2019 9:15, Rob Austein пишет:
>
>> Software implementations typically compute only the lowest 32 bits of
>> this coefficient on the fly and then work by repeatedly clearing the
>> lower 32 bits of the intermediate product. This is not viable in
>> hardware, because both hardware multipliers and adders have
>> latency. As soon as you're finished computing the multiple of modulus,
>> you need to add it to the intermediate result. While you're adding,
>> the multipliers will be idle, once you've added, you start multiplying
>> and the adders are stalled. Hardware needs entire coefficient to keep
>> the math pipeline busy at all times.
>>
>> We currently have two decicated pieces of Verilog that do the two
>> precomputations. I'm trying to understand whether it would be viable
>> to offload the computation to the STM32 and get rid of those Verilog
>> modules to simplify the core.
>
> What, you don't want to do it on a RISC-V?
>
> Trying this in software is might be worth the experiment.  Is this
> dependent only on the key's public components?  Constant time is nice,
> but is not so critical when the input values are already public.

Haven't thought about doing it on a RISC-V core, good point. Also nice
that you mentioned constant-time. Given modulus N which is L bits long,
the Montgomery factor is 2 ** (2*L) mod N, I currently compute it in
Verilog this way:

1. F = 1
2. for i=0 to 2*N-1
3.   F1 = F << 1
4.   F2 = F1 - N
5.   F = (F2 < 0) ? F1 : F2

If we're doing CRT, we operate on the smaller primes which are secret,
so yes, that must be constant-time. And that was the initial reason I
wrote that helper Verilog module.

The reduction coefficient is more complicated, since it effectively
needs modular inversion: R = -N ** -1 mod 2 ** L. There are several
algorithms in the literature, I'm currently doing it this way:

1. R = 1
2. B = 1
3. NN = ~N + 1
4. for i=1 to L-1
5.   B = B << 1
6.   T = R * NN mod 2 ** L
7.   if T[i] then
8.     R = R + B

I believe there may exist faster ways to do it, but given that this
computation should only be done one per key generation, I don't know
whether it will make sense to further optimize this. I believe
constant-time is also required here since we need the coefficients for
the smaller primes to do CRT.

In that light I'm starting to think that my idea to offload the
computation to STM32 is not that smart after all. Speaking of RISC-V,
can it get us true constant-time operation?

>
>> There's also a math trick that allows you to get ~10% speed increase
>> at the expense of precomputing one more word of the reduction
>> coefficient than there are words in the modulus. ModExp internally
>> operates on 16-bit words, because that's what the math slices in the
>> FPGA can handle. So to take advantage of the trick we need to store
>> 1040-bit quantity for 1024-bit keys, 2064-bit quantity for 2048-bit
>> keys, etc. I wonder how inconvenient that might be?
>
> So you want to trade in your old extra storage for new?  OK.
> Plenty of PRIVATE tag values left if we need them.

Yes, that was basically my idea.

>
>
> The other thing we discussed briefly a few days ago was short-lived
> storage for RSA blinding factors, so that we can amortize the cost of
> generating them over multiple signature operations with some kind of
> relatively cheap mutation after each signature.  We currently do this
> using a bit of memory hanging off the side of the in-memory pkey
> object.  I don't think it makes sense to stuff blinding factors into
> the keystore, since that would require a flash write after each
> signature and this isn't really the sort of thing we want to preserve
> for very long in any case.  So I think the current tradeoff there is
> about right, but figured I should put this out there while we're
> discussing the rest of it in case others have better ideas.

I'm now looking into how to integrate blinding into the core. Suppose
that our modulus is N = P * Q and the message to sign is M. When doing
CRT, we do two "easier" exponentiations mod P and mod Q, but the message
M is twice larger. So we have to first compute two new bases MP = M mod
P and MQ = M mod Q. Now do I get it right, that what we want to do is we
blind the original twice larger message M? In theory we can blind the
two smaller bases separately. Okay, the latter may be a totally stupid
thing, because I haven't worked out all the math details yet, just asking.

--
With best regards,
Pavel Shatov
```