[CryptechCommits] [user/shatov/modexpng] branch master updated: Beautified the README.md, should look somewhat less nasty now.
git at cryptech.is
git at cryptech.is
Wed Nov 13 19:07:16 UTC 2019
This is an automated email from the git hooks/postreceive script.
meisterpaul1 at yandex.ru pushed a commit to branch master
in repository user/shatov/modexpng.
The following commit(s) were added to refs/heads/master by this push:
new 65bf054 Beautified the README.md, should look somewhat less nasty now.
65bf054 is described below
commit 65bf05440677643d9c1f2ae6a0573315f52926c8
Author: Pavel V. Shatov (Meister) <meisterpaul1 at yandex.ru>
AuthorDate: Wed Nov 13 22:02:44 2019 +0300
Beautified the README.md, should look somewhat less nasty now.

README.md  342 +++++++++++++++++++++++++++++++++++
1 file changed, 194 insertions(+), 148 deletions()
diff git a/README.md b/README.md
index 1b08005..5f10a82 100644
 a/README.md
+++ b/README.md
@@ 1,162 +1,208 @@
# ModExpNG
+ModExpNG
+========
## Core Description
+Core Description
+
This "nextgeneration" IP core implements the modular exponentiation math primitive using the specialized DSP slices present in the Artix7 FPGA found on the CrypTech Alpha board. It can be used during RSA operations such as encryption/decryption and signing. Given a pair of blinding factors, the core internally blinds the message before signing and then unblinds it afterwards. The blinding factors are automatically mutated during each exponentiation for later reuse. Another prominent fe [...]
## CompileTime Settings
+CompileTime Settings
+
The core has one synthesistime parameter:
 * **NUM_MULTS**  Sets the number of DSP slices to use per modular multiplier, which must be a power of 2. Each multiplier consumes NUM_MULTS + 1 slices, since one additional multiplier is required to eliminate the final conditional subtraction step of the Montgomery modular multiplication routine. The core internally consists of a pair of <i>dual modular multipliers</i>, thus the total number of occupied DSP slices is 4 * (NUM_MULTS + 1). When in CRT mode, both <i>dual multipliers</i> [...]

 NUM_MULTS  DSP Usage 

 8  36 
 16  68 
 32  132 
 64  260 

Currently the core has been tested in hardware with a balanced setting of NUM_MULTS = 8. Larger values should decreate latency, but proportionally increase DSP slice usage. The core takes advantage of the dedicated highspeed cascade paths between DSP slices, thus all the NUM_MULTS slices must be placed in the same DSP column. Given that the column height for the Artix7 FPGA is 100 DSP slices, the upper limit on NUM_MULTS is 64. This has not been yet tested in hardware and may require f [...]

## API Specification

The interface of the core is similar to other CrypTech cores. FMC memory map is split into four regions, each region is 32 kilobits (4 kilobytes). The first one contains core registers and looks like the following:

 Offset  Register 

 0x0000  NAME0 
 0x0004  NAME1 
 0x0008  VERSION 
 0x0020  CONTROL 
 0x0024  STATUS 
 0x0040  MODE 
 0x0044  MODULUS_BITS 
 0x0048  EXPONENT_BITS 
 0x004C  BANK_BITS 
 0x0050  NUM_MULTS 
+ * **NUM\_MULTS**  Sets the number of DSP slices to use per modular multiplier, which must be a power of 2. Each multiplier consumes NUM\_MULTS + 1 slices, since one additional multiplier is required to eliminate the final conditional subtraction step of the Montgomery modular multiplication routine. The core internally consists of a pair of _dual modular multipliers_, thus the total number of occupied DSP slices is 4 * (NUM\_MULTS + 1). When in CRT mode, both _dual multipliers_ operate [...]
+
+Combined DSP slice utilization is outlined in the following table:
+
+<table border rules="groups" cellpadding="5">
+<colgroup></col></col></colgroup>
+<thead> <tr><th> NUM_MULTS </th><th> DSP Usage </td></tr></thead>
+<tbody align="right"><tr><td> 8 </td><td> 36 </td></tr></tbody>
+<tbody align="right"><tr><td> 16 </td><td> 68 </td></tr></tbody>
+<tbody align="right"><tr><td> 32 </td><td> 132 </td></tr></tbody>
+<tbody align="right"><tr><td> 64 </td><td> 260 </td></tr></tbody>
+</table>
+
+Currently the core has been tested in hardware with a balanced setting of NUM\_MULTS = 8. Larger values should decreate latency, but proportionally increase DSP slice usage. The core takes advantage of the dedicated highspeed cascade paths between DSP slices, thus all the NUM\_MULTS slices must be placed in the same DSP column. Given that the column height for the Artix7 FPGA is 100 DSP slices, the upper limit on NUM\_MULTS is 64. This has not been yet tested in hardware and may requir [...]
+
+API Specification
+
+
+The interface of the core is similar to other CrypTech cores. FMC memory map is split into four regions (REGISTERS, INPUT\_0, INPUT\_1 and OUTPUT), each region is 32 kilobits (4 kilobytes). The first one (REGISTERS) contains core registers and looks like the following:
+
+<table border rules="groups" cellpadding="5">
+<colgroup></col></colgroup>
+<thead><tr><th colspan=2> REGISTERS Memory Map </th></tr></thead>
+<colgroup></col></col></colgroup>
+<thead><tr><th> Offset </th><th> Register </th></tr></thead>
+<tbody><tr><td> 0x0000 </td><td> NAME0 </td></tr></tbody>
+<tbody><tr><td> 0x0004 </td><td> NAME1 </td></tr></tbody>
+<tbody><tr><td> 0x0008 </td><td> VERSION </td></tr></tbody>
+<tbody><tr><td> 0x0020 </td><td> CONTROL </td></tr></tbody>
+<tbody><tr><td> 0x0024 </td><td> STATUS </td></tr></tbody>
+<tbody><tr><td> 0x0040 </td><td> MODE </td></tr></tbody>
+<tbody><tr><td> 0x0044 </td><td> MODULUS_BITS </td></tr></tbody>
+<tbody><tr><td> 0x0048 </td><td> EXPONENT_BITS </td></tr></tbody>
+<tbody><tr><td> 0x004C </td><td> BANK_BITS </td></tr></tbody>
+<tbody><tr><td> 0x0050 </td><td> NUM_MULTS </td></tr></tbody>
+</table>
The core has the following registers:
 * **NAME0**, **NAME1**
Readonly core name ("``mode``", "``xpng``").

 * **VERSION**
Readonly core version, currently "``0.10``".

 * **CONTROL**
Register bits:</br>
``[31:2]`` Don't care, always read as 0</br>
``[1]`` "**next**" control bit</br>
``[0]`` Don't care, always read as 0</p>
The "**next**" control bit is used to start an exponentiation. The core is edgetriggered, this way if the bit is currently set, it must be cleared first and then set to 1 again to start a new exponentiation.

 * **STATUS**
Readonly register bits:</br>
``[31:2]`` Don't care, always read as 0</br>
``[1]`` "**valid**" status bit</br>
``[0]`` "**ready**" status bit, always read as 1</p>
The "**valid**" status bit is cleared as soon as the core starts exponentiation, and gets set after the operation is finished.

 * **MODE**
Mode register bits:</br>
``[31:2]`` Don't care, always read as 0</br>
``[1]`` "**CRT enable**" control bit</br>
``[0]`` Don't care, always read as 0</p>
The "**CRT enable**" control bit allows the core to take advantage of the Chinese remainder theorem to speed up RSA operations.

* **MODULUS_BITS**
Length of modulus in bits. Smallest allowed value is 64 * **NUM_MULTS** (currently 64 * 8 = 512), largest allowed value is 4096. The core rounds **MODULUS_BITS** down to the nearest multiple of 32 * **NUM_MULTS** (currently 32 * 8 = 256). Thus, allowed key lengths are 512, 768, 1024, 1280, ..., 4096. If the modulus is eg. 1000 bits wide, **MODULUS_BITS** must be set to 1024 and the modulus must be prepended with 24 zero bits to match the allowed length. Always set this to the length of t [...]

* **EXPONENT_BITS**
Length of exponent in bits. Smallest allowed value is 4, largest allowed value is 4096.

* **BANK_BITS**
Length of operand bank in bits. This readonly parameter returns the length of internal operand bank and allows the largest supported operand width to be determined at runtime. Currently **BANK_BITS** is hardwired to always return 4096.

* **NUM_MULTS**
This readonly parameter returns the width of the internal parallel multiplier, that was specified at compiletime. This parameter is currently 8.

The two following memory regions (2nd and 3rd) contain input quantities, each region is split into eight banks, each bank is 4096 bits (512 bytes). The least significat byte of an input quantity should be written to offset 0 in a bank, i.e. one should start filling banks by writing the least significant word at the lowest offset.

The second region contains the following banks:

 Offset  Bank 

 0x1000  M 
 0x1200  N 
 0x1400  N_FACTOR 
 0x1600  N_COEFF 
 0x1A00  X 
 0x1C00  Y 

 * **M** is the message to be signed (base).</br>
 * **N** is the modulus (public key).</br>
 * **N_FACTOR** is the **N** Montgomery factor and must be precomputed (described later).</br>
 * **N_COEFF** is the **N** modulusdependent speedup coefficient and must be precomputed (described later). Note, that the bank for **N_COEFF** is twice larger than normal banks.</br>
 * (**X**, **Y**) are a pair of blinding factors. Blinding is always enabled, there's no way to disable it, thus a pair of blinding factors is required for exponentiation. Note, that a pair of (**X**, **Y**) = (1, 1) works just fine, but you effectively aren't doing any blinding this way. The message is blinded using **Y** and unblinded using **X**, thus the relationship between the two must be <b>Y = (X ** 1) ** e</b>. Note, that this scheme won't work for encryption, either a pair of [...]

The third region contains the following banks. Note, that since the third region contains secret components, it is "writeonly", any reads from the region will return 0xDEADCODE.

 Offset  Bank 

 0x2000  D 
 0x2200  P 
 0x2300  DP 
 0x2400  P_FACTOR 
 0x2600  P_COEFF 
 0x2800  Q 
 0x2900  DQ 
 0x2A00  Q_FACTOR 
 0x2C00  Q_COEFF 
 0x2E00  QINV 

**D** is the exponent (secret exponent for signing, F4 is commonly used for encryption).</br>
**P**, **Q** are the smaller moduli, they must be supplied when CRT mode is enabled. Note, that their banks are twice smaller than normal banks.</br>
**DP**, **DQ** are private key components, they must be supplied when CRT mode is enabled. Note, that their banks are twice smaller than normal banks.</br>
**P_FACTOR**, **Q_FACTOR** are the **P** and **Q** Montgomery factors and must be precomputed (described later).</br>
**P_COEFF**, **Q_COEFF** are the **P** and **Q** modulidependent speedup coefficients and must be precomputed (described later).</br>
**QINV** is the private key compoment, it must be supplided when CRT mode is enabled.

The fourth region contains three banks where the core will store the output quantities:

 Offset  Bank 

 0x3000  S 
 0x3200  XM 
 0x3400  YM 

**S** is the signature (or ciphertext after encryption).</br>
(**XM**, **YM**) are a pair of mutated blinding factors.</br>


## Montgomery Factors and Modulusdependent Speedup Coefficients

The core uses the Montgomery modular multiplier, which can only operate on numbers in the so called Montgomery domain. Bringing input numbers into Montgomery domain and converting output numbers out of Montgomery domain requires a special quantity called Montgomery factor: **FACTOR** = 2 ** (2 * (**MODULUS_BITS** + **WORD_WIDTH**)) mod **N**. This quantity has at most as many bits as the modulus and should be precomputed during key generation. The core's internal data buses are 16bit wi [...]
```
F = 1
for i from 0 to 2 * (MODULUS_BITS + 16)  1
 F1 = F << 1
 F2 = F1  N
 F = (F2 < 0) ? F1 : F2
return F
```

The final step of Montgomery modular multiplication is Montgomery modular reduction. It is done by adding a multiple of the modulus to the intermediate product. The multiple is selected in such a way, that the lower half of the sum is all zero bits, so it can be safely reduced by just a trivial right shift. This speeds things up, since there's no more need to do computationally difficult division anymore. To determine the multiple of the modulus, another quantity is required, which is th [...]
```
R = 1
B = 1
NN = ~N + 1
for i from 1 to (MODULUS_BITS + 16  1)
 B = B << 1
 T = R * NN mod 2 ** (MODULUS_BITS + 16)
 if T[i] then
 R = R + B
return R
```
+ * **NAME0**, **NAME1**
+
+ Readonly core name ("``mode``", "``xpng``").
+
+ * **VERSION**
+
+ Readonly core version, currently "``0.10``".
+
+ * **CONTROL**
+
+ Register bits:
+ ``[31:2]`` Don't care, always read as 0
+ ``[1]`` "**next**" control bit
+ ``[0]`` Don't care, always read as 0
+
+ The "**next**" control bit is used to start an exponentiation. The core is edgetriggered, this way if the bit is currently set, it must be cleared first and then set to 1 again to start a new exponentiation.
+
+ * **STATUS**
+
+ Readonly register bits:
+ ``[31:2]`` Don't care, always read as 0
+ ``[1]`` "**valid**" status bit
+ ``[0]`` "**ready**" status bit, always read as 1
+
+ The "**valid**" status bit is cleared as soon as the core starts exponentiation, and gets set after the operation is finished.
+
+ * **MODE**
+
+ Mode register bits:
+ ``[31:2]`` Don't care, always read as 0
+ ``[1]`` "**CRT enable**" control bit
+ ``[0]`` Don't care, always read as 0
+
+ The "**CRT enable**" control bit allows the core to take advantage of the Chinese remainder theorem to speed up RSA operations.
+
+ * **MODULUS\_BITS**
+
+ Length of modulus in bits. Smallest allowed value is 64 * NUM\_MULTS (currently 64 * 8 = 512), largest allowed value is 4096. The core rounds MODULUS\_BITS down to the nearest multiple of 32 * NUM\_MULTS (currently 32 * 8 = 256). Thus, allowed key lengths are 512, 768, 1024, 1280, ..., 4096. If the modulus is eg. 1000 bits wide, MODULUS\_BITS must be set to 1024 and the modulus must be prepended with 24 zero bits to match the allowed length. Always set this to the length of the publi [...]
+
+ * **EXPONENT\_BITS**
+
+ Length of exponent in bits. Smallest allowed value is 4, largest allowed value is 4096.
+
+ * **BANK\_BITS**
+
+ Length of operand bank in bits. This readonly parameter returns the length of internal operand bank and allows the largest supported operand width to be determined at runtime. Currently BANK\_BITS is hardwired to always return 4096.
+
+ * **NUM_MULTS**
+
+ This readonly parameter returns the width of the internal parallel multiplier, that was specified at compiletime. This parameter is currently 8.
+
+The two following memory regions (INPUT\_0 and INPUT\_1) contain input quantities, each region is split into eight banks, each bank is 4096 bits (512 bytes). The least significat byte of an input quantity should be written to offset 0 in a bank, i.e. one should start filling banks by writing the least significant word at the lowest offset.
+
+The second region (INPUT\_0) contains the following banks:
+
+<table border rules="groups" cellpadding="5">
+<colgroup></col></colgroup>
+<thead><tr><th colspan=2> INPUT_0 Memory Map </th></tr></thead>
+<colgroup></col></col></colgroup>
+<thead><tr><th> Offset </th><th> Bank </th></tr></thead>
+<tbody><tr><td> 0x1000 </td><td> M </td></tr></tbody>
+<tbody><tr><td> 0x1200 </td><td> N </td></tr></tbody>
+<tbody><tr><td> 0x1400 </td><td> N_FACTOR </td></tr></tbody>
+<tbody><tr><td> 0x1600 </td><td> N_COEFF </td></tr></tbody>
+<tbody><tr><td> 0x1A00 </td><td> X </td></tr></tbody>
+<tbody><tr><td> 0x1C00 </td><td> Y </td></tr></tbody>
+</table>
+
+ * **M** is the message to be signed (base).
+
+ * **N** is the modulus (public key).
+
+ * **N\_FACTOR** is the **N** Montgomery factor and must be precomputed (described later).
+
+ * **N\_COEFF** is the **N** modulusdependent speedup coefficient and must be precomputed (described later). Note, that the bank for **N\_COEFF** is twice larger than normal banks.
+
+ * (**X**, **Y**) are a pair of blinding factors. Blinding is always enabled, there's no way to disable it, thus a pair of blinding factors is required for exponentiation. Note, that a pair of (**X**, **Y**) = (1, 1) works just fine, but you effectively aren't doing any blinding this way. The message is blinded using **Y** and unblinded using **X**, thus the relationship between the two must be **Y = (X \*\* 1) \*\* e**. Note, that this scheme won't work for encryption, either a pair of [...]
+
+The third region (INPUT\_1) contains the following banks. Note, that since the third region contains secret components, it is "writeonly", any reads from the region will return 0xDEADCODE.
+
+<table border rules="groups" cellpadding="5">
+<colgroup></col></colgroup>
+<thead><tr><th colspan=2> INPUT_1 Memory Map </th></tr></thead>
+<colgroup></col></col></colgroup>
+<thead><tr><th> Offset </th><th> Bank </th></tr></thead>
+<tbody><tr><td> 0x2000 </td><td> D </td></tr></tbody>
+<tbody><tr><td> 0x2200 </td><td> P </td></tr></tbody>
+<tbody><tr><td> 0x2300 </td><td> DP </td></tr></tbody>
+<tbody><tr><td> 0x2400 </td><td> P_FACTOR </td></tr></tbody>
+<tbody><tr><td> 0x2600 </td><td> P_COEFF </td></tr></tbody>
+<tbody><tr><td> 0x2800 </td><td> Q </td></tr></tbody>
+<tbody><tr><td> 0x2900 </td><td> DQ </td></tr></tbody>
+<tbody><tr><td> 0x2A00 </td><td> Q_FACTOR </td></tr></tbody>
+<tbody><tr><td> 0x2C00 </td><td> Q_COEFF </td></tr></tbody>
+<tbody><tr><td> 0x2E00 </td><td> QINV </td></tr></tbody>
+</table>
+
+ * **D** is the exponent (secret exponent for signing, F4 is commonly used for encryption).
+
+ * **P**, **Q** are the smaller moduli, they must be supplied when CRT mode is enabled. Note, that their banks are twice smaller than normal banks.
+
+ * **DP**, **DQ** are private key components, they must be supplied when CRT mode is enabled. Note, that their banks are twice smaller than normal banks.
+
+ * **P\_FACTOR**, **Q\_FACTOR** are the **P** and **Q** Montgomery factors and must be precomputed (described later).
+
+ * **P\_COEFF**, **Q\_COEFF** are the **P** and **Q** modulidependent speedup coefficients and must be precomputed (described later).
+
+ * **QINV** is the private key compoment, it must be supplided when CRT mode is enabled.
+
+The fourth region (OUTPUT) contains three banks where the core will store the output quantities:
+
+<table border rules="groups" cellpadding="5">
+<colgroup></col></colgroup>
+<thead><tr><th colspan=2> OUTPUT Memory Map </th></tr></thead>
+<colgroup></col></col></colgroup>
+<thead><tr><th> Offset </th><th> Bank </th></tr></thead>
+<tbody><tr><td> 0x3000 </td><td> S </td></tr></tbody>
+<tbody><tr><td> 0x3200 </td><td> XM </td></tr></tbody>
+<tbody><tr><td> 0x3400 </td><td> YM </td></tr></tbody>
+</table>
+
+ * **S** is the signature (or ciphertext after encryption).
+
+ * (**XM**, **YM**) are a pair of mutated blinding factors.
+
+Montgomery Factors and Modulusdependent Speedup Coefficients
+
+
+The core uses the Montgomery modular multiplier, which can only operate on numbers in the so called Montgomery domain. Bringing input numbers into Montgomery domain and converting output numbers out of Montgomery domain requires a special quantity called Montgomery factor: **FACTOR** = 2 \*\* (2 * (**MODULUS\_BITS** + **WORD\_WIDTH**)) mod **N**. This quantity has at most as many bits as the modulus and should be precomputed during key generation. The core's internal data buses are 16bi [...]
+
+ F = 1
+ for i from 0 to 2 * (MODULUS_BITS + 16)  1
+ F1 = F << 1
+ F2 = F1  N
+ F = (F2 < 0) ? F1 : F2
+ return F
+
+The final step of Montgomery modular multiplication is Montgomery modular reduction. It is done by adding a multiple of the modulus to the intermediate product. The multiple is selected in such a way, that the lower half of the sum is all zero bits, so it can be safely reduced by just a trivial right shift. This speeds things up, since there's no more need to do computationally difficult division anymore. To determine the multiple of the modulus, another quantity is required, which is th [...]
+
+ R = 1
+ B = 1
+ NN = ~N + 1
+ for i from 1 to (MODULUS_BITS + 16  1)
+ B = B << 1
+ T = R * NN mod 2 ** (MODULUS_BITS + 16)
+ if T[i] then
+ R = R + B
+ return R
The `stm32/modexpng_util.c` file also has reference C code, that computes the Montgomery factor and the speedup coeficient. Note that each keypair ideally requires three pairs of precomputed numbers: one for the public key **N** and one for each of the smaller secret moduli **P** and **Q**.
## References ##
1. [Hardware Aspects of Montgomery Modular Multiplication](https://eprint.iacr.org/2017/1115.pdf)
+References
+
+
+ 1. [Hardware Aspects of Montgomery Modular Multiplication](https://eprint.iacr.org/2017/1115.pdf)

To stop receiving notification emails like this one, please contact
the administrator of this repository.
More information about the Commits
mailing list