[Cryptech-Commits] [core/math/modexpng] branch master created (now 465cdf0)

git at cryptech.is git at cryptech.is
Sat Mar 14 18:18:39 UTC 2020


This is an automated email from the git hooks/post-receive script.

paul at psgd.org pushed a change to branch master
in repository core/math/modexpng.

      at 465cdf0  Moved modexpng from user/shatov to core/math.

This branch includes the following new commits:

     new 6194d48  Randomized test vector generation scripts for ModExpNG.
     new ca52eb0  Don't track the test vector itself.
     new c01d11d  Instructions on how to use the vector generation scripts.
     new eb859f8  Added blinding support to test vector generation scripts.
     new f1a8087  Updated readme file.
     new 3ef1813  ModExpNG ("Next Generation") math model.
     new ecbc1b7  Added blinding into math model.
     new c30c0bd  Mutate blinding tuple.
     new 701e3f1  Added optional output of intermediate quantities for debugging. Reworked index rotation code for better readability.
     new 711ffbd  Simplified index calculation and accumulator clearing logic. Better debug printout of accumulators.
     new b5a8b52   * more debugging output  * more precise modelling of DSP slice
     new 766bb93  Rewrote "square" recombination to match how it works in hardware.
     new a105c87  Same changes for "triangle" multiplication phase as for the "square" one (debugging output, simpler MAC clearing and index rotation logic).
     new ac6bc69  Cosmetic fixes.
     new aaf45e2  Removed some boilerplate code, all the three multiplication flavours are now working consistently. Still need to rework recombination routines.
     new e79b4bb  Fixed 4096-bit test vector generation.
     new 345be75  Intermediate version to fix recombination overflow bug.
     new c165ddc  * Added more debugging options:  - intentionally trigger internal overflow handler  - dump MAC inputs  - dump intermediate numbers during the reduction phase
     new 66be583  * Started conversion of the model to use micro-operations
     new a5200cd  * Added more micro-operations
     new b0fb263  * MASSIVE CLEANUP
     new 0beee22  * More cleanup (got rid of .wide. and .narrow.)
     new ec07464  Moved to "modexpng_fpga_model" repo, this one was meant for Verilog.
     new 29fb6af  Started working on the pipelined Montgomery modular multiplier. Currently can do the "square" part of the multiplication, i.e. compute the twice larger intermediate product AB = A * B.
     new 9e9689d  Further work on the Montgomery modular multiplier. Can now do the "triangular" part of multiplication, i.e. compute the "magic" reduction coefficient Q = LSB(AB) * N_COEFF.
     new ecf0374  Further work on the Montgomery modular multiplier. Added the third "rectangular" stage of the multiplication process, i.e. computation of how many copies of the modulus N to add to the intermediate product AB to zeroize the lower half: M = Q * N.
     new 3ea94c8  Implemented the final stage of the Montgomery modular multiplication, i.e. addition of AB and M then reduction by right-shift.
     new fde62e3  Major rewrite (different core hierarchy, buses, wrappers, etc).
     new 71f7025  Redesigned core architecture, unified bank structure. All storage blocks now have eight 4kbit entries and occupy one 36K BRAM tile.
     new 0b4b42d  Redesigned storage modules, added top-level module, added I/O storage space.
     new affada8  Reworked storage architecture (moved I/O memory to a separate module, since there's only one instance of input/output values, while storage manager has dual storage space for P and Q multipliers).
     new 8ee5a19  Expanded micro-operation parameters (added dedicated control bit to force the B input of the modular multiplier to 1, this is necessary to bring numbers out of Montgomery domain).
     new e340b14  Added more micro-operations, also added "general worker" module. The worker is basically a block memory data mover, but it can also do some supporting operations required for the Garner's formula part of the exponentiation.
     new 0224778  Added more micro-operations, entire Montgomery exponentiation ladder works now.
     new 1e33032  Refactored general worker module. Added modular subtraction micro-operation.
     new 3213b3e  Added "MERGE_LH" micro-operation. To be able to do Garner's formula we need regular (not modular) multiplication. We're doing this by telling the modular multiplier to stop after the "square" step, which computes A*B. The problem is that the multiplier stores the lower part of the product in the internal bank L and the upper part in the internal bank H, but we need to be able to do operations on the product as a whole. MERGE_LH that combines the two halves of the product [...]
     new 3633901  Added the regular (not modular) addition operation required during the final step of the Garner's formula algorithm. Note, that the addition is "uneven" in the sense, that the first operand is full-size (as wide as the modulus), while the second one is only half the size. The adder internally banks the second input port during the second half of the addition.
     new 9eac252  Entire CRT signature algorithm works by now.
     new 72902f5  Redesigned the testbench. Core clock no longer needs to be twice as fast as the bus clock. It can be the same, or say four times faster.
     new 69b5d9f  Added support for non-CRT mode. Further refactoring.
     new 584393a  Further work:  - added core wrapper  - fixed module resets across entire core (all the resets are now consistently    active-low)  - continued refactoring
     new edd5efd  Reworked testbench, clk_sys and clk_core can now have any ratio, not necessarily 1:2.
     new 57d250b  Fixed all the testbenches to work with the latest RTL sources.
     new 657678a  Added simulation-only code to measure multiplier load.
     new 88f46be  Fixed port width mismatch warning.
     new d50bb60  Added readme file.
     new 0f111bf  Added demo driver code for STM32.
     new 6d2cdbf  Added missing copyright headers.
     new 65bf054  Beautified the README.md, should look somewhat less nasty now.
     new f4771a7  The uOP engine didn't compile at 180 MHz. The pipeline had two stages: FETCH and DECODE. Apparently one clock cycle is not enough to entirely decode an instruction, so decoding now takes two clock cycles (DECODE_1 and DECODE_2). This seems to solve the problem. If we run into more timing violations here, we can add an extra DECODE_3 cycle and register the currently combinatorial uop_opcode_* flags at DECODE_2. This fix increases the core's latency by 59/32 clock cycles ( [...]
     new 863cac9  Refactored reductor module.
     new 189dfd6  Removed the latch accidentally created while pipelining the uOP engine module. The FSM previously had four states encoded using two bits, so the next state logic didn't have a default case, since all the possible states were used. Addition of the fifth state required one more state bit, so the FSM now has five states out of eight possible and a default case is thus necessary.
     new 157d5de  Small change to the reductor module to try to get past 180 MHz. Previously BRAM outputs were going directly into a LUT-based ternary adder which was causing timing problems. Added a layer of flip-flops, so instead of BRAM -> LUT -> FF we have BRAM -> FF -> LUT -> FF. This increases core latency by (number_of_supporting_modular_multiplications + number_of_exponent_bits) ticks.
     new 59a54d6  Forgot to push minor cosmetic fix.
     new 1d0458f  Cosmetic fix.
     new 6791175  One more cosmetic fix.
     new 83f8779  Had to rework the general worker module to reach 180 MHz core clock. The module is responsible for doing certain supporting operations (mostly moving operands between banks and doing some simple math operations, such as modular subtraction and regular addition). Depending on the particular operation, one of three bank address space sweep patterns was used:  * one-pass (for things like carry propagation)  * two-pass (for things like modular subtraction that produce interm [...]
     new 6a0438e  Turns out, fabric addition and subtraction in the general worker module are actually in the critical paths of the ModExpNG core and are plaguing the place and route tools. I was barely able to achieve timing closure at 180 MHz even with the highest Map and PaR effort levels. This means that any further clock frequency increase is effectively impossible, moreover any small change in the design may prevent it from meeting timing constraints. The obvious solution is to use DS [...]
     new e5f4454  Reworked modular subtraction micro-operation. Previously it used "two-pass" bank address space sweep, during the first pass (a-b) and (a-b+n) were computed, during the second pass either the former or the latter quantity was written to the output bank (depending on the very last borrow flag value). This is no longer possible, since the FSM now only generates one "interleaved" address space sweep. The solution is to split one complex modular subtraction operation into sim [...]
     new ab061af  This commit modifies the REGULAR_ADD_UNEVEN micro-operation to use DSP slices for addition instead of fabric logic. This opcode is only necessary when in CRT mode and is executed once per entire exponentiation to recombine the two "easier" exponentiations. This was the final change necessary to get rid of using fabric math in the general worker module.
     new b985d45   * DSP slices now have two use modes: MULT and ADD/SUB  * cosmetic rename of Verilog include file
     new 147dcd3  Removed old DSP wrappers.
     new a1314f3  Added two pairs of new wrappers.
     new b585d25  Cosmetic fix that only involves debug output during simulation.
     new b590661  Updated microcode source to match the changes made to general worker module.
     new b8c0536  Updated uOP engine to match the changes made to the general worker module (modular subtraction was split into three micro-operations instead of one).
     new 6492883  For the new general worker module to work we need dynamic switching of DSP OPMODE, ALUMODE and CARRYINSEL ports, thus more defined constants.
     new 8f7829c  Tiny cosmetic typo fix ("dst" -> "dsp")
     new 2345e42  Added more meaningful constants to avoid certain hardcoded numbers.
     new c551625  Refactored modular reductor module.
     new 04bc457  Renumbered micro-operations.
     new 76f89d6  The I/O manager has to work in sync with the general worker module. Made the necessary changes to make it work after the general worker update. Also moved debug simulation-time code into a separate file.
     new e68e11a  Update DSP wrapper instance names.
     new c6029af  Refactored MMM recombinator module, accommodated the changes in DSP slice wrapper names.
     new b0cbd33  Refactored the MMM module, now uses meaningful constant names from the include file, not hardcoded widths.
     new c4bee71  Cosmetic change to easily switch tests on/off.
     new 97d5b79  Bump version number.
     new fa360f8  Cosmetic rename of FSM states.
     new 6b335ce   * more consistent port names  * optional two-stage pipeline for A&B ports
     new 6093e06  Accommodate the changes to DSP slice wrappers.
     new d2ae99a  This commit accommodates the changes made to DSP slice wrappers and also fixes the subtle math overflow bug introduced while switching to DSP-based partial multiplication product recombination.
     new 9217682  Uniform testbenches.
     new bc4eb54  Better handling of debug output (no need to manually adjust word count anymore).
     new c5d2101  Updated DSP slice wrappers for the new partial product recombination algorithm:  - unified clock enable for A:B and C ports  - A:B and C ports now always have fixed 1-cycle latency  - added new Z multiplexor modes in the generic wrapper
     new 2500312  Added new DSP slice OPMODEs for the new recombination algorithm.
     new 58ad4d9  Adapted to the changes in the DSP slice wrappers.
     new f2598e0  Improved debugging options:  * flush console after each ladder iteration for smoother progress output  * ability to truncate internal powering ladder loop at desired step (this will    only work when using simulation mode, obviously)
     new 77d1148  New partial product recombination algorithm.
     new 13abd96  Increment version number.
     new 2791a17  More elegant way to do partial product recombination:  * take advantage of the cascade paths between DSP slices  * decrease latency of operation
     new d35ff84  Update STM32 demo driver.
     new 465cdf0  Moved modexpng from user/shatov to core/math.
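
For readers unfamiliar with the multiplier stages named in commits 29fb6af
through 3ea94c8, the following is a minimal Python sketch of one Montgomery
multiplication split into the same four steps ("square", "triangle",
"rectangle", then add and right-shift). It is not taken from the repository.
The names AB, Q, M and N_COEFF follow the commit messages; the function names,
the parameter k (operand width in bits), the assumption that
N_COEFF = -N^(-1) mod 2^k, and the final conditional subtraction are
illustrative assumptions, and the real core works on words in DSP slices
rather than on Python big integers.

```python
def montgomery_n_coeff(n, k):
    """Assumed definition of N_COEFF: -n^(-1) mod 2^k (n must be odd)."""
    return (-pow(n, -1, 1 << k)) % (1 << k)


def montgomery_multiply(a, b, n, n_coeff, k):
    """Return a * b * 2^(-k) mod n, for 0 <= a, b < n and odd n < 2^k."""
    mask = (1 << k) - 1

    # "Square" stage: full double-width product AB = A * B.
    ab = a * b

    # "Triangle" stage: Q = LSB(AB) * N_COEFF, keeping only the lower half.
    q = ((ab & mask) * n_coeff) & mask

    # "Rectangle" stage: M = Q * N, the multiple of the modulus that
    # zeroizes the lower half of AB when added to it.
    m = q * n

    # Final stage: add AB and M, then reduce by right-shift.
    s = (ab + m) >> k

    # Assumption of this sketch: one conditional subtraction brings the
    # result back into [0, n).
    return s - n if s >= n else s


def from_montgomery(x_mont, n, n_coeff, k):
    """Leave the Montgomery domain by multiplying with B forced to 1."""
    return montgomery_multiply(x_mont, 1, n, n_coeff, k)
```

Forcing the B input to 1, as described in commit 8ee5a19, corresponds to
from_montgomery() here: multiplying by 1 removes the extra factor of 2^k that
operands carry while in the Montgomery domain.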

The 92 revisions listed above as "new" are entirely new to this
repository and will be described in separate emails.  The revisions
listed as "add" were already present in the repository and have only
been added to this reference.
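
Commits 1e33032, 3213b3e and 3633901 above mention the operations needed for
the Garner's formula part of a CRT exponentiation: a modular subtraction, a
regular (not modular) multiplication and a final "uneven" regular addition.
The Python sketch below shows how those pieces might fit together. It is an
illustration under assumptions, not the core's microcode: all function and
variable names are hypothetical, and the reduction steps use Python's %
operator where the hardware would use its modular multiplier.

```python
def mod_sub(a, b, n):
    """(a - b) mod n for 0 <= a, b < n.

    Both candidates (a - b) and (a - b + n) exist conceptually; the sign of
    the difference (the final borrow) selects which one is the result.
    """
    d = a - b
    return d + n if d < 0 else d


def garner_recombine(m_p, m_q, p, q, q_inv):
    """Combine the two half-size results m_p = x mod p and m_q = x mod q.

    q_inv is assumed to be q^(-1) mod p.
    """
    # Modular subtraction and scaling by q_inv, both modulo p.
    h = (q_inv * mod_sub(m_p, m_q % p, p)) % p

    # Regular (not modular) multiplication: the product is full-size.
    hq = h * q

    # "Uneven" regular addition: full-size product plus half-size m_q.
    return hq + m_q
```

With toy parameters p = 61, q = 53 and private exponent d = 2753, calling
garner_recombine(pow(c, d % (p - 1), p), pow(c, d % (q - 1), q), p, q,
pow(q, -1, p)) gives the same value as pow(c, d, p * q) for any c in
[0, p * q).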



