[Cryptech-Commits] [core/math/modexpng] branch master created (now 465cdf0)
git at cryptech.is
Sat Mar 14 18:18:39 UTC 2020
This is an automated email from the git hooks/post-receive script.
paul at psgd.org pushed a change to branch master
in repository core/math/modexpng.
at 465cdf0 Moved modexpng from user/shatov to core/math.
This branch includes the following new commits:
new 6194d48 Randomized test vector generation scripts for ModExpNG.
new ca52eb0 Don't track the test vector itself.
new c01d11d Instructions on how to use the vector generation scripts.
new eb859f8 Added blinding support to test vector generation scripts.
new f1a8087 Updated readme file.
new 3ef1813 ModExpNG ("Next Generation") math model.
new ecbc1b7 Added blinding into math model.
new c30c0bd Mutate blinding tuple.
new 701e3f1 Added optional output of intermediate quantities for debugging. Reworked index rotation code for better readability.
new 711ffbd Simplified index calculation and accumulator clearing logic. Better debug printout of accumulators.
new b5a8b52 * more debugging output * more precise modelling of DSP slice
new 766bb93 Rewrote "square" recombination to match how it works in hardware.
new a105c87 Same changes for "triangle" multiplication phase as for the "square" one (debugging output, simpler MAC clearing and index rotation logic).
new ac6bc69 Cosmetic fixes.
new aaf45e2 Removed some boilerplate code, all the three multiplication flavours are now working consistently. Still need to rework recombination routines.
new e79b4bb Fixed 4096-bit test vector generation.
new 345be75 Intermediate version to fix recombination overflow bug.
new c165ddc * Added more debugging options: - intentionally trigger internal overflow handler - dump MAC inputs - dump intermediate numbers during the reduction phase
new 66be583 * Started conversion of the model to use micro-operations
new a5200cd * Added more micro-operations
new b0fb263 * MASSIVE CLEANUP
new 0beee22 * More cleanup (got rid of .wide. and .narrow.)
new ec07464 Moved to "modexpng_fpga_model" repo, this one was meant for Verilog.
new 29fb6af Started working on the pipelined Montgomery modular multiplier. Currently can do the "square" part of the multiplication, i.e. compute the twice larger intermediate product AB = A * B.
new 9e9689d Further work on the Montgomery modular multiplier. Can now do the "triangular" part of multiplication, i.e. compute the "magic" reduction coefficient Q = LSB(AB) * N_COEFF.
new ecf0374 Further work on the Montgomery modular multiplier. Added the third "rectangular" stage of the multiplication process, i.e. computation of how many copies of the modulus N to add to the intermediate product AB to zeroize the lower half: M = Q * N.
new 3ea94c8 Implemented the final stage of the Montgomery modular multiplication, i.e. addition of AB and M then reduction by right-shift.
new fde62e3 Major rewrite (different core hierarchy, buses, wrappers, etc).
new 71f7025 Redesigned core architecture, unified bank structure. All storage blocks now have eight 4kbit entries and occupy one 36K BRAM tile.
new 0b4b42d Redesigned storage modules, added top-level module, added I/O storage space.
new affada8 Reworked storage architecture (moved I/O memory to a separate module, since there's only one instance of input/output values, while storage manager has dual storage space for P and Q multipliers).
new 8ee5a19 Expanded micro-operation parameters (added dedicated control bit to force the B input of the modular multiplier to 1, this is necessary to bring numbers out of Montgomery domain).
new e340b14 Added more micro-operations, also added "general worker" module. The worker is basically a block memory data mover, but it can also do some supporting operations required for the Garner's formula part of the exponentiation.
new 0224778 Added more micro-operations, entire Montgomery exponentiation ladder works now.
new 1e33032 Refactored general worker module. Added modular subtraction micro-operation.
new 3213b3e Added "MERGE_LH" micro-operation. To be able to do Garner's formula we need regular (not modular) multiplication. We're doing this by telling the modular multiplier to stop after the "square" step, which computes A*B. The problem is that the multiplier stores the lower part of the product in the internal bank L and the upper part in the internal bank H, but we need to be able to do operations on the product as a whole. MERGE_LH that combines the two halves of the product [...]
new 3633901 Added the regular (not modular) addition operation required during the final step of the Garner's formula algorithm. Note that the addition is "uneven" in the sense that the first operand is full-size (as wide as the modulus), while the second one is only half the size. The adder internally banks the second input port during the second half of the addition.
new 9eac252 Entire CRT signature algorithm works by now.
new 72902f5 Redesigned the testbench. The core clock no longer needs to be twice as fast as the bus clock. It can be the same, or, say, four times as fast.
new 69b5d9f Added support for non-CRT mode. Further refactoring.
new 584393a Further work: - added core wrapper - fixed module resets across entire core (all the resets are now consistently active-low) - continued refactoring
new edd5efd Reworked testbench, clk_sys and clk_core can now have any ratio, not necessarily 1:2.
new 57d250b Fixed all the testbenches to work with the latest RTL sources.
new 657678a Added simulation-only code to measure multiplier load.
new 88f46be Fixed port width mismatch warning.
new d50bb60 Added readme file.
new 0f111bf Added demo driver code for STM32.
new 6d2cdbf Added missing copyright headers.
new 65bf054 Beautified the README.md, should look somewhat less nasty now.
new f4771a7 The uOP engine didn't compile at 180 MHz. The pipeline had two stages: FETCH and DECODE. Apparently one clock cycle is not enough to entirely decode an instruction, so decoding now takes two clock cycles (DECODE_1 and DECODE_2). This seems to solve the problem. If we run into more timing violations here, we can add an extra DECODE_3 cycle and register the currently combinatorial uop_opcode_* flags at DECODE_2. This fix increases the core's latency by 59/32 clock cycles ( [...]
new 863cac9 Refactored reductor module.
new 189dfd6 Removed the latch accidentally created while pipelining the uOP engine module. The FSM previously had four states encoded using two bits, so the next state logic didn't have a default case, since all the possible states were used. Addition of the fifth state required one more state bit, so the FSM now has five states out of eight possible and a default case is thus necessary.
new 157d5de Small change to the reductor module to try to get past 180 MHz. Previously BRAM outputs were going directly into a LUT-based ternary adder which was causing timing problems. Added a layer of flip-flops, so instead of BRAM -> LUT -> FF we have BRAM -> FF -> LUT -> FF. This increases core latency by (number_of_supporting_modular_multiplications + number_of_exponent_bits) ticks.
new 59a54d6 Forgot to push minor cosmetic fix.
new 1d0458f Cosmetic fix.
new 6791175 One more cosmetic fix.
new 83f8779 Had to rework the general worker module to reach 180 MHz core clock. The module is responsible for doing certain supporting operations (mostly moving operands between banks and doing some simple math operations, such as modular subtraction and regular addition). Depending on the particular operation, one of three bank address space sweep patterns was used: * one-pass (for things like carry propagation) * two-pass (for things like modular subtraction that produce interm [...]
new 6a0438e Turns out, fabric addition and subtraction in the general worker module are actually in the critical paths of the ModExpNG core and are plaguing the place and route tools. I was barely able to achieve timing closure at 180 MHz even with the highest Map and PaR effort levels. This means that any further clock frequency increase is effectively impossible; moreover, any small change in the design may prevent it from meeting timing constraints. The obvious solution is to use DS [...]
new e5f4454 Reworked modular subtraction micro-operation. Previously it used "two-pass" bank address space sweep, during the first pass (a-b) and (a-b+n) were computed, during the second pass either the former or the latter quantity was written to the output bank (depending on the very last borrow flag value). This is no longer possible, since the FSM now only generates one "interleaved" address space sweep. The solution is to split one complex modular subtraction operation into sim [...]
new ab061af This commit modifies the REGULAR_ADD_UNEVEN micro-operation to use DSP slices for addition instead of fabric logic. This opcode is only necessary when in CRT mode and is executed once per entire exponentiation to recombine the two "easier" exponentiations. This was the final change necessary to get rid of using fabric math in the general worker module.
new b985d45 * DSP slices now have two use modes: MULT and ADD/SUB * cosmetic rename of Verilog include file
new 147dcd3 Removed old DSP wrappers.
new a1314f3 Added two pairs of new wrappers.
new b585d25 Cosmetic fix that only involves debug output during simulation.
new b590661 Updated microcode source to match the changes made to general worker module.
new b8c0536 Updated uOP engine to match the changes made to the general worker module (modular subtraction was split into three micro-operations instead of one).
new 6492883 For the new general worker module to work we need dynamic switching of DSP OPMODE, ALUMODE and CARRYINSEL ports, thus more defined constants.
new 8f7829c Tiny cosmetic typo fix ("dst" -> "dsp")
new 2345e42 Added more meaningful constants to avoid certain hardcoded numbers.
new c551625 Refactored modular reductor module.
new 04bc457 Renumbered micro-operations.
new 76f89d6 The I/O manager has to work in sync with the general worker module. Made the necessary changes to make it work after the general worker update. Also moved debug simulation-time code into a separate file.
new e68e11a Update DSP wrapper instance names.
new c6029af Refactored MMM recombinator module, accommodated the changes in DSP slice wrapper names.
new b0cbd33 Refactored the MMM module, now uses meaningful constant names from the include file, not hardcoded widths.
new c4bee71 Cosmetic change to easily switch tests on/off.
new 97d5b79 Bump version number.
new fa360f8 Cosmetic rename of FSM states.
new 6b335ce * more consistent port names * optional two-stage pipeline for A&B ports
new 6093e06 Accommodate the changes to DSP slice wrappers.
new d2ae99a This commit accommodates the changes made to DSP slice wrappers and also fixes the subtle math overflow bug introduced while switching to DSP-based partial multiplication product recombination.
new 9217682 Uniform testbenches.
new bc4eb54 Better handling of debug output (no need to manually adjust word count anymore).
new c5d2101 Updated DSP slice wrappers for the new partial product recombination algorithm: - unified clock enable for A:B and C ports - A:B and C ports now always have fixed 1-cycle latency - added new Z multiplexor modes in the generic wrapper
new 2500312 Added new DSP slice OPMODEs for the new recombination algorithm.
new 58ad4d9 Adapted to the changes in the DSP slice wrappers.
new f2598e0 Improved debugging options: * flush console after each ladder iteration for smoother progress output * ability to truncate internal powering ladder loop at desired step (this will only work when using simulation mode, obviously)
new 77d1148 New partial product recombination algorithm.
new 13abd96 Increment version number.
new 2791a17 More elegant way to do partial product recombination: * take advantage of the cascade paths between DSP slices * decrease latency of operation
new d35ff84 Update STM32 demo driver.
new 465cdf0 Moved modexpng from user/shatov to core/math.
The 92 revisions listed above as "new" are entirely new to this
repository and will be described in separate emails. The revisions
listed as "add" were already present in the repository and have only
been added to this reference.
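
For readers unfamiliar with the multiplier stages named in commits 29fb6af, 9e9689d, ecf0374 and 3ea94c8, the following is a minimal whole-integer sketch in Python of the four steps as those commit messages describe them (AB = A * B, Q = LSB(AB) * N_COEFF, M = Q * N, then addition and right-shift). It is not taken from the repository: the function names, the parameter k (operand width in bits), the definition N_COEFF = -N^(-1) mod 2^k and the final conditional subtraction are assumptions based on the textbook Montgomery algorithm, and the word-level pipelining done in hardware is not modelled. The last lines also illustrate the remark in commit 8ee5a19 that multiplying by B = 1 brings a number out of the Montgomery domain.

```python
def montgomery_n_coeff(n, k):
    """Hypothetical helper: N_COEFF = -N^(-1) mod 2^k (n must be odd)."""
    r = 1 << k
    return (-pow(n, -1, r)) % r  # pow() with exponent -1 needs Python 3.8+


def montgomery_multiply(a, b, n, n_coeff, k):
    """Return a * b * 2^(-k) mod n, for odd n < 2^k and a, b < n."""
    mask = (1 << k) - 1
    ab = a * b                          # "square" stage: double-width product AB
    q = ((ab & mask) * n_coeff) & mask  # "triangular" stage: only the low half of Q is kept
    m = q * n                           # "rectangular" stage: M = Q * N
    r = (ab + m) >> k                   # final stage: low half of AB + M is zero, so shift it out
    # Assumed textbook final reduction; the core's reductor may work differently.
    return r - n if r >= n else r


if __name__ == "__main__":
    k, n = 16, 0xF123                   # toy 16-bit odd modulus, illustration only
    n_coeff = montgomery_n_coeff(n, k)
    a, b = 1234, 5678
    a_mont = (a << k) % n               # move operands into the Montgomery domain
    b_mont = (b << k) % n
    p_mont = montgomery_multiply(a_mont, b_mont, n, n_coeff, k)
    p = montgomery_multiply(p_mont, 1, n, n_coeff, k)  # B forced to 1 leaves the domain
    assert p == (a * b) % n
```

The sketch works on Python integers of arbitrary size, so it says nothing about the bank layout, DSP slice usage or latency discussed in the later commits.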