[Cryptech Tech] EIM: mostly working

Bernd Paysan bernd at net2o.de
Thu Nov 6 22:53:43 UTC 2014


On Thursday, 6 November 2014, 16:55:53, Paul Selkirk wrote:
> On 11/06/2014 03:27 AM, Joachim Strömbergson wrote:
> > But do you get timing closure for these cores @133 MHz? I would have
> > expected them to not be able to be clocked that high in the Spartan-6.
> > What does the timing report say?
> > 
> > (1) Ensure that all important timings are met. If not, signals will not
> > have the correct, expected values. Can you extract the final timing
> > report?
> 
> I've attached build logs for both the 133MHz and 25MHz clocks. As
> expected, there are timing differences.
> 
> Even clocking the coretest circuit at 25MHz, there are still reported
> timing errors around the EIM code. Bunnie says:
>    ///// It's tricky to change code that relates to
>    ///// the details of talking to the i.MX6 EIM -- it's difficult to
>    ///// close timing, so you'll need to understand a bit about FPGA timing
>    ///// closure to make chages to that section
> 
> And I don't understand FPGA timing closure.
> 
> > (2) Ensure that the 16-32 bit and big endian-little endian conversion
> > works as expected. How about adding a 32-bit register, write to it from
> > EIM and then extract it via I2C. Since you have I2C access working you
> > should be able to see if the conversion works as expected. That is, we
> > end up with the expected values in the FPGA.
> > 
> > (3) Properly debug the EIM interface and the SW interface including
> > setup with something simple. Possibly a single hash core or something
> > even simpler. An adder, multiplier etc with extra ports for debugging.
> > Then we add one of the hash cores and then move towards the complete
> > shebang.
> 
> Given that all 3 of the hash cores produce correct digests for single
> and double block tests most of the time, I'm pretty confident in both of
> these things. But I think we're riding the edge of our timing
> tolerances, and it's biting us most often in the 1000-block test (while
> the other tests hash 18 blocks in total).

SHA-1 and the related SHA-2 have in total 4 dependent 32/64 bit additions per 
iteration (SHA-2-512 has 64 bit adders). On average, with random inputs and 
outputs, the likelihood of a carry propagating through n bits is 1/2^n for 
every bit added.  That means short tests are more likely to run fine at higher 
speed than longer tests, and the chance that you actually get a 64 bit carry 
propagation is very low, unless the designer of the test has made the effort 
to find the rare case where this is actually needed.
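
To make the carry argument concrete, here is a minimal Python sketch (not 
part of the Cryptech code; the function names and the definition of "chain 
length" are my own assumptions).  It counts how many adjacent bit positions a 
single carry ripples through in an addition, and then looks at the worst case 
seen over a number of random additions:

```python
import random

def longest_carry_chain(a: int, b: int, width: int = 64) -> int:
    """Longest run of adjacent bit positions that one carry ripples through."""
    longest = run = 0
    for i in range(width):
        a_bit = (a >> i) & 1
        b_bit = (b >> i) & 1
        if a_bit & b_bit:               # generate: a new carry starts here
            run = 1
        elif (a_bit ^ b_bit) and run:   # propagate: the incoming carry moves on
            run += 1
        else:                           # kill, or no carry to propagate
            run = 0
        longest = max(longest, run)
    return longest

def worst_chain(n_adds: int, seed: int = 0, width: int = 64) -> int:
    """Longest carry chain seen over n_adds random additions (fixed seed)."""
    rng = random.Random(seed)
    worst = 0
    for _ in range(n_adds):
        a = rng.getrandbits(width)
        b = rng.getrandbits(width)
        worst = max(worst, longest_carry_chain(a, b, width))
    return worst

if __name__ == "__main__":
    for n in (18, 1000, 10000):
        print(n, "additions: longest chain seen =", worst_chain(n))
```

With the same seed, a longer run can only see an equal or longer worst-case 
chain than a shorter one, which is the effect described above.  A deliberately 
constructed input such as 1 + (2^64 - 1) exercises the full chain.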

There's a good reason why Keccak chose not to use adders as non-linear 
elements: it avoids that delay in a hardware implementation.  The AND-based 
non-linear element in Keccak has a constant and short gate delay.  If you 
look at other finalists like SKEIN or BLAKE, they all have 64 bit adders in 
them, which is fine if you do it on a 64 bit CPU, but not so good if you do 
it in hardware.
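
For comparison, this is roughly what Keccak's non-linear step (chi) looks 
like, sketched in Python for a single row of five 64 bit lanes.  It is only 
an illustration of the gate depth, not a Keccak implementation; the function 
name is mine.  Every output bit is one NOT, one AND and one XOR of three 
input bits at the same bit position, so nothing ripples across the word:

```python
MASK64 = (1 << 64) - 1

def chi_row(row):
    """Apply the chi step to one row of five 64-bit lanes."""
    return [
        row[x] ^ (~row[(x + 1) % 5] & row[(x + 2) % 5] & MASK64)
        for x in range(5)
    ]

if __name__ == "__main__":
    row = [0x0123456789ABCDEF, 0, MASK64, 1, 1 << 63]
    flipped = list(row)
    flipped[2] ^= 1 << 63            # flip only the top bit of one lane
    before, after = chi_row(row), chi_row(flipped)
    # Only bit 63 of the outputs can differ: bit positions are independent.
    print([hex(p ^ q) for p, q in zip(before, after)])
```

An adder has no such property: its top output bit can depend on every lower 
input bit through the carry.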

The important result of the timing analysis is:

Timing constraint: Default period analysis for Clock 'EIM_BCLK'
  Clock period: 11.468ns (frequency: 87.201MHz)

The critical path is through the carry chain of a 64 bit adder in the sha512 
block.  So SHA-1 is probably fine, but SHA-2-512 isn't.  This is limited by 
the 63 carries the signal has to propagate through; the carries are fast, and 
the placement is good, so the only thing you can do about this is to make 
sha512 take 2 cycles per add, putting it on a tick/tock clock which is 
disabled every second cycle, but still fully synchronous to the rest of the 
chip.
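
The arithmetic behind the two-cycle suggestion, as a small Python sketch.  
It only looks at the reported 11.468ns adder path and ignores setup time, 
clock uncertainty and all other paths (including the EIM ones), so treat it 
as a back-of-the-envelope check; the function names are mine:

```python
CRITICAL_PATH_NS = 11.468   # clock period from the timing report above

def period_ns(freq_mhz: float) -> float:
    """Clock period in ns for a frequency in MHz."""
    return 1000.0 / freq_mhz

def fmax_mhz(path_ns: float) -> float:
    """Highest clock frequency in MHz that a path delay in ns allows."""
    return 1000.0 / path_ns

def adder_meets_timing(freq_mhz: float, cycles_per_add: int = 1) -> bool:
    """True if the adder path fits into cycles_per_add clock periods."""
    return CRITICAL_PATH_NS <= cycles_per_add * period_ns(freq_mhz)

if __name__ == "__main__":
    print("fmax for the adder path: %.1f MHz" % fmax_mhz(CRITICAL_PATH_NS))
    for freq, cycles in ((25, 1), (133, 1), (133, 2)):
        print("%3d MHz, %d cycle(s) per add: %s"
              % (freq, cycles, adder_meets_timing(freq, cycles)))
```

At 133MHz one period is about 7.5ns, so the add does not fit in one cycle, 
but two periods (about 15ns) leave room for it.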

The next problem seems to be addressing; there's a significant routing delay.  
You can probably solve that by providing local address latches, which make a 
read take two cycles, with the global propagation delay in the first cycle, 
and the decode and mux delay in the second cycle.
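
A behavioural sketch of the local address latch idea in Python, one call per 
clock edge.  The class, the register map and the address 0x10 are made up 
for illustration and have nothing to do with the actual coretest address 
map.  The point is only that the address is captured locally in one cycle 
and the decode/mux works from that local copy in the next:

```python
class LatchedReadPort:
    """Read port with a local address latch: a read takes two clock cycles."""

    def __init__(self, registers):
        self.registers = dict(registers)   # hypothetical address -> value map
        self.addr_latch = None             # local copy of the global address

    def tick(self, addr_in):
        """One clock edge.

        Returns the data decoded from the address latched on the previous
        edge (None if nothing was latched yet), then latches addr_in.
        """
        data_out = None
        if self.addr_latch is not None:
            # Cycle 2: decode and mux use only the local latch.
            data_out = self.registers.get(self.addr_latch)
        # Cycle 1: the global address only has to reach the latch.
        self.addr_latch = addr_in
        return data_out

if __name__ == "__main__":
    port = LatchedReadPort({0x10: 0xDEADBEEF})
    print(port.tick(0x10))        # first cycle: address still on its way
    print(hex(port.tick(0x10)))   # second cycle: data is available
```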

Another timing problem is in the synchronous reset; here again the routing 
delay is the culprit.  That's one reason why I recommend async reset and 
delayed clock enable even on an FPGA (the more important reason is that you 
can use the same code on an ASIC, where async reset is more commonly used 
than sync reset).  You can take that timing problem out completely by having 
a small down-counter from reset which waits for n cycles (4, 8) until it 
allows the clock to pass.
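
A small Python model of that down-counter, again one call per clock edge.  
Python can't express a truly asynchronous reset, so here reset is simply 
sampled at each edge; the class name and the default of 4 cycles are 
assumptions for the sketch:

```python
class ResetSequencer:
    """Down-counter that holds clock enable low for a few cycles after reset."""

    def __init__(self, hold_cycles: int = 4):
        self.hold_cycles = hold_cycles
        self.count = hold_cycles          # start as if just reset

    def tick(self, reset: bool) -> bool:
        """One clock edge; returns True once the clock may pass."""
        if reset:
            self.count = self.hold_cycles  # reload while reset is active
        elif self.count > 0:
            self.count -= 1                # count down after reset is released
        return self.count == 0

if __name__ == "__main__":
    seq = ResetSequencer(hold_cycles=4)
    seq.tick(reset=True)
    print([seq.tick(reset=False) for _ in range(6)])
```

The reset itself then no longer has to reach every flip-flop within one 
clock period; it only has to have settled everywhere before the counter 
reaches zero.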

-- 
Bernd Paysan
"If you want it done right, you have to do it yourself"
http://bernd-paysan.de/


