[Cryptech Tech] EIM: mostly working
Paul Selkirk
paul at psgd.org
Wed Nov 5 18:21:47 UTC 2014
After entirely too much time and forehead-blood, I have coretest mostly
working with EIM on my Novena, and committed to the novena_eim repo.
What follows is a brain dump of what I know about EIM, things I've
tried, and a call for help.
"Mostly working" means it passes most of the test cases most of the
time. It fails the 1000-block sha256 test case 75% of the time, and very
occasionally fails one of the other test cases. (Of course, unless it
works 100.0% of the time, it's completely broken, so "mostly working" is
a hand-wave.) The usual failure mode is that it returns a random digest,
not just a few bits or bytes that might indicate an error reading the
digest from the core.
A note about timing: Mostly I've been clocking the coretest cores at
133MHz, because that's what the rest of novena_fpga uses. At that speed,
it consistently passes the sha1 tests, but consistently fails all the
sha256 and sha512 tests, again with random digests. It's as if it was
hitting the 'next' control rather than the 'init'. When I dialed
coretest back to 50MHz or even 25MHz (leaving the EIM circuit at
133MHz), it started passing the sha256 and sha512 one-block and
two-block tests (mostly), but still failed the sha256 1000-block tests.
(An interesting aside - while the 1000-block test produces the wrong
result, it does so very quickly, in 7ms. By contrast, on i2c clocked at
the same 25MHz, it takes 7000ms. The difference is entirely the
communications channel. In both cases, we write the block data once, hit
'init', then repeatedly poll the 'ready' status line and hit the 'next'
control line. With i2c and uart, we're writing a 5-byte 'read' command
for status, reading a 9-byte response, writing a 9-byte 'write' command
for ctrl, and reading a 5-byte response, so 28 bytes transferred over a
serial line for each iteration. With eim, we're reading 4 bytes of
status, and writing 4 bytes of ctrl, so 8 bytes over a memory-mapped
channel.)
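The EIM side of that loop can be sketched roughly as below. This is not the code in the novena_eim repo. The register struct, the helper names, and the byte counter are hypothetical, and the 'init' and 'ready' bit positions are assumptions (only 'next' as bit 1 comes from this message). The counter is there to make the 8-bytes-per-iteration figure concrete.

```c
#include <stddef.h>
#include <stdint.h>

/* 'next' as bit 1 is taken from this message; 'init' and 'ready' as
 * bit 0 are assumptions made for the sketch. */
#define CTRL_INIT    (1u << 0)
#define CTRL_NEXT    (1u << 1)
#define STATUS_READY (1u << 0)

/* Hypothetical register pair; in a real driver these would be
 * locations inside the memory-mapped EIM window. */
struct hash_regs {
    volatile uint32_t ctrl;
    volatile uint32_t status;
};

static size_t bus_bytes;   /* running count of bytes moved over the bus */

static uint32_t reg_read(volatile uint32_t *reg)
{
    bus_bytes += sizeof(uint32_t);
    return *reg;
}

static void reg_write(volatile uint32_t *reg, uint32_t value)
{
    bus_bytes += sizeof(uint32_t);
    *reg = value;
}

/* Poll the status register until the ready bit is set. */
static void wait_ready(struct hash_regs *regs)
{
    while (!(reg_read(&regs->status) & STATUS_READY))
        ;
}

/* Hash the already-written block nblocks times (nblocks >= 1):
 * one 'init', then poll 'ready' and hit 'next' for each further block. */
static void hash_repeat(struct hash_regs *regs, unsigned nblocks)
{
    reg_write(&regs->ctrl, CTRL_INIT);
    for (unsigned i = 1; i < nblocks; i++) {
        wait_ready(regs);
        reg_write(&regs->ctrl, CTRL_NEXT);
    }
    wait_ready(regs);
}
```

Run against a stub whose status is always ready, hash_repeat(&regs, 1000) moves 8000 bytes in total: one 4-byte control write and one 4-byte status read per block.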
Back to the matter at hand. EIM uses a memory-mapped interface to talk
to the FPGA. Conceptually simple enough, but it requires some serious
rubber-chicken voodoo to set up from the driver - setup_fpga() is an
incantation with 60 calls to poke magic values in magic addresses. It's
heavily commented, but not enough for me to actually understand it. So I
have to take it as a matter of faith that it's right, but I'm not
convinced. Writing individual 2- or 4-byte values to mapped memory
works, but memcpy() seems to stutter, repeating values.
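Since individual writes behave, one possible workaround is to replace memcpy() with an explicit loop of 16-bit stores. The sketch below is only that: eim_copy16 is a hypothetical helper, not something in the driver, and it does not explain why memcpy() stutters (burst or wide accesses generated by the library routine would be one guess, unconfirmed).

```c
#include <stddef.h>
#include <stdint.h>

/* Copy nwords 16-bit words into a memory-mapped region, one store at
 * a time. The volatile qualifier keeps the compiler from merging or
 * widening the stores. Hypothetical helper, for illustration only. */
static void eim_copy16(volatile uint16_t *dst, const uint16_t *src,
                       size_t nwords)
{
    for (size_t i = 0; i < nwords; i++)
        dst[i] = src[i];
}
```

The destination would be a pointer into the mapped EIM window; the source can be any ordinary buffer.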
EIM reads and writes are 16-bit, little-endian. This makes for a
slightly awkward fit with Joachim's coretest framework, which uses
32-bit transfers. I *think* I've got that working correctly, otherwise
none of the test cases would work, but I might be missing something subtle.
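To make the 16/32-bit mismatch concrete, here is a minimal sketch of moving one 32-bit coretest word as two 16-bit EIM accesses. The helper names are hypothetical, and putting the low half at the lower address is an assumption that matches little-endian ordering; the actual glue logic may split words differently.

```c
#include <stdint.h>

/* Write a 32-bit value as two 16-bit stores, low half first
 * (assumed ordering, consistent with little-endian). */
static void eim_write32(volatile uint16_t *base, uint32_t value)
{
    base[0] = (uint16_t)(value & 0xffffu);
    base[1] = (uint16_t)(value >> 16);
}

/* Read two 16-bit words and reassemble the 32-bit value. */
static uint32_t eim_read32(const volatile uint16_t *base)
{
    uint32_t lo = base[0];
    uint32_t hi = base[1];
    return lo | (hi << 16);
}
```

Under that assumption, writing 0x12345678 stores 0x5678 in the first word and 0x1234 in the second, and reading the pair back returns the original value.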
For reasons I don't understand, Bunnie muxes the EIM address and data
lines, and has an 'LBA' (latch address) line to distinguish between
them. He defines 'reg_wo' storage registers, but he doesn't actually
look at LBA there, so he ends up storing the address and then the data.
This is arguably okay for pure storage, but one of our magic memory
addresses is actually a control register, which triggers the hashing
operation, and the address will always have bit 1 set, which corresponds
to the 'next' control. Once I sussed that out, I was surprised that it
continued to produce random digests.
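The LBA problem can be modelled in a few lines of C. This is a software model, not Bunnie's Verilog: the bus_phase struct and count_next are hypothetical, 'init' as bit 0 is an assumption, and the is_address flag stands in for the LBA line without making any claim about its polarity.

```c
#include <stddef.h>
#include <stdint.h>

#define CTRL_INIT (1u << 0)   /* assumed bit position */
#define CTRL_NEXT (1u << 1)   /* bit 1, per this message */

/* One phase on the muxed address/data bus. */
struct bus_phase {
    int      is_address;      /* nonzero during the address phase */
    uint16_t value;           /* what is on the muxed lines */
};

/* Count the phases in which a control register fed from the muxed bus
 * would see the 'next' bit. With check_lba == 0 the register stores
 * every phase, address included; with check_lba != 0 it ignores
 * address phases. */
static unsigned count_next(const struct bus_phase *phases, size_t n,
                           int check_lba)
{
    unsigned hits = 0;
    for (size_t i = 0; i < n; i++) {
        if (check_lba && phases[i].is_address)
            continue;
        if (phases[i].value & CTRL_NEXT)
            hits++;
    }
    return hits;
}
```

For a write of CTRL_INIT to an address that has bit 1 set, the unchecked register sees one spurious 'next' and the checked one sees none.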
Wondering if the address bits might be lingering (he pipelines the
inputs "to allow for s/h closure", so I have to make sure I'm getting
the right version of the signal), or if the data lines might be unstable
(the Freescale EIM users guide has traces showing data transitioning
after chip-select is asserted), I added a logging circuit to store the
value and cycle count for the last couple writes. Short answer: CS is
held for 6 cycles, and data doesn't change during that time.
Wondering if 6 cycles might be a bit too long, I used Bunnie's
rising_edge module to strobe the CS line for one cycle, but that
produced garbled reads, and at least once failed to trigger 'init' (the
sha1 two-block test returned the sha1 one-block digest). Ganging two
rising_edge circuits to produce a 2-cycle CS didn't help. Which is odd,
because my reading of coretest.v (used in the UART and I2C versions) is
that it asserts CS for 2 cycles.
Yesterday's commit contains a pre-built bitfile. If you want to build it
yourself, I've included a Makefile (for Rob) and a .xise file (for
Joachim). Be warned that it takes longer to build than the i2c versions,
and it's not predictable. Command-line builds generally take 15-20
minutes on my machine, but can go to 30 minutes, and I even had one run
that took an hour. ISE builds tend to be a bit longer, and I had one run
that took 2 hours. Seemingly trivial changes can lead to dramatically
different run times. Most of the time is spent in PAR (place and route).
Joachim may have some ideas why.
Anyway, I'd appreciate any ideas about what might be wrong, or things to
try.
paul