[Cryptech Core] profiling

Joachim Strömbergson joachim at assured.se
Thu Dec 6 13:07:07 UTC 2018


Aloha!

Thank you for a very good writeup of profiling, and very interesting
results.

Looking at 4.1, it seems that we really should look at ways of improving
performance for the functions in libtfm. I've looked at the libtfm code,
and it seems not to use the FPU at all. Also, as far as I can tell, we
never actually enable the FPU (which one needs to do).
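
(For reference, a minimal sketch of enabling the FPU on the Cortex-M4,
using the standard CMSIS idiom: grant full access to coprocessors CP10
and CP11 in CPACR before any FP instruction executes.)

    #include "stm32f4xx.h"   /* CMSIS device header for the STM32F4 */

    static void fpu_enable(void)
    {
        SCB->CPACR |= (3UL << 20) | (3UL << 22); /* CP10/CP11 full access */
        __DSB();                                 /* ensure the write lands */
        __ISB();                                 /* flush the pipeline     */
    }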

Just counting cycles for the square root part of the code in
fp_sqrmod(), it seems to take 3x more cycles than the FPU instruction
VSQRT.F32 does (14 cycles).

We could also look at moving these operations into the FPGA. We would
get more bus accesses, but might be able to gain performance. Maybe.

Looking at ways of loading the KEK more intelligently seems like a
promising lead too. Especially if combined with ensuring that it is
never even loaded into the MCU, just the FPGA.

Regards,
JoachimS


On 2018-12-04 05:09, Paul Selkirk wrote:
> 0. Intro - what profiling is
> 
> Profiling has two pieces:
> 
> * Compiling with `-pg` automatically inserts entry logging into
> every function. This is used to a) count the number of times a function
> is called, and b) create a call tree - which function calls which other
> function(s). These counts, being deterministic, are pretty accurate.
> 
> * Program counter (instruction pointer) sampling. We have a SysTick
> interrupt that fires every millisecond. At that time, in addition to
> regular house-keeping, the profiler records which function it
> interrupted, and which function it was called from. The gross assumption
> is that the processor spends the whole millisecond in that function.
> This is clearly wrong on the micro-level, but, over the course of 1000
> RSA signatures, it's statistically useful.
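> 
> (A minimal sketch of the PC-sampling idea, with hypothetical names --
> the real code in sw/stm32/libraries/libprof also records the caller.
> On every tick, bump a histogram bucket covering the interrupted PC:)
> 
>     #include <stdint.h>
> 
>     #define TEXT_START 0x08000000u            /* assumed flash base */
>     #define BIN_SHIFT  4                      /* 16-byte address bins */
>     #define NBINS      (1u << 16)             /* sized for illustration */
> 
>     static uint16_t pc_histogram[NBINS];
> 
>     /* Called from the SysTick handler with the PC found in the stacked
>      * exception frame; attributes the whole 1 ms tick to that address. */
>     void profile_sample(uint32_t stacked_pc)
>     {
>         uint32_t bin = (stacked_pc - TEXT_START) >> BIN_SHIFT;
>         if (bin < NBINS)
>             pc_histogram[bin]++;
>     }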
> 
> 1. How to build
> 
> To build a firmware image with profiling, go to `sw/stm32`, and run
> `make distclean DO_PROFILING=1 hsm`. The `distclean` step clears out any
> previous object files, and ensures that the entire code base will be
> built with profiling enabled.
> 
> 2. How to run
> 
> To generate a profiling report on, say, RSA signing, you have to go
> through a number of steps, which are detailed in
> `sw/stm32/libraries/libprof/README.md`. Fortunately, Rob has automated
> all this into a script, so the complete invocation would look something
> like this:
> 
> `./libraries/libprof/profile-runner.py --gprof keywrap.gprof --hsm-elf
> ./projects/hsm/hsm.elf --openocd-config
> /usr/share/openocd/scripts/board/stm32f4discovery.cfg python
> ../libhal/tests/parallel-signatures.py -c 1`
> 
> This starts the profiler, generates 1000 RSA signatures, stops the
> profiler, and writes the profiling report to `keywrap.gprof`.
> 
> Because of the statistical nature of the PC sampling, it's useful to do
> several runs in a row, and compare the results for consistency. So the
> invocation would look like:
> 
> `for i in {0..3}; do <profile-runner as above, with --gprof
> foo.$i.gprof>; done`
> 
> Note: The first time a key is used, there are some pre-calc and blinding
> factor computations, which take *much* longer than subsequent uses (e.g.
> 1.09s vs 0.16s). For that reason, I always start out with a priming run
> before profiling, e.g. `../libhal/tests/parallel-signatures.py -c 1 -i 1`
> 
> 3. How to read a profiling report
> 
> The first section ("Flat profile") is from the PC sampler; it shows how
> much time was spent in each function.
> 
> First we have something like this:
>  40.38     96.32    96.32                             _mcount_internal
>  19.02    141.69    45.37                             __gnu_mcount_nc
> 
> These two entries are the function-entry counting part of the profiler itself. The
> firmware contains a lot of short functions, especially in the bignum
> library, but we have to count them all. This is an example of how the
> act of profiling distorts the operation of the system, and there's just
> no getting around that.
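> 
> (For reference, the flat-profile columns are: percent of total time,
> cumulative seconds, self seconds, number of calls, and per-call times.
> The 141.69 on the second line is cumulative, i.e. 96.32s + 45.37s for
> the two mcount routines together.)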
> 
> Then we have something like this:
>   6.92    158.21    16.51 15832758     0.00     0.00  hal_io_read
>   5.20    170.62    12.41  1869115     0.00     0.00  memset
>   3.15    178.15     7.53   289469     0.00     0.00  s_fp_sub
> 
> This is where the program actually spends its time in RSA signing.
> 
> The second section ("Call graph") is based on the entry logging,
> augmented with estimated call times from the PC sampler.
> 
> To understand what the program is actually doing, we read down until we
> get to something like this:
> 
>                 0.00   93.26    1000/1000        hal_rpc_pkey_sign [5]
> [6]     39.1    0.00   93.26    1000         pkey_local_sign [6]
>                 0.00   48.39    1000/1000        hal_ks_fetch [7]
>                 0.00   44.86    1000/1000        pkey_local_sign_rsa[11]
> 
> The lines above the named function are its callers. The lines below are
> functions that it calls. Here we see that pkey_local_sign divides its
> time about evenly between hal_ks_fetch (fetch and unwrap the key) and
> pkey_local_sign_rsa (sign the message).
> 
> Drilling down, we find that hal_ks_fetch spends most of its time in
> hal_aes_keyunwrap, and pkey_local_sign_rsa spends most of its time in
> hal_rsa_decrypt. And so on down the line.
> 
> OTOH, to look at things from the bottom up, e.g. to understand why we
> spend so much time in hal_io_read, we find this:
> 
>                 0.00    0.00       4/15832758     hal_get_random [128]
>                 0.00    0.00    2256/15832758     hal_modexp2 [63]
>                 0.01    0.00    9018/15832758     hal_mkmif_read [70]
>                 0.07    0.02   64000/15832758     get_buffer [79]
>                 3.55    0.83 3399672/15832758     do_block [9]
>                12.89    3.02 12357808/15832758    hal_io_wait2 [15]
> [18]     8.5   16.51    3.86 15832758         hal_io_read [18]
>                 3.86    0.00 38464860/47514100    HAL_GPIO_ReadPin [37]
> 
> The collective time spent in hal_io_read is the "self" time (16.51s)
> plus the "children" time (3.86s). The amount of time due to hal_io_wait2
> is 12M calls out of 15M total, or 78%.
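> 
> (As a check on the apportioning: hal_io_wait2's share of hal_io_read's
> 16.51s self time is 16.51s x 12357808/15832758 = 12.89s -- exactly the
> figure on its caller line above.)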
> 
> It's important to note that these are just estimates, assuming that each
> call to hal_io_read (or each call to hal_io_wait2) takes the same amount
> of time. This will come back to bite us later.
> 
> 4. Findings
> 
> We will be looking at RSA signing in 3 configurations:
> - master (firmware built on the stm32 and libhal 'master' branch,
> bitstream from the September 7 cryptech-alpha-firmware Debian
> distribution tarball)
> - fmc_clk_60mhz (stm32 'fmc_clk_60mhz' branch, libhal 'master',
> bitstream from the Sept 7 cryptech-alpha-firmware_fmc-clk-60mhz Debian
> tarball)
> - keywrap (stm32 'master', libhal 'auto_magic', bitstream built with
> /user/js/keywrap 'auto_magic' branch)
> 
> First, to establish a baseline, here's the relative performance with
> unprofiled firmware. The numbers here are seconds per signature, as the
> median of a run of 5000 signatures.
> 
> master          0.172852
> fmc_clk_60mhz   0.152779
> keywrap         0.1555425
> 
> We can see that both fmc_clk_60mhz and keywrap are faster, but they get
> there in different ways (which can be combined, though I haven't tried
> it yet).
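> 
> (In relative terms, fmc_clk_60mhz shaves about 12% off the
> per-signature time, and keywrap about 10%.)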
> 
> Next I did 4 profiling runs of 1000 signatures each, although I'm only
> giving you one of each.
> 
> 4.1 master.gprof
> 
> As noted above, nearly 60% of the runtime is attributed to the profiling
> code itself. However, subtracting that 141.69s from the 238.55s runtime,
> we get a net runtime of 98.86s, or 0.0986s/signature. This is almost
> twice as fast as the unprofiled speed. So right away we see that a 1 kHz
> PC sampler can't give us an entirely accurate picture of what's going
> on. This is more like Plato's cave, and less like a crystal ball. That
> said, we can extract some meaningful information if we know what we're
> looking for.
> 
> First, recall that signing is roughly evenly divided between fetching
> the key and generating the signature.
> 
> [6]     39.1    0.00   93.26    1000         pkey_local_sign [6]
>                 0.00   48.39    1000/1000        hal_ks_fetch [7]
>                 0.00   44.86    1000/1000        pkey_local_sign_rsa[11]
> 
> We'll start by looking at hal_ks_fetch:
> 
> [7]     20.3    0.00   48.39    1000         hal_ks_fetch [7]
>                 1.09   46.95    1000/1000        hal_aes_keyunwrap [8]
>                 0.00    0.28    1000/1002        hal_mkm_get_kek [71]
> 
> [8]     20.1    1.09   46.95    1000         hal_aes_keyunwrap [8]
>                 1.94   44.93 1697220/1699836     do_block [9]
>                 0.00    0.07    1000/2007        memmove [76]
> 
> [9]     19.7    1.94   45.00 1699836         do_block [9]
>                 0.49   35.71 1699836/1708858     hal_io_wait [14]
>                 3.61    0.82 5099508/5642554     hal_io_write [36]
>                 3.55    0.83 3399672/15832758    hal_io_read [18]
> 
> So hal_ks_fetch is dominated by hal_aes_keyunwrap, which is almost
> entirely communication with the AES core. It's also by far the biggest
> contributor to time spent in hal_io_wait (and hal_io_wait is the biggest
> contributor to hal_io_read).
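> 
> (To make the traffic pattern concrete, each do_block call boils down to
> something like the following hypothetical sketch -- the register
> offsets and helper signatures are invented for illustration, not taken
> from libhal:)
> 
>     #include <stddef.h>
>     #include <stdint.h>
> 
>     /* invented register offsets and bus helpers, illustration only */
>     #define AES_ADDR_CTRL   0x08
>     #define AES_ADDR_STATUS 0x09
>     #define AES_ADDR_BLOCK  0x10
>     #define AES_ADDR_RESULT 0x20
>     #define CTRL_NEXT       0x02
>     #define STATUS_READY    0x01
> 
>     extern void     fmc_write(uint32_t addr, const uint8_t *buf, size_t len);
>     extern void     fmc_write_word(uint32_t addr, uint32_t w);
>     extern void     fmc_read(uint32_t addr, uint8_t *buf, size_t len);
>     extern uint32_t fmc_read_word(uint32_t addr);
> 
>     static void do_one_block(const uint8_t in[16], uint8_t out[16])
>     {
>         fmc_write(AES_ADDR_BLOCK, in, 16);         /* write input block */
>         fmc_write_word(AES_ADDR_CTRL, CTRL_NEXT);  /* kick the core     */
>         while (!(fmc_read_word(AES_ADDR_STATUS) & STATUS_READY))
>             ;                      /* poll until ready: the hal_io_wait/
>                                     * hal_io_read time seen above       */
>         fmc_read(AES_ADDR_RESULT, out, 16);        /* read result block */
>     }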
> 
> hal_mkm_get_kek scores low in time, because it comes down to a relative
> handful of calls to hal_io_write, hal_io_wait, and hal_io_read. Remember
> that gprof apportions each function's share of hal_io_wait time based on
> the number of times each function called hal_io_wait, regardless of
> how long it might have actually spent there. I have reason to believe
> that hal_mkm_get_kek might actually be spending as much as 3 times the
> reported time of 0.28s, but it's still pretty small beer.
> 
> Now we'll look at pkey_local_sign_rsa:
> 
> [11]    18.8    0.00   44.86    1000         pkey_local_sign_rsa [11]
>                 0.01   39.39    1000/1000        hal_rsa_decrypt [12]
> 
> [12]    16.5    0.01   39.39    1000         hal_rsa_decrypt [12]
>                 0.00   36.97    1000/1000        rsa_crt [13]
>                 0.00    1.92    1000/6000        unpack_fp [23]
>                 0.05    0.44    1000/13001       fp_read_unsigned_bin
>                 0.01    0.00    2000/1869115     memset [22]
> 
> [13]    15.5    0.00   36.97    1000         rsa_crt [13]
>                 0.00   15.15    3000/3000        fp_mulmod [19]
>                 0.01   11.31    1000/1000        modexp2 [25]
>                 0.00    9.97    1000/1000        create_blinding_factors
>                 0.00    0.40    1000/4000        fp_mul [44]
> 
> [25]     4.7    0.01   11.31    1000         modexp2 [25]
>                 0.00    9.61    5000/6000        unpack_fp [23]
>                 0.09    0.88    2000/13001       fp_read_unsigned_bin
>                 0.10    0.58    1000/1000        hal_modexp2 [63]
>                 0.05    0.00    7000/1869115     memset [22]
>                 0.00    0.01    5000/13031       fp_unsigned_bin_size
> 
> [63]     0.3    0.10    0.58    1000         hal_modexp2 [63]
>                 0.37    0.08  522004/5642554     hal_io_write [36]
>                 0.00    0.08    2000/2000        get_buffer [79]
>                 0.00    0.02    1001/1709859     hal_io_wait2 [15]
>                 0.00    0.01    1000/1000        hal_core_alloc2 [101]
>                 0.00    0.00    2000/3003        hal_core_free [100]
>                 0.00    0.00    2256/15832758    hal_io_read [18]
> 
> [27]     4.2    0.00    9.97    1000         create_blinding_factors
>                 0.00    9.83    2000/2000        fp_sqrmod [28]
> 
> [33]     2.2    0.00    5.33    1000        hal_rsa_private_key_from_der
>                 0.03    4.47    9000/10001       hal_asn1_decode_integer
> 
> Adding all this up, pkey_local_sign_rsa is 44.86s, of which 5.33s is
> ASN.1 decoding, only 0.68s is hal_modexp2, and a whopping 38.38s is
> libtfm (the rest of CRT, and blinding factor computation; featuring
> fp_mulmod, unpack_fp, fp_sqrmod, fp_read_unsigned_bin, fp_mul,
> fp_unsigned_bin_size, in order of contribution).
> 
> 4.2 fmc_clk_60mhz.gprof
> 
> This differs only in the clock speeds of the FPGA. Recall that, in
> master, the cores are clocked at 50 MHz, and the FMC bus is clocked at
> 90 MHz, and has to do some clock domain crossing magic. In the 60 MHz
> variant, both the cores and the bus are clocked at 60 MHz. This also
> means that the software driver no longer needs to poll the NWAIT pin on
> read, which may offset the slower bus speed.
> 
> [5]     43.8    0.00   83.30    1000         pkey_local_sign [5]
>                 0.00   45.69    1000/1000        pkey_local_sign_rsa [7]
>                 0.00   37.60    1000/1000        hal_ks_fetch [10]
> 
> It makes sense that signing takes about the same time, since it's mostly
> on the MCU. hal_ks_fetch is faster, because of the slightly faster cores
> and the synchronous bus.
> 
> 4.3 keywrap.gprof
> 
> This variant moves the AES key wrap and unwrap operations into a
> dedicated keywrap core. 1699836 calls to do_block are replaced with 1002
> calls to do_keywrap_core. Internally, just as much time is spent in AES,
> but it eliminates almost all the time spent communicating with the FPGA.
> 
> (This uses the same FPGA clocking as master, so that's what we need to
> compare to here.)
> 
> [5]     41.8    0.00   71.01    1000         pkey_local_sign [5]
>                 0.00   47.23    1000/1000        pkey_local_sign_rsa [7]
>                 0.00   23.77    1000/1000        hal_ks_fetch [12]
> 
> On the face of it, this looks okay. It's spending less time in
> hal_ks_fetch, and...a little more time in pkey_local_sign_rsa? Drilling
> down (not shown here), we find that hal_modexp2 spends 0.02s in
> hal_io_wait on master, but 2.36s here.
> 
> Even more surprising, if we drill down the hal_ks_fetch call chain, we
> find that time spent in hal_mkmif_read goes from 0.29s on master to
> 21.30s here. What?
> 
> What's going on here has everything to do with hal_io_wait:
> 
>                 0.13    2.23    1001/11025       hal_modexp2 [39]
>                 1.29   22.35   10024/11025       hal_io_wait [13]
> [11]    15.3    1.42   24.58   11025         hal_io_wait2 [11]
> 
>                 0.00    0.01       4/10024       hal_get_random [100]
>                 0.00    2.36    1002/10024       do_keywrap_core [42]
>                 0.00   21.26    9018/10024       hal_mkmif_read [16]
> [13]    13.9    0.00   23.63   10024         hal_io_wait [13]
> 
> Time spent in hal_io_wait2 is apportioned by number of calls, but
> do_keywrap_core spends a *lot* more time in each hal_io_wait call.
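> 
> (Concretely: hal_io_wait2's 26.00s total is split purely by call count,
> so hal_modexp2's share is 26.00s x 1001/11025 = 2.36s -- exactly the
> figure quoted above, whether or not its calls really took that long.)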
> 
> This can also be seen in hal_io_read: In master, 1.7M calls to
> hal_io_wait2 are responsible for 12.3M calls to hal_io_read. In keywrap,
> just 11025 calls to hal_io_wait2 are responsible for 10.9M calls to
> hal_io_read.
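> 
> (Put differently: on master each hal_io_wait2 call averages about 7
> hal_io_read polls, while in keywrap each one averages nearly 1000 --
> far fewer waits, but each individual wait is much longer.)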
> 
> Undeniably, less time is spent in hal_io_wait2 (26.00s vs 35.93s on
> master), but a lot more of that time is from hal_aes_keyunwrap, and a
> lot less from hal_mkm_get_kek. Without a clear understanding of that, we
> could waste time optimizing the wrong things.
> 
> 4.4 keywrap.cached-kek.gprof
> 
> To investigate and illustrate this, I followed a line of thought that
> was brought up at last week's meeting. Joachim pointed out that, after a
> key wrap/unwrap operation, the KEK remains in the AES (or keywrap) core.
> So I modified ks.c and aes_keywrap.c to fetch the KEK only on the
> first operation (a sketch of the change appears at the end of this
> section). The speed improvement was very modest (about 0.8ms/sig
> unprofiled), but it really shifted the time assigned to key-fetch and
> signing:
> 
> [4]     41.2    0.00   69.68    1000         pkey_local_sign [4]
>                 0.00   57.30    1000/1000        pkey_local_sign_rsa [7]
>                 0.00   12.37    1000/1000        hal_ks_fetch [19]
> 
>                 1.83   10.41    1001/2007        hal_modexp2 [16]
>                 1.84   10.46    1006/2007        hal_io_wait [21]
> [11]    14.5    3.68   20.86    2007         hal_io_wait2 [11]
> 
>                 0.00    0.05       4/1006        hal_get_random [84]
>                 0.00   12.25    1002/1006        do_keywrap_core [22]
> [21]     7.3    0.00   12.30    1006         hal_io_wait [21]
> 
> There are only 2007 calls to hal_io_wait2. hal_modexp2 and
> do_keywrap_core are responsible for about the same number of calls, so
> get assigned about the same amount of time.
> 
> So even in this scenario, we should assign most of hal_io_wait to
> do_keywrap_core. Realistically, I'd expect something like:
> 
>                 0.00   45.30    1000/1000        pkey_local_sign_rsa [7]
>                 0.00   24.37    1000/1000        hal_ks_fetch [19]
> 
> Compare to master:
> 
>                 0.00   44.86    1000/1000        pkey_local_sign_rsa[11]
>                 0.00   48.39    1000/1000        hal_ks_fetch [7]
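> 
> (For reference, the KEK-caching change amounts to something like this
> hypothetical sketch -- the real patch touches ks.c and aes_keywrap.c,
> and the names below are invented:)
> 
>     #include <stddef.h>
>     #include <stdint.h>
> 
>     typedef int hal_error_t;           /* stand-in; libhal defines this */
>     #define HAL_OK 0
> 
>     /* hypothetical: pushes the KEK into the AES/keywrap core */
>     extern hal_error_t load_kek_into_core(const uint8_t *kek, size_t len);
> 
>     static int kek_in_core = 0;
> 
>     static hal_error_t set_kek_once(const uint8_t *kek, size_t len)
>     {
>         if (kek_in_core)
>             return HAL_OK;             /* the core still holds the KEK */
>         hal_error_t err = load_kek_into_core(kek, len);
>         if (err == HAL_OK)
>             kek_in_core = 1;           /* skip the reload next time */
>         return err;
>     }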
> 
> 5. Conclusions
> 
> Profiling is a necessary tool for understanding where the program is
> spending its time. But it's necessary to understand all the factors that
> go into the seemingly-precise numbers, in order to have a clear(er)
> picture of what's actually going on.
> 
> Both `fmc_clk_60mhz` and `js_keywrap` show promise for speed
> improvements, but fine-grained FPGA communications and software bignum
> math continue to dominate signing time.
> 
> 				paul
> 
> 


-- 
With kind regards, Yours

Joachim Strömbergson
========================================================================
                               Assured AB
========================================================================
