---
date: 2024-02-13
article.title: Speeding elliptic curve cryptography
article.description:
  How we improved the performance of elliptic curves by only modifying the underlying byte array
tags:
  - OCaml
  - MirageOS
  - cryptography
  - security
author:
  name: Hannes Mehnert
  email: hannes@mehnert.org
  link: https://hannes.robur.coop
---
TL;DR: by replacing cstruct with string, we gain up to a factor of 2.4 in performance.
## Mirage-crypto-ec
In April 2021 we published our implementation of [elliptic curve cryptography](https://hannes.robur.coop/Posts/EC) (as the `mirage-crypto-ec` opam package) - this covers ECDSA and ECDH for the NIST curves P224, P256, P384, and P521, and also Ed25519 (EdDSA) and X25519 (ECDH). We use [fiat-crypto](https://github.com/mit-plv/fiat-crypto/) for the cryptographic primitives, which emits C code that is correct by construction (note: earlier we stated "free of timing side-channels", but this is a huge challenge, and as [reported by Edwin Török](https://discuss.systems/@edwintorok/111925959867297453) likely impossible on current x86 hardware). More C code (such as `point_add`, `point_double`, and further 25519 computations including tables) has been taken from the BoringSSL code base. A lot of the OCaml code originates from our TLS 1.3 work in 2018, where Etienne Millon, Nathan Rebours, and Clément Pascutto interfaced [elliptic curves for OCaml](https://github.com/mirage/fiat/) (with the goal of being usable with MirageOS).
The goal of mirage-crypto-ec was to develop elliptic curve support for OCaml & MirageOS quickly - which didn't leave much time to focus on performance. As time goes by, our mileage varies, and we're keen to use fewer resources - less CPU time and a smaller memory footprint are preferable.
## Memory allocation and calls to C
OCaml uses managed memory with a generational, copying garbage collector. To safely call a C function at any point in time when the arguments are OCaml values (memory allocated on the OCaml heap), it is crucial that the arguments stay at the same memory location while the C function executes, and are not moved by the GC. Otherwise the C code may retrieve wrong data or access unmapped memory.
There are several strategies to achieve this, including "let's use another memory area where the GC doesn't mess around", "do not run any GC while executing the C code" (read further in the [cheaper C calls](https://v2.ocaml.org/releases/4.14/htmlman/intfc.html#ss:c-direct-call) section of the OCaml manual), "deeply copy the arguments to a non-moving memory area before executing C code", and likely others.
For our elliptic curve operations, the C code is pretty simple - there are no memory allocations happening in C, nor are exceptions raised. Also, the execution time of the code is constant and pretty small.
## ocaml-cstruct
In the [MirageOS](https://mirage.io) ecosystem, a core library is [cstruct](https://github.com/mirage/ocaml-cstruct), whose purpose is manifold: it provides ppx rewriters to define C structure layouts in OCaml (getter/setter functions are generated), as well as enums; another fundamental idea is to use OCaml bigarrays, which are non-moving memory not allocated on the OCaml heap but directly by calling `malloc`. The memory can even be page-aligned, as required by some C software, such as Xen. Convenient functionality, such as "retrieve a big-endian unsigned 32 bit integer from offset X in this buffer", is provided as well.
But there's a downside to it - as time moves along, Xen is no longer the only target for MirageOS, and other virtualization mechanisms (such as KVM / virtio) do not require page-aligned memory ranges that are retained at a given memory address. It also turns out that cstruct spends a lot of time in bounds checks. Another huge downside is that OCaml tooling (such as statmemprof) was for a long time (and maybe still is) unaware of memory allocated outside the OCaml GC (cstruct uses a bigarray as its underlying buffer). Freeing up the memory requires finalizers to be executed - all in all pretty tedious (expensive) and against the OCaml runtime philosophy.
Over time, the OCaml standard library also gained relevant support: (a) strings are now immutable byte vectors (since 4.06, released in 2017; cstruct also has an interface for mutable/immutable buffers, but as far as I can tell it is not used), and (b) a given number of octets in a string or bytes can be retrieved as an (unsigned) integer (since 4.08, released in 2019, while some additional functionality is only available from 4.13).
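As a small illustration of point (b), the sketch below reads big-endian integers out of an immutable string using only the standard library. The buffer contents and the names are made up for this example; `String.get_int32_be` and `String.get_uint16_be` need OCaml 4.13 or later (the `Bytes` equivalents exist since 4.08).

```ocaml
(* A made-up 8-byte buffer, for illustration only. *)
let buf = "\x00\x00\x01\x00\xde\xad\xbe\xef"

(* Signed 32-bit big-endian integer at offset 0 (returned as int32). *)
let first : int32 = String.get_int32_be buf 0

(* Unsigned 16-bit big-endian integer at offset 4. *)
let second : int = String.get_uint16_be buf 4

(* Accesses are bounds-checked: reading past the end raises
   Invalid_argument instead of touching unrelated memory. *)
let out_of_bounds =
  match String.get_uint16_be buf 7 with
  | _ -> false
  | exception Invalid_argument _ -> true

let () =
  Printf.printf "first=%ld second=0x%04x out_of_bounds=%b\n"
    first second out_of_bounds
```

No bigarray and no extra library are involved here; the string lives on the regular OCaml heap.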
Still, bigarrays are necessary in certain situations - if you need to have a non-moving (shared) area of memory, as in the Xen interface, but also e.g. when you compute in parallel in different processes, or when you need mmap()ed files.
## Putting it together
Already in October 2021, Romain [proposed](https://github.com/mirage/mirage-crypto/pull/146) to use bytes instead of cstruct for mirage-crypto-ec. The PR was sitting around since benchmarks were missing and developer time was scarce. But recently, Virgile Robles [proposed](https://github.com/mirage/mirage-crypto/pull/191) another line of work: using pre-computed tables for NIST curves to speed up the elliptic curve cryptography. A performance evaluation showed that "use bytes instead of cstruct" combined with pre-computed tables made a huge difference (factor of 6) compared to the latest release.
To ease reviewing the changes, we decided to focus on landing "use bytes instead of cstruct" first, and thankfully Pierre Alain had already rebased the existing patch onto the latest release of mirage-crypto-ec. We also went further and used string instead of bytes where applicable. For safety reasons we also introduced an API layer which (a) allocates a byte vector for the result, (b) calls the primitive, and \(c) transforms the byte vector into an immutable string. This API is more in line with functional programming (immutable values), and since allocations and deallocations of values are cheap, there's no measurable performance decrease.
All the changes are internal, so there's no external API that needs to be adjusted - still, at the API boundary there is one conversion from cstruct to string (and back for the return value).
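The three steps (a) to \(c) can be sketched roughly as follows. This is not the actual mirage-crypto-ec code: `with_result`, `xor_into` and `xor` are hypothetical names, and the pure-OCaml `xor_into` merely stands in for a C primitive that writes its result into a caller-supplied buffer.

```ocaml
(* Hypothetical stand-in for a C primitive: writes the byte-wise XOR of
   [a] and [b] into the caller-supplied buffer [out]. *)
let xor_into (out : bytes) (a : string) (b : string) : unit =
  for i = 0 to Bytes.length out - 1 do
    Bytes.set out i (Char.chr (Char.code a.[i] lxor Char.code b.[i]))
  done

(* The wrapper layer: (a) allocate the result buffer, (b) run the
   primitive on it, (c) hand it out as an immutable string.
   [Bytes.unsafe_to_string] avoids a copy; this is only sound as long as
   [fill] does not keep a reference to [out] and mutate it later. *)
let with_result (len : int) (fill : bytes -> unit) : string =
  let out = Bytes.create len in
  fill out;
  Bytes.unsafe_to_string out

(* Public, purely functional interface: strings in, string out. *)
let xor (a : string) (b : string) : string =
  if String.length a <> String.length b then
    invalid_arg "xor: length mismatch";
  with_result (String.length a) (fun out -> xor_into out a b)
```

Callers only ever see immutable strings; the mutable buffer exists just for the duration of the primitive call.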
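To make that boundary conversion concrete, here is a minimal sketch that uses a plain `Bigarray` as a stand-in for cstruct (which is backed by a bigarray). The names `buffer`, `buffer_to_string`, `buffer_of_string` and `at_boundary` are invented for this illustration; the real code would presumably rely on the conversion functions of the cstruct library instead.

```ocaml
(* Stand-in for a cstruct-like buffer: off-heap memory of chars. *)
type buffer =
  (char, Bigarray.int8_unsigned_elt, Bigarray.c_layout) Bigarray.Array1.t

(* Copy the off-heap buffer into an immutable OCaml string. *)
let buffer_to_string (b : buffer) : string =
  String.init (Bigarray.Array1.dim b) (fun i -> Bigarray.Array1.get b i)

(* Copy a string into a freshly allocated off-heap buffer. *)
let buffer_of_string (s : string) : buffer =
  let b =
    Bigarray.Array1.create Bigarray.char Bigarray.c_layout (String.length s)
  in
  String.iteri (fun i c -> Bigarray.Array1.set b i c) s;
  b

(* The external API keeps its buffer-based signature; internally the
   string-based function [f] does the work. One copy in, one copy out. *)
let at_boundary (f : string -> string) (input : buffer) : buffer =
  buffer_of_string (f (buffer_to_string input))

let () =
  let input = buffer_of_string "abc" in
  let output = at_boundary String.uppercase_ascii input in
  print_endline (buffer_to_string output)
```

The two copies cost something, but they happen once per API call rather than once per internal field operation.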
We used `perf` to construct some flame graphs (of the ECDSA P256 sign), shown below.
![Flamegraph of ECDSA sign with cstruct](../images/trace-cstruct-440.svg)
The flame graph of P256 ECDSA sign using the mirage-crypto release 0.11.2. The majority of time is spent in "do_sign", which calls `inv` (inversion), `scalar_mult` (majority of time), and `x_of_finite_point_mod_n`. The scalar multiplication spends time in `add`, `double` and `select`. Several towers starting at `Cstruct.create_919` are visible.
With PR#146, the flame graph looks different:
![Flamegraph of ECDSA sign with string](../images/trace-string-770.svg)
Now, the allocation towers do not exist anymore. The time of a sign operation is spent in `inv`, `scalar_mult`, and `x_of_finite_point_mod_n`. There's still room for improvement in these operations.
## Performance numbers
All numbers were gathered on a Lenovo X250 laptop with an Intel i7-5600U CPU @ 2.60GHz. We used OCaml 4.14.1 as compiler. The baseline is OpenSSL 3.0.12. All numbers are in operations per second.
NIST P-256
| op | 0.11.2 | PR#146 | speedup (PR#146 vs 0.11.2) | OpenSSL | speedup (OpenSSL vs PR#146) |
| - | - | - | - | - | - |
| sign | 748 | 1806 | 2.41x | 34392 | 19.04x |
| verify | 285 | 655 | 2.30x | 12999 | 19.85x |
| ecdh | 858 | 1785 | 2.08x | 16514 | 9.25x |
Curve 25519
| op | 0.11.2 | PR#146 | speedup (PR#146 vs 0.11.2) | OpenSSL | speedup (OpenSSL vs PR#146) |
| - | - | - | - | - | - |
| sign | 10713 | 11560 | 1.08x | 21943 | 1.90x |
| verify | 7600 | 8314 | 1.09x | 7081 | 0.85x |
| ecdh | 12144 | 13457 | 1.11x | 26201 | 1.95x |
Note: to re-create the performance numbers, you can run `openssl speed ecdsap256 ecdhp256 ed25519 ecdhx25519` - for the OCaml side, use `dune build bench/speed.exe --release` and `_build/default/bench/speed.exe ecdsa-sign ecdsa-verify ecdh-share`.
The performance improvements are up to 2.4 times for P-256 and around 1.1 times for Curve 25519 compared to the latest mirage-crypto-ec release (look at the 4th column). In comparison to OpenSSL, we are still slower by a factor of up to 20 for P-256, and by a factor of up to 2 for 25519 sign and ecdh, while 25519 verify is already slightly faster (look at the last column).
If you have ideas for improvements, let us know via an issue, email, or a pull request :) We started to [gather some](https://github.com/mirage/mirage-crypto/issues/193) for 25519 by comparing our code with changes in BoringSSL over the last years.
As a spoiler, for P-256 sign there's a further improvement by a factor of around 4.5 with [Virgile's PR](https://github.com/mirage/mirage-crypto/pull/191), which uses pre-computed tables also for NIST curves.
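For reference, the two ratio columns in the tables are plain quotients of operations per second. A tiny sketch (the function name `ratio` is ours) that recomputes the P-256 sign row from the numbers above:

```ocaml
(* Ratio of two throughput figures (operations per second). *)
let ratio ~baseline ~measured =
  float_of_int measured /. float_of_int baseline

let () =
  (* PR#146 relative to release 0.11.2, P-256 sign. *)
  Printf.printf "%.2fx\n" (ratio ~baseline:748 ~measured:1806);
  (* OpenSSL relative to PR#146, P-256 sign. *)
  Printf.printf "%.2fx\n" (ratio ~baseline:1806 ~measured:34392)
```

This prints `2.41x` and `19.04x`, matching the table.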
## The road ahead for 2024
Remove all cstruct, everywhere, apart from in mirage-block-xen and mirage-net-xen ;). It was a fine decision in the early MirageOS days, but from a performance point of view, and for making our packages more broadly usable without many dependencies, it is time to remove cstruct. Earlier this year we already [removed cstruct from ocaml-tar](https://github.com/mirage/ocaml-tar/pull/137) for similar reasons.
Our MirageOS work is only partially funded; we cross-fund our work by commercial contracts and public (EU) funding. We are part of a non-profit company, so you can make a (tax-deductible - at least in the EU) [donation](https://aenderwerk.de/donate/) (select "DONATION robur" in the dropdown menu).
We're keen to get MirageOS deployed in production - if you would like to do that, don't hesitate to reach out to us via email: team at robur.coop