date: 2024-04-16
article.title: Speeding up MirageVPN and use it in the wild
article.description: How we improved the performance of MirageVPN
tags: OCaml, MirageOS, cryptography, security, VPN
author: Hannes Mehnert, hannes@mehnert.org, https://hannes.robur.coop
coauthors: Reynir Björnsson, reynir@reynir.dk, https://reyn.ir/


While we continued to work on MirageVPN, we got in touch with eduVPN, who are interested in deploying MirageVPN. We got an example configuration from them, fixed some issues, and also implemented tls-crypt, which was straightforward since we had earlier spent time implementing tls-crypt-v2.

In January, they gave MirageVPN another try and measured its performance, which was very poor: MirageVPN (run as a Unix binary) provided a bandwidth of 9.3Mb/s, while OpenVPN provided a bandwidth of 360Mb/s (using a VPN tunnel over TCP).

We aim to spend fewer resources on computing, so this result was not satisfactory for us. We re-read a lot of code, refactored a lot, and are now at ~250Mb/s.

Performance engineering

For tooling, apart from reading code, we used the Linux utility perf together with Flamegraph to graph its output. This works nicely with OCaml programs (we're using the 4.14.1 compiler and runtime system). We did the performance engineering on Unix binaries, i.e. not on MirageOS unikernels, but the MirageVPN protocol is used in both scenarios, so the performance improvements described here also apply to the MirageVPN unikernels.

What we learned from our performance engineering falls into three areas:

  • Formatting strings is computationally expensive. If a hexdump of a packet is produced for an error case, its construction must be delayed until the error case is actually executed (we have this PR and that PR). Alain Frisch wrote a nice blog post at LexiFi about the performance of Printf and Format.
  • Rethink allocations: fundamentally, only a single big buffer (to be sent out) should be allocated for each incoming packet, not a series of buffers that are then concatenated (see this PR and that PR). Additionally, not zeroing out the freshly allocated buffer (if it is filled with data anyway) removes some further instructions (see this PR). We also found that appending to an empty buffer still allocated and copied in OCaml, so we worked on this PR.
  • One topic is still open: OCaml is a memory-safe language, and we sometimes extract data from a buffer (or set data in a buffer). Each such operation leads to a bounds check (ensuring that we do not touch memory that is not allocated or not ours). However, if we have just checked that the buffer is long enough (either by checking its length, or by allocating a specific amount of data), these bounds checks are superfluous. So far, we don't have an automated solution for this issue, but we are discussing it in the OCaml community, and are eager to find a way to avoid unneeded computations.
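To make the three points above more concrete, here is a minimal sketch using only the OCaml standard library. It is not MirageVPN's code: all function names are hypothetical, the buffer type is the stdlib `Bytes` rather than whatever buffer type the MirageVPN code base uses, and the 2-byte length prefix is just an assumed framing chosen for illustration.

```ocaml
(* Hypothetical helpers for illustration only; not MirageVPN's actual code. *)

(* Render a buffer as hex. This formats and allocates, so it is costly. *)
let hexdump (buf : bytes) : string =
  let b = Buffer.create (2 * Bytes.length buf) in
  Bytes.iter
    (fun c -> Buffer.add_string b (Printf.sprintf "%02x" (Char.code c)))
    buf;
  Buffer.contents b

(* 1. Delayed formatting. *)

(* Eager: the hexdump is built on every call, even for valid packets. *)
let check_eager ~min_len buf =
  let dump = hexdump buf in
  if Bytes.length buf >= min_len then Ok buf
  else Error ("short packet: " ^ dump)

(* Delayed: the error carries a closure, so the hexdump is only built if
   the error is actually rendered. *)
let check_delayed ~min_len buf =
  if Bytes.length buf >= min_len then Ok buf
  else Error (fun () -> "short packet: " ^ hexdump buf)

(* 2. Allocations. Assumed framing: 2-byte big-endian length, then payload
   (payloads are assumed to be shorter than 65536 bytes). *)

(* Concatenating: allocates a header buffer and then a second result buffer. *)
let frame_concat (payload : bytes) : bytes =
  let hdr = Bytes.create 2 in
  Bytes.set_uint16_be hdr 0 (Bytes.length payload);
  Bytes.cat hdr payload

(* Single allocation: one output buffer, filled in place. Bytes.create does
   not initialise its contents, which is fine as every byte is overwritten. *)
let frame_single (payload : bytes) : bytes =
  let len = Bytes.length payload in
  let out = Bytes.create (2 + len) in
  Bytes.set_uint16_be out 0 len;
  Bytes.blit payload 0 out 2 len;
  out

(* 3. Bounds checks. *)

(* Checked: after the explicit length test, each accessor checks again. *)
let read_u16_pair_checked buf =
  if Bytes.length buf < 4 then None
  else Some (Bytes.get_uint16_be buf 0, Bytes.get_uint16_be buf 2)

(* Unchecked: the single length test above the reads is the only check.
   Bytes.unsafe_get is only sound because of that test. *)
let read_u16_pair_unchecked buf =
  if Bytes.length buf < 4 then None
  else
    let byte i = Char.code (Bytes.unsafe_get buf i) in
    Some ((byte 0 lsl 8) lor byte 1, (byte 2 lsl 8) lor byte 3)
```

The unchecked variant is written by hand here, which is exactly the kind of manual work an automated solution would avoid; it is shown only to illustrate which checks are redundant.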

To guide the performance engineering, we also developed a microbenchmark using OCaml tooling. It sets up a client and a server without any input and output, and transfers data between them.
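The general shape of such an I/O-free benchmark can be sketched as follows. This is not the MirageVPN microbenchmark itself: `encode` and `decode` are hypothetical stand-ins for the client and server packet processing, the framing is an assumed 2-byte length prefix, and timing uses the standard library's `Sys.time` (processor time) rather than a dedicated benchmarking library.

```ocaml
(* Hypothetical stand-in for the sending side: frame a payload with a
   2-byte big-endian length. Payloads must be shorter than 65536 bytes. *)
let encode (payload : bytes) : bytes =
  let len = Bytes.length payload in
  let out = Bytes.create (2 + len) in
  Bytes.set_uint16_be out 0 len;
  Bytes.blit payload 0 out 2 len;
  out

(* Hypothetical stand-in for the receiving side: validate and unframe. *)
let decode (frame : bytes) : bytes option =
  if Bytes.length frame < 2 then None
  else
    let len = Bytes.get_uint16_be frame 0 in
    if Bytes.length frame <> 2 + len then None
    else Some (Bytes.sub frame 2 len)

(* Push [iterations] payloads through encode and decode entirely in memory.
   Returns the number of payload bytes moved and the processor time used. *)
let bench ~iterations ~payload_size =
  let payload = Bytes.make payload_size 'x' in
  let transferred = ref 0 in
  let t0 = Sys.time () in
  for _i = 1 to iterations do
    match decode (encode payload) with
    | Some p -> transferred := !transferred + Bytes.length p
    | None -> failwith "decode failed"
  done;
  let elapsed = Sys.time () -. t0 in
  (!transferred, elapsed)

(* Convert a (bytes, seconds) pair to megabits per second; None if the
   elapsed time was too short to measure. *)
let mbit_per_s (bytes_moved, seconds) =
  if seconds <= 0.0 then None
  else Some (float_of_int (bytes_moved * 8) /. 1e6 /. seconds)
```

A run could look like `bench ~iterations:100_000 ~payload_size:1400 |> mbit_per_s`. Because no sockets or tun devices are involved, the resulting number reflects only the cost of packet processing, which is what makes it useful for comparing code changes.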

To conclude: we have already improved performance by a factor of more than 25 (from 9.3Mb/s to ~250Mb/s) by adapting the code in various ways. We have ideas to improve the performance even more in the future. We are also working on using OCaml string and bytes instead of bigarrays allocated outside the OCaml heap (see our previous article, which provided some speedups).

We want to thank NLnet for their funding (via NGI Assure), and eduVPN for their interest.