From 933644ce236e3a5bf927f1d7564f88e92b722940 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Reynir=20Bj=C3=B6rnsson?=
Date: Tue, 16 Apr 2024 13:49:53 +0200
Subject: [PATCH 1/4] Import miragevpn performance article

---
 articles/miragevpn-performance.md | 43 +++++++++++++++++++++++++++++++
 1 file changed, 43 insertions(+)
 create mode 100644 articles/miragevpn-performance.md

diff --git a/articles/miragevpn-performance.md b/articles/miragevpn-performance.md
new file mode 100644
index 0000000..dd25cc2
--- /dev/null
+++ b/articles/miragevpn-performance.md
@@ -0,0 +1,43 @@
+---
+date: 2024-04-16
+article.title: Speeding up MirageVPN and use it in the wild
+article.description:
+  How we improved the performance of MirageVPN
+tags:
+  - OCaml
+  - MirageOS
+  - cryptography
+  - security
+  - VPN
+author:
+  name: Hannes Mehnert
+  email: hannes@mehnert.org
+  link: https://hannes.robur.coop
+coauthors:
+  - name: Reynir Björnsson
+    email: reynir@reynir.dk
+    link: https://reyn.ir/
+---
+
+TODO: how to specify multiple authors? Is this possible? Use coauthors!
+
+As we were busy continuing to work on [MirageVPN](https://github.com/robur-coop/miragevpn), we got in touch with [eduVPN](https://eduvpn.org), who are interested about deploying MirageVPN. We got example configuration from their side, and [fixed](https://github.com/robur-coop/miragevpn/pull/201) [some](https://github.com/robur-coop/miragevpn/pull/168) [issues](https://github.com/robur-coop/miragevpn/pull/202), and also implemented [tls-crypt](https://github.com/robur-coop/miragevpn/pull/169) - which was straightforward since we earlier spend time to implement [tls-crypt-v2](https://blog.robur.coop/articles/miragevpn.html).
+
+In January, they gave MirageVPN another try, and [measured the performance](https://github.com/robur-coop/miragevpn/issues/206) -- which was very poor -- MirageVPN (run as a Unix binary) provided a bandwith of 9.3Mb/s, while OpenVPN provided a bandwidth of 360Mb/s (using a VPN tunnel over TCP).
+
+We aim at spending less resources for computing, thus the result was not satisfying for us. We re-read a lot of code, refactored a lot, and are now at ~250Mb/s.
+
+## Performance engineering
+
+For tooling, we used, apart from code reading, the Linux utility [`perf`](https://perf.wiki.kernel.org/index.php/Main_Page) together with [Flamegraph](https://github.com/brendangregg/FlameGraph) to graph its output. This works nicely with OCaml programs (we're using the 4.14.1 compiler and runtime system). We did the performance engineering on Unix binaries, i.e. not on MirageOS unikernels - but the MirageVPN protocol is used in both scenarios - thus the performance improvements described here are also in the MirageVPN unikernels.
+
+The learnings of our performance engineering are in three areas:
+- Formatting strings is computational expensive -- thus if in an error case a hexdump is produced of a packet, its construction must be delayed for when the error case is executed (we have [this PR](https://github.com/robur-coop/miragevpn/pull/220) and [that PR](https://github.com/robur-coop/miragevpn/pull/209)). Alain Frisch wrote a nice [blog post](https://www.lexifi.com/blog/ocaml/note-about-performance-printf-and-format/#) at LexiFi about performance of `Printf` and `Format`.
+- Rethink allocations: fundamentally, only a single big buffer (to be send out) for each incoming packet should be allocated, not a series of buffers that are concatenated (see [this PR](https://github.com/robur-coop/miragevpn/pull/217) and [that PR](https://github.com/robur-coop/miragevpn/pull/219)). Additionally, not zeroing out the just allocated buffer (if it is filled with data anyways) removes some further instructions (see [this PR](https://github.com/robur-coop/miragevpn/pull/218)). And we figured that appending to an empty buffer nevertheless allocated and copied in OCaml, so we worked on [this PR](https://github.com/robur-coop/miragevpn/pull/214).
+- Still an open topic is: we are in the memory-safe language OCaml, and we sometimes extract data out of a buffer (or set data in a buffer). Now, each operation lead to bounds checks (that we do not touch memory that is not allocated or not ours). However, if we just checked for the buffer being long enough (either by checking the length, or by allocating a specific amount of data), these bounds checks are superfluous. So far, we don't have an automated solution for this issue, but we are [discussing it in the OCaml community](https://discuss.ocaml.org/t/bounds-checks-for-string-and-bytes-when-retrieving-or-setting-subparts-thereof/), and are eager to find a solution to avoid unneeded computations.
+
+To guide the performance engineering, we also developed [a microbenchmark](https://github.com/robur-coop/miragevpn/pull/230) using OCaml tooling. This will setup a client and server without any input and output, and transfer data.
+
+To conclude: we already achieved a factor of 25 in performance by adapting the code in various ways. We have ideas to improve the performance even more in the future - we also work on using OCaml string and bytes, instead of off-the-OCaml-heap-allocated bigarrays (see [our previous article](https://blog.robur.coop/articles/speeding-ec-string.html), which provided some speedups).
+
+We want to thank [NLnet](https://nlnet.nl) for their funding (via [NGI assure](https://www.assure.ngi.eu/)), and [eduVPN](https://eduvpn.org) for their interest.
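The article's first takeaway is to build an error-case hexdump only when the error is actually reported. A small OCaml sketch can illustrate this. It is not MirageVPN's code. The names `hexdump`, `check_eager` and `check_lazy`, and the length check itself, are all hypothetical. The delay uses a plain closure (`unit -> string`), which may differ from the mechanism in the linked PRs. The `hexdump_calls` counter exists only to make the cost visible.

```ocaml
(* Counts how often the expensive formatting actually ran. *)
let hexdump_calls = ref 0

(* Render bytes as space-separated hex pairs: the costly part that
   should stay off the fast path. *)
let hexdump (s : string) : string =
  incr hexdump_calls;
  let b = Buffer.create (3 * String.length s) in
  String.iteri
    (fun i c ->
      if i > 0 then Buffer.add_char b ' ';
      Buffer.add_string b (Printf.sprintf "%02x" (Char.code c)))
    s;
  Buffer.contents b

(* Eager: the message, including the hexdump, is built on every call,
   even when the packet is fine and the message is never used. *)
let check_eager ~expected_len (pkt : string) : (unit, string) result =
  let msg =
    Printf.sprintf "bad length %d: %s" (String.length pkt) (hexdump pkt)
  in
  if String.length pkt = expected_len then Ok () else Error msg

(* Delayed: the error carries a closure, so formatting only happens
   if a caller decides to print the error. *)
let check_lazy ~expected_len (pkt : string) : (unit, unit -> string) result =
  if String.length pkt = expected_len then Ok ()
  else
    Error
      (fun () ->
        Printf.sprintf "bad length %d: %s" (String.length pkt) (hexdump pkt))

let () =
  match check_lazy ~expected_len:8 "\x00\x01\x02\x03" with
  | Ok () -> ()
  | Error msg -> print_endline (msg ()) (* prints "bad length 4: 00 01 02 03" *)
```

On a well-formed packet, `check_lazy` never calls `hexdump`. By contrast, `check_eager` formats the whole packet and then discards the result. The closure-taking message style of the `Logs` library (`Logs.debug (fun m -> m ...)`) rests on the same idea.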
From 8b9aa6b18f357444b8f8d811477ee2cb5085b5f1 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Reynir=20Bj=C3=B6rnsson?=
Date: Tue, 16 Apr 2024 14:10:10 +0200
Subject: [PATCH 2/4] Add text about different approaches to measuring

---
 articles/miragevpn-performance.md | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/articles/miragevpn-performance.md b/articles/miragevpn-performance.md
index dd25cc2..6d2d0d3 100644
--- a/articles/miragevpn-performance.md
+++ b/articles/miragevpn-performance.md
@@ -36,7 +36,9 @@ The learnings of our performance engineering are in three areas:
 - Rethink allocations: fundamentally, only a single big buffer (to be send out) for each incoming packet should be allocated, not a series of buffers that are concatenated (see [this PR](https://github.com/robur-coop/miragevpn/pull/217) and [that PR](https://github.com/robur-coop/miragevpn/pull/219)). Additionally, not zeroing out the just allocated buffer (if it is filled with data anyways) removes some further instructions (see [this PR](https://github.com/robur-coop/miragevpn/pull/218)). And we figured that appending to an empty buffer nevertheless allocated and copied in OCaml, so we worked on [this PR](https://github.com/robur-coop/miragevpn/pull/214).
 - Still an open topic is: we are in the memory-safe language OCaml, and we sometimes extract data out of a buffer (or set data in a buffer). Now, each operation lead to bounds checks (that we do not touch memory that is not allocated or not ours). However, if we just checked for the buffer being long enough (either by checking the length, or by allocating a specific amount of data), these bounds checks are superfluous. So far, we don't have an automated solution for this issue, but we are [discussing it in the OCaml community](https://discuss.ocaml.org/t/bounds-checks-for-string-and-bytes-when-retrieving-or-setting-subparts-thereof/), and are eager to find a solution to avoid unneeded computations.
 
-To guide the performance engineering, we also developed [a microbenchmark](https://github.com/robur-coop/miragevpn/pull/230) using OCaml tooling. This will setup a client and server without any input and output, and transfer data.
+As a first approach we connected with the MirageVPN unix client & OpenVPN client to a eduVPN server and ran speed tests using fast.com. This was sufficient to show the initial huge gap in download speeds between MirageVPN and OpenVPN. There is *a lot* of noise in this approach as there are many computers and routers involved in this setup, and it is hard to reproduce.
+To get more reproducible results we set up a local VM with openvpn and iperf3 installed. On the host machine we can then connect to the VM's OpenVPN server and run iperf3 against the *VPN* ip address. This worked more reliably. However, it was still noisy and not suitable to measure single digit percentage changes in performance.
+To better guide the performance engineering, we also developed [a microbenchmark](https://github.com/robur-coop/miragevpn/pull/230) using OCaml tooling. This will setup a client and server without any input and output, and transfer data in memory.
 
 To conclude: we already achieved a factor of 25 in performance by adapting the code in various ways. We have ideas to improve the performance even more in the future - we also work on using OCaml string and bytes, instead of off-the-OCaml-heap-allocated bigarrays (see [our previous article](https://blog.robur.coop/articles/speeding-ec-string.html), which provided some speedups).
 
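The following heavily simplified sketch shows why moving the measurement into memory removes most of the noise. It is not the microbenchmark from the linked PR. The `client_send`/`server_recv` pair is hypothetical. It exchanges length-prefixed packets through a `Queue` instead of running the VPN protocol, and it is timed with `Sys.time` (processor time) instead of a benchmarking library.

```ocaml
(* Hypothetical client: frame the payload with a 2-byte big-endian
   length (payloads must be shorter than 64 KiB) and "send" it by
   pushing onto an in-memory queue. No sockets are involved. *)
let client_send (q : bytes Queue.t) (payload : string) : unit =
  let len = String.length payload in
  let pkt = Bytes.create (2 + len) in
  Bytes.set_uint16_be pkt 0 len;
  Bytes.blit_string payload 0 pkt 2 len;
  Queue.push pkt q

(* Hypothetical server: take the next packet and strip the length. *)
let server_recv (q : bytes Queue.t) : string =
  let pkt = Queue.pop q in
  let len = Bytes.get_uint16_be pkt 0 in
  Bytes.sub_string pkt 2 len

(* Push [n] packets of [size] bytes through client and server; return
   the payload bytes received and the processor time spent. *)
let bench ~n ~size : int * float =
  let q = Queue.create () in
  let payload = String.make size 'x' in
  let received = ref 0 in
  let t0 = Sys.time () in
  for _ = 1 to n do
    client_send q payload;
    received := !received + String.length (server_recv q)
  done;
  (!received, Sys.time () -. t0)

let () =
  let bytes, secs = bench ~n:100_000 ~size:1400 in
  Printf.printf "%d bytes in %.3fs CPU (%.1f MB/s)\n" bytes secs
    (float_of_int bytes /. 1e6 /. max secs 1e-9)
```

No sockets, routers or remote servers are involved, so the numbers should vary much less between runs than fast.com or iperf3 results. A single run is still noisy, though. Judging changes of a few percent therefore still calls for several runs.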
From 6d6ccf6a5efd699c1def2619bd92ad56af991dd5 Mon Sep 17 00:00:00 2001
From: Hannes Mehnert
Date: Tue, 16 Apr 2024 14:22:06 +0200
Subject: [PATCH 3/4] move stuff around

---
 articles/miragevpn-performance.md | 21 +++++++++++++--------
 1 file changed, 13 insertions(+), 8 deletions(-)

diff --git a/articles/miragevpn-performance.md b/articles/miragevpn-performance.md
index 6d2d0d3..abf730d 100644
--- a/articles/miragevpn-performance.md
+++ b/articles/miragevpn-performance.md
@@ -2,13 +2,14 @@
 date: 2024-04-16
 article.title: Speeding up MirageVPN and use it in the wild
 article.description:
-  How we improved the performance of MirageVPN
+  Performance engineering of MirageVPN, speeding it up by a factor of 25.
 tags:
   - OCaml
   - MirageOS
   - cryptography
   - security
   - VPN
+  - performance
 author:
   name: Hannes Mehnert
   email: hannes@mehnert.org
@@ -19,27 +20,31 @@ coauthors:
     link: https://reyn.ir/
 ---
 
-TODO: how to specify multiple authors? Is this possible? Use coauthors!
-
 As we were busy continuing to work on [MirageVPN](https://github.com/robur-coop/miragevpn), we got in touch with [eduVPN](https://eduvpn.org), who are interested about deploying MirageVPN. We got example configuration from their side, and [fixed](https://github.com/robur-coop/miragevpn/pull/201) [some](https://github.com/robur-coop/miragevpn/pull/168) [issues](https://github.com/robur-coop/miragevpn/pull/202), and also implemented [tls-crypt](https://github.com/robur-coop/miragevpn/pull/169) - which was straightforward since we earlier spend time to implement [tls-crypt-v2](https://blog.robur.coop/articles/miragevpn.html).
 
 In January, they gave MirageVPN another try, and [measured the performance](https://github.com/robur-coop/miragevpn/issues/206) -- which was very poor -- MirageVPN (run as a Unix binary) provided a bandwith of 9.3Mb/s, while OpenVPN provided a bandwidth of 360Mb/s (using a VPN tunnel over TCP).
 
 We aim at spending less resources for computing, thus the result was not satisfying for us. We re-read a lot of code, refactored a lot, and are now at ~250Mb/s.
 
-## Performance engineering
+## Tooling for performance engineering of OCaml
 
-For tooling, we used, apart from code reading, the Linux utility [`perf`](https://perf.wiki.kernel.org/index.php/Main_Page) together with [Flamegraph](https://github.com/brendangregg/FlameGraph) to graph its output. This works nicely with OCaml programs (we're using the 4.14.1 compiler and runtime system). We did the performance engineering on Unix binaries, i.e. not on MirageOS unikernels - but the MirageVPN protocol is used in both scenarios - thus the performance improvements described here are also in the MirageVPN unikernels.
+As a first approach we connected with the MirageVPN unix client & OpenVPN client to a eduVPN server and ran speed tests using [fast.com](https://fast.com). This was sufficient to show the initial huge gap in download speeds between MirageVPN and OpenVPN. There is *a lot* of noise in this approach as there are many computers and routers involved in this setup, and it is hard to reproduce.
+
+To get more reproducible results we set up a local VM with openvpn and iperf3 installed. On the host machine we can then connect to the VM's OpenVPN server and run iperf3 against the *VPN* ip address. This worked more reliably. However, it was still noisy and not suitable to measure single digit percentage changes in performance.
+To better guide the performance engineering, we also developed [a microbenchmark](https://github.com/robur-coop/miragevpn/pull/230) using OCaml tooling. This will setup a client and server without any input and output, and transfer data in memory.
+
+We also re-read our code and used the Linux utility [`perf`](https://perf.wiki.kernel.org/index.php/Main_Page) together with [Flamegraph](https://github.com/brendangregg/FlameGraph) to graph its output. This works nicely with OCaml programs (we're using the 4.14.1 compiler and runtime system). We did the performance engineering on Unix binaries, i.e. not on MirageOS unikernels - but the MirageVPN protocol is used in both scenarios - thus the performance improvements described here are also in the MirageVPN unikernels.
+
+## Takeaway of performance engineering
 
 The learnings of our performance engineering are in three areas:
 - Formatting strings is computational expensive -- thus if in an error case a hexdump is produced of a packet, its construction must be delayed for when the error case is executed (we have [this PR](https://github.com/robur-coop/miragevpn/pull/220) and [that PR](https://github.com/robur-coop/miragevpn/pull/209)). Alain Frisch wrote a nice [blog post](https://www.lexifi.com/blog/ocaml/note-about-performance-printf-and-format/#) at LexiFi about performance of `Printf` and `Format`.
 - Rethink allocations: fundamentally, only a single big buffer (to be send out) for each incoming packet should be allocated, not a series of buffers that are concatenated (see [this PR](https://github.com/robur-coop/miragevpn/pull/217) and [that PR](https://github.com/robur-coop/miragevpn/pull/219)). Additionally, not zeroing out the just allocated buffer (if it is filled with data anyways) removes some further instructions (see [this PR](https://github.com/robur-coop/miragevpn/pull/218)). And we figured that appending to an empty buffer nevertheless allocated and copied in OCaml, so we worked on [this PR](https://github.com/robur-coop/miragevpn/pull/214).
 - Still an open topic is: we are in the memory-safe language OCaml, and we sometimes extract data out of a buffer (or set data in a buffer). Now, each operation lead to bounds checks (that we do not touch memory that is not allocated or not ours). However, if we just checked for the buffer being long enough (either by checking the length, or by allocating a specific amount of data), these bounds checks are superfluous. So far, we don't have an automated solution for this issue, but we are [discussing it in the OCaml community](https://discuss.ocaml.org/t/bounds-checks-for-string-and-bytes-when-retrieving-or-setting-subparts-thereof/), and are eager to find a solution to avoid unneeded computations.
 
-As a first approach we connected with the MirageVPN unix client & OpenVPN client to a eduVPN server and ran speed tests using fast.com. This was sufficient to show the initial huge gap in download speeds between MirageVPN and OpenVPN. There is *a lot* of noise in this approach as there are many computers and routers involved in this setup, and it is hard to reproduce.
-To get more reproducible results we set up a local VM with openvpn and iperf3 installed. On the host machine we can then connect to the VM's OpenVPN server and run iperf3 against the *VPN* ip address. This worked more reliably. However, it was still noisy and not suitable to measure single digit percentage changes in performance.
-To better guide the performance engineering, we also developed [a microbenchmark](https://github.com/robur-coop/miragevpn/pull/230) using OCaml tooling. This will setup a client and server without any input and output, and transfer data in memory.
 
 To conclude: we already achieved a factor of 25 in performance by adapting the code in various ways. We have ideas to improve the performance even more in the future - we also work on using OCaml string and bytes, instead of off-the-OCaml-heap-allocated bigarrays (see [our previous article](https://blog.robur.coop/articles/speeding-ec-string.html), which provided some speedups).
 
+Don't hesitate to reach out to us on [GitHub](https://github.com/robur-coop/miragevpn/issues), or [by mail](https://robur.coop/Contact) if you're stuck.
+
 We want to thank [NLnet](https://nlnet.nl) for their funding (via [NGI assure](https://www.assure.ngi.eu/)), and [eduVPN](https://eduvpn.org) for their interest.
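The article's allocation takeaway has three parts: allocate one buffer per outgoing packet, skip zero-filling it, and avoid copying when appending. A sketch using plain OCaml `bytes` illustrates them. The packet layout (1-byte opcode, 4-byte id, payload) and both function names are made up for illustration. They are not meant to match MirageVPN's wire format or the buffer types used in the linked PRs.

```ocaml
(* Hypothetical packet layout, for illustration only:
   1-byte opcode, 4-byte big-endian packet id, then the payload. *)

(* Concatenating version: each step allocates a new value and copies the
   bytes produced so far (five allocations, payload copied twice). *)
let build_concat ~opcode ~(id : int32) (payload : string) : string =
  let hdr = String.make 1 (Char.chr opcode) in
  let idb = Bytes.create 4 in
  Bytes.set_int32_be idb 0 id;
  hdr ^ Bytes.to_string idb ^ payload

(* Single-allocation version: compute the final length first, allocate
   once with [Bytes.create], which leaves the memory uninitialised (no
   zeroing, fine because every byte is overwritten below), and fill it
   in place. *)
let build_single ~opcode ~(id : int32) (payload : string) : string =
  let plen = String.length payload in
  let buf = Bytes.create (1 + 4 + plen) in
  Bytes.set_uint8 buf 0 opcode;
  Bytes.set_int32_be buf 1 id;
  Bytes.blit_string payload 0 buf 5 plen;
  (* No copy: [buf] is never mutated after this point. *)
  Bytes.unsafe_to_string buf
```

Both functions produce the same bytes. `build_concat` allocates five times and copies the payload twice. `build_single` allocates once and copies the payload once. `Bytes.make` would also fill the buffer with a character before it is overwritten, which is the zeroing the article avoids.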
From 8fcf51c6edf3600ca638c31feb4ff1629d5b795d Mon Sep 17 00:00:00 2001
From: Hannes Mehnert
Date: Tue, 16 Apr 2024 14:23:55 +0200
Subject: [PATCH 4/4] another heading

---
 articles/miragevpn-performance.md | 1 +
 1 file changed, 1 insertion(+)

diff --git a/articles/miragevpn-performance.md b/articles/miragevpn-performance.md
index abf730d..fd056ac 100644
--- a/articles/miragevpn-performance.md
+++ b/articles/miragevpn-performance.md
@@ -42,6 +42,7 @@ The learnings of our performance engineering are in three areas:
 - Rethink allocations: fundamentally, only a single big buffer (to be send out) for each incoming packet should be allocated, not a series of buffers that are concatenated (see [this PR](https://github.com/robur-coop/miragevpn/pull/217) and [that PR](https://github.com/robur-coop/miragevpn/pull/219)). Additionally, not zeroing out the just allocated buffer (if it is filled with data anyways) removes some further instructions (see [this PR](https://github.com/robur-coop/miragevpn/pull/218)). And we figured that appending to an empty buffer nevertheless allocated and copied in OCaml, so we worked on [this PR](https://github.com/robur-coop/miragevpn/pull/214).
 - Still an open topic is: we are in the memory-safe language OCaml, and we sometimes extract data out of a buffer (or set data in a buffer). Now, each operation lead to bounds checks (that we do not touch memory that is not allocated or not ours). However, if we just checked for the buffer being long enough (either by checking the length, or by allocating a specific amount of data), these bounds checks are superfluous. So far, we don't have an automated solution for this issue, but we are [discussing it in the OCaml community](https://discuss.ocaml.org/t/bounds-checks-for-string-and-bytes-when-retrieving-or-setting-subparts-thereof/), and are eager to find a solution to avoid unneeded computations.
 
 
+## Conclusion
 To conclude: we already achieved a factor of 25 in performance by adapting the code in various ways. We have ideas to improve the performance even more in the future - we also work on using OCaml string and bytes, instead of off-the-OCaml-heap-allocated bigarrays (see [our previous article](https://blog.robur.coop/articles/speeding-ec-string.html), which provided some speedups).
 
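The open bounds-check topic in the article can be made concrete with a manual workaround. The code validates the buffer length once, then reads with `Bytes.unsafe_get`, which skips the per-access check. This is only a sketch under stated assumptions. The 4-byte header layout and the names `fits`, `parse_checked` and `parse_unchecked` are hypothetical. An automated solution of the kind the article hopes for would make this hand-written version unnecessary.

```ocaml
(* Hypothetical 4-byte header: version (1 byte), flags (1 byte),
   big-endian length (2 bytes). *)
type header = { version : int; flags : int; length : int }

(* True when bytes [off .. off+3] lie inside [b]; written with a
   subtraction so that [off + 4] cannot overflow. *)
let fits (b : bytes) (off : int) : bool =
  off >= 0 && off <= Bytes.length b - 4

(* Checked reads: the [fits] test already guarantees all four accesses
   are in bounds, yet each [Bytes.get] repeats its own bounds check. *)
let parse_checked (b : bytes) (off : int) : (header, string) result =
  if not (fits b off) then Error "header truncated"
  else
    let g i = Char.code (Bytes.get b (off + i)) in
    Ok { version = g 0; flags = g 1; length = (g 2 lsl 8) lor g 3 }

(* Same logic with [Bytes.unsafe_get]: no per-access checks. This is
   only memory-safe because of the [fits] test above; removing it
   would allow out-of-bounds reads. *)
let parse_unchecked (b : bytes) (off : int) : (header, string) result =
  if not (fits b off) then Error "header truncated"
  else
    let g i = Char.code (Bytes.unsafe_get b (off + i)) in
    Ok { version = g 0; flags = g 1; length = (g 2 lsl 8) lor g 3 }
```

The sketch only makes the trade-off visible. The safety argument now lives in one `if`, and the compiler does not verify it. OCaml's `-unsafe` flag offers a much blunter alternative: it turns off bounds checking for array and string accesses in a whole compilation unit.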