Add OpenVPN CVEs article

2024-08-22 09:51:06 +02:00
1 changed files with 0 additions and 464 deletions
--- a/articles/tar-release.md
+++ b/articles/tar-release.md
@@ -1,464 +0,0 @@
---
-date: 2024-08-15
-article.title: The new Tar release, a retrospective
-article.description: A little retrospective to the new Tar release and changes
-tags:
-  - OCaml
-  - Cstruct
-  - functors
-author:
-  name: Romain Calascibetta
-  email: romain.calascibetta@gmail.com
-  link: https://blog.osau.re
---
-We are delighted to announce the new release of `ocaml-tar`, a small library for
-reading and writing tar archives in OCaml. Since this is a major release, we'll
-take the time in this article to explain the work that's been done by the
-cooperative on this project.
-
-Tar is an **old** project. Originally written by David Scott as part of Mirage,
-this project is particularly interesting for building bridges between the tools
-we can offer and what already exists. Tar is, in fact, widely used. So we're
-dealing with both a format that's older than I am (but email has got me used to
-that) and a project that's been around since... 2012 (over 10 years!).
-
-But we intend to maintain and improve it, since we're using it for the
-[opam-mirror][opam-mirror] project among other things - this unikernel
-provides an opam-repository "tarball" to opam when you run `opam update`.
-
-## `Cstruct.t` & bytes
-
-As some of you may have noticed, over the last few months we've begun a fairly
-substantial change to the Mirage ecosystem, replacing the use of `Cstruct.t` in
-key places with bytes/string.
-
-This choice is based on 2 considerations:
-- we came to realize that `Cstruct.t` could be very costly in terms of
-  performance
-- `Cstruct.t` remains a "Mirage" structure; outside the Mirage ecosystem, the
-  use of `Cstruct.t` is not so "obvious".
-
-The pull-request is available here: https://github.com/mirage/ocaml-tar/pull/137.
-The discussion is worth reading to discover common bugs (uninitialized
-buffer, invalid access). There's also a small benchmark to support our initial
-intuition<sup>[1](#fn1)</sup>.
-
-But this PR can also be an opportunity to understand the existence of
-`Cstruct.t` in the Mirage ecosystem and the reasons for this historic choice.
-
-### `Cstruct.t` as non-movable data
-
-I've already [made][discuss-cstruct] a list of pros/cons when it comes to
-bigarrays. Indeed, `Cstruct.t` is based on a bigarray:
-```ocaml
-type buffer = (char, Bigarray.int8_unsigned_elt, Bigarray.c_layout) Bigarray.Array1.t
-
-type t =
-  { buffer : buffer
-  ; off : int
-  ; len : int }
-```
-
-The experienced reader may rightly wonder why `Cstruct.t` is a bigarray with
-`off` and `len`. First, we need to clarify what a bigarray is for OCaml.
-
-A bigarray is a somewhat special value in OCaml. This value is allocated in the
-C heap. In other words, its contents do not live in the heap managed by OCaml's
-garbage collector, but exist outside it. The first (and very important)
-implication of this feature is that the contents of a bigarray do not move (even
-if the GC tries to defragment the memory). This feature has several advantages:
-- in parallel programming, it can be very interesting to use a bigarray knowing
-  that, from the point of view of the 2 processes, the position of the bigarray
-  will never change - this is essentially what [parmap][parmap] does (before
-  OCaml 5).
-- for calculations such as checksum or hash, it can be interesting to use a
-  bigarray. The calculation would not be interrupted by the GC since the
-  bigarray does not move. The calculation can therefore be continued at the same
-  point, which can help the CPU to better predict the next stage of the
-  calculation. This is what [digestif][digestif] offers and what
-  [decompress][decompress] requires.
-- for one reason or another, particularly when interacting with something other
-  than OCaml, you need to offer a memory zone that cannot move. This is
-  particularly true for unikernels as Xen guests (where the _net device_
-  corresponds to a fixed memory zone with which we need to interact) or
-  [mmap][mmap].
-- there are other subtleties more related to the way OCaml compiles. For
-  example, using bigarray layouts to manipulate "bigger words" can really have
-  an impact on performance, as [this PR][pr-utcp] on [utcp][utcp] shows.
-- finally, it may be useful to store sensitive information in a bigarray so as
-  to have the opportunity to clean up this information as quickly as possible
-  (ensuring that the GC has not made a copy) in certain situations.
-
-All these examples show that bigarrays can be of real interest as long as
-**their uses are properly contextualized** - which ultimately remains very
-specific. Our experience of using them in Mirage has shown us their advantages,
-but also, and above all, their disadvantages:
-- keep in mind that bigarray allocation uses either a system call like `mmap` or
-  `malloc()`. The latter is slow compared with what OCaml can offer:
-  bytes/strings smaller than [`(256 * words)`][minor-alloc] are allocated in the
-  minor heap, which is incredibly fast (3 processor instructions which can be
-  predicted very well). So, preferring to allocate a 10-byte bigarray rather
-  than a 10-byte `bytes` penalizes you enormously.
-- since the bigarray exists in the C heap, the GC needs a special mechanism to
-  know when to `free()` the zone once the value is no longer in use.
-  Reference counting is used for this: "small" values are allocated in the
-  OCaml heap and serve to manipulate the bigarray _indirectly_.
-
-#### Ownership, proxy and GC
-
-This last point deserves a little clarification, particularly with regard to the
-`Bigarray.sub` function. This function will not create a new, smaller bigarray
-and copy what was in the old one to the new one (as `Bytes.sub`/`String.sub`
-does). In fact, OCaml will allocate a "proxy" of your bigarray that represents a
-subfield. This is where _reference-counting_ comes in. This proxy value needs
-the initial bigarray to be manipulated. So, as long as proxies exist, the GC
-cannot `free()` the initial bigarray.
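-
-As a minimal sketch of this difference (using only the standard `Bigarray` and
-`Bytes` modules, with `Bigarray.Array1.sub` standing in for what the text calls
-`Bigarray.sub`), a write through the proxy is visible in the initial bigarray,
-whereas a write to the result of `Bytes.sub` is not:
-
```ocaml
-(* An 8-byte bigarray filled with 'a'. *)
-let ba = Bigarray.Array1.create Bigarray.char Bigarray.c_layout 8
-let () = Bigarray.Array1.fill ba 'a'
-
-(* [proxy] covers bytes 2..5 of [ba]; nothing is copied. *)
-let proxy = Bigarray.Array1.sub ba 2 4
-let () = Bigarray.Array1.set proxy 0 'z'
-
-(* The write through the proxy landed in the initial bigarray. *)
-let () = assert (Bigarray.Array1.get ba 2 = 'z')
-
-(* [Bytes.sub] copies: the original buffer is left untouched. *)
-let b = Bytes.make 8 'a'
-let copy = Bytes.sub b 2 4
-let () = Bytes.set copy 0 'z'
-let () = assert (Bytes.get b 2 = 'a')
```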
-
-This poses several problems:
-- the first is the allocation of these proxies. They can help us to manipulate
-  the initial bigarray in several places without copying it, but as time goes
-  by, these proxies could be very expensive
-- the second is GC intervention. You still need to scan the bigarray, in a
-  particular way, to know whether or not to keep it. This particular scan, once
-  again in the distant past, was not all that common.
-- the third concerns bigarray ownership. Since we're talking about proxies, we
-  can imagine 2 concurrent tasks having access to the same bigarray.
-
-As far as the first point is concerned, `Bigarray.sub` could still be "slow" for
-small data since its result was, _de facto_, allocated in the major heap (a
-bigarray always has a finalizer - don't forget reference counting!). And, in
-truth, this is perhaps the main reason for the existence of Cstruct: to have a
-"proxy" to a bigarray allocated in the minor heap (and therefore fast). But
-since [Pierre Chambart's PR#92][bigarray-minor], the problem is no more.
-
-The second point, on the other hand, is still topical, even if we can see that
-[considerable efforts][better-bigarray-free] have been made. What we see every
-day on our unikernels is [the pressure][gc-bigarray-pressure] that can be put on
-the GC when it comes to bigarrays. Indeed, bigarrays use memory and making the C
-heap cohabit with the OCaml heap inevitably comes at a cost. As far as
-unikernels are concerned, which have a more limited memory than an OCaml
-application, we reach this limit rather quickly and we therefore ask the GC to
-work more specifically on our 10 or 20 byte bigarrays...
-
-Finally, the third point can be the toughest. On several occasions, we've
-noticed unwanted concurrent accesses to our bigarrays (for example,
-`http-lwt-client` had [this problem][http-lwt-client-bug]). In our experience,
-it's very difficult to observe and know that there is indeed an unauthorized
-concurrent access changing the contents of our buffer. In this respect, the
-question remains open as regards `Cstruct.t` and the possibility of encoding
-ownership of a `Cstruct.t` in the type to prevent unauthorized access.
-[This PR][cstruct-cap] is worth reading for all the discussions that have taken
-place on this subject<sup>[2](#fn2)</sup>.
-
-It should be noted that, with regard to the third point, the problem also
-applies to bytes and the use of `Bytes.unsafe_to_string`!
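-
-A tiny sketch of that pitfall, using only the standard library: the string
-returned by `Bytes.unsafe_to_string` shares its memory with the buffer, so
-whoever still holds the buffer can change the "immutable" string behind your
-back.
-
```ocaml
-let buf = Bytes.of_string "hello"
-
-(* No copy is made: [s] and [buf] share the same memory. *)
-let s = Bytes.unsafe_to_string buf
-
-(* A later write to [buf] silently changes the contents of [s]. *)
-let () = Bytes.set buf 0 'j'
-let () = assert (s = "jello")
```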
-
-### Conclusion about Cstruct
-
-We hope we've been thorough enough in describing our experience with Cstruct.
-If we go back to the initial definition of our `Cstruct.t` shown above and take
-all the history into account, it becomes increasingly difficult to argue for a
-**systematic** use of Cstruct in our unikernels. In fact, the question of
-`Cstruct.t` versus bytes/string remains completely open.
-
-It's worth noting that the original reasons for `Cstruct.t` are no longer really
-relevant if we consider how OCaml has evolved. It should also be noted that this
-systematic approach to using `Cstruct.t` rather than bytes/string has cost us.
-
-This is not to say that `Cstruct.t` is obsolete. The library is very good and
-offers an API where manipulating bytes to extract information such as a TCP/IP
-packet remains more pleasant than directly using bytes (even if, here too,
-[efforts][ocaml-getters] have been made).
-
-As far as `ocaml-tar` is concerned, what really counts is the possibility for
-other projects to use this library without requiring `Cstruct.t` - thus
-facilitating its adoption. In other words, given the advantages/disadvantages of
-`Cstruct.t`, we felt it would be a good idea to remove this dependency.
-
-<hr />
-
-<tag id="fn1">**1**</tag>: It should be noted that the benchmark also concerns
-compression. In this case, we use `decompress`, which uses bigarrays. So there's
-some copying involved (from bytes to bigarrays)! But despite this copying, it
-seems that the change is worthwhile.
-
-<tag id="fn2">**2**</tag>: It reminds me that we've been experimenting with
-capabilities and using the type system to enforce certain characteristics. To
-date, `Cstruct_cap` has not been used anywhere, which raises a real question
-about the advantages/disadvantages in everyday use.
-
-## Functors
-
-This is perhaps the other point of the Mirage ecosystem that is also the subject
-of debate. Functors! Before we talk about functors, we need to understand their
-relevance in the context of Mirage.
-
-Mirage transforms an application into an operating system. What's the difference
-between a "normal" application and a unikernel? The "subsystem" with which you
-interact. A normal application will interact with the host system,
-whereas a unikernel will have to interact with the Solo5 _mini-system_.
-
-What Mirage is trying to offer is the ability for an application to transform
-itself into either without changing a thing! Mirage's aim is to **inject** the
-subsystem into your application. In this case:
-- inject `unix.cmxa` when you want a Mirage application to become a simple
-  executable
-- inject [ocaml-solo5][ocaml-solo5] when you want to produce a unikernel
-
-We're not going to talk about the pros and cons of this approach here, but
-simply consider this feature as one that requires us to use functors.
-
-Indeed, what's the best way in OCaml to inject one implementation into another?
-Functors. There are definite advantages here too, but we're going to concentrate
-on one in particular: the expressiveness of types at module level (which can be
-used as arguments to our functors).
-
-For example, did you know that OCaml has a dependent type system?
-```ocaml
-type 'a nat = Zero : zero nat | Succ : 'a nat -> 'a succ nat
-and zero = |
-and 'a succ = S
-
-module type T = sig type t val v : t nat end
-module type Rec = functor (T:T) -> T
-module type Nat = functor (S:Rec) -> functor (Z:T) -> T
-
-module Zero = functor (S:Rec) -> functor (Z:T) -> Z
-module Succ = functor (N:Nat) -> functor (S:Rec) -> functor (Z:T) -> S(N(S)(Z))
-module Add = functor (X:Nat) -> functor (Y:Nat) -> functor (S:Rec) -> functor (Z:T) -> X(S)(Y(S)(Z))
-
-module One = Succ(Zero)
-module Two_a = Add(One)(One)
-module Two_b = Succ(One)
-
-module Z : T with type t = zero = struct
-  type t = zero
-  let v = Zero
-end
-
-module S (T:T) : T with type t = T.t succ = struct
-  type t = T.t succ
-  let v = Succ T.v
-end
-
-module A = Two_a(S)(Z)
-module B = Two_b(S)(Z)
-
-type ('a, 'b) refl = Refl : ('a, 'a) refl
-
-let _ : (A.t, B.t) refl = Refl (* 1+1 == succ 1 *)
-```
-
-The code is ... magical, but it shows that two differently constructed modules
-(`Two_a` & `Two_b`) ultimately produce the same type, and OCaml is able to prove
-this equality. Above all, the example shows just how powerful functors can be.
-But it also shows just how difficult functors can be to understand and use.
-
-In fact, this is one of Mirage's biggest drawbacks: the overuse of functors
-makes the code difficult to read and understand. It can be difficult to deduce
-in your head the type that results from an application of functors, and the
-constraints associated with it... (yes, I don't use `merlin`).
-
-But back to our initial problem: injection! In truth, the functor is a
-fly-killing sledgehammer in most cases. There are many other ways of injecting
-what the system would be (and how to do a `read` or `write`) into an
-implementation. The best example, as [@nojb pointed out][nojb-response], is of
-course [ocaml-tls][ocaml-tls] - this answer also shows a contrast between the
-functor approach (with [CoHTTP][cohttp] for example) and the "pure value-passing
-interface" of `ocaml-tls`.
-
-What's more, we've been trying to find other approaches for injecting the system
-we want for several years now. We can already list several:
-- `ocaml-tls`' "value-passing" approach, of course, but also `decompress`
-- of course, there's the passing of [a record][poor-man-functor] (a sort of
-  mini-module with fewer possibilities with types, but which does the job - a
-  poor man's functor, in short) which would have the functions to perform the
-  system's operations
-- [mimic][mimic] can be used to inject a module as an implementation of a
-  flow/stream according to a resolution mechanism (DNS, `/etc/services`, etc.) -
-  a little closer to the idea of _runtime-resolved implicit implementations_
-- there are, of course, the variants (but if we go back to 2010, this solution
-  wasn't so obvious) popularized by [ptime][ptime]/[mtime][mtime], `digestif` &
-  [dune][dune-variants]
-- and finally, [GADTs][decompress-lzo], which describe what the process should
-  do, then let the user implement the `run` function according to the system.
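-
-To give an idea of the record approach from the list above, here is a minimal
-sketch. The `syscalls` record, `copy` and `in_memory` are hypothetical names
-made up for this illustration (they are not the API of `colombe` or of
-`ocaml-tar`): the library code only knows the record, and each system provides
-its own value of that record.
-
```ocaml
-(* Hypothetical record: what the "system" must provide. *)
-type syscalls =
-  { read : bytes -> off:int -> len:int -> int (* 0 means end of input *)
-  ; write : string -> unit }
-
-(* System-agnostic code: it only manipulates the record. *)
-let rec copy sys buf =
-  match sys.read buf ~off:0 ~len:(Bytes.length buf) with
-  | 0 -> ()
-  | n ->
-    sys.write (Bytes.sub_string buf 0 n);
-    copy sys buf
-
-(* One possible "system": an in-memory input and an output buffer. *)
-let in_memory input =
-  let pos = ref 0 and out = Buffer.create 16 in
-  let read buf ~off ~len =
-    let n = min len (String.length input - !pos) in
-    Bytes.blit_string input !pos buf off n;
-    pos := !pos + n;
-    n in
-  { read; write = Buffer.add_string out }, out
-
-let () =
-  let sys, out = in_memory "hello tar" in
-  copy sys (Bytes.create 4);
-  assert (Buffer.contents out = "hello tar")
```
-
-A Unix or Solo5 version would only have to build another `syscalls` value;
-`copy` itself would not change.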
-
-In short, based on this list and the various experiments we've carried out on a
-number of projects, we've decided to remove the functors from `ocaml-tar`! The
-crucial question now is: which method to choose?
-
-### The best answers
-
-There's no real answer to that, and in truth it depends on what level of
-abstraction you're at. In fact, you'd like to have a fairly simple method of
-abstraction from the system at the start and at the lowest level, to end up
-proposing a functor that does all the _ceremony_ (the glue between your
-implementation and the system) at the end - that's what [ocaml-git][ocaml-git]
-does, for example.
-
-The abstraction you choose also depends on how the process is going to work. As
-far as streams/protocols are concerned, the `ocaml-tls`/`decompress` approach
-still seems the best. But when it comes to introspecting a file/block-device, it
-may be preferable to use a GADT that will force the user to implement an
-arbitrary memory access rather than consume a sequence of bytes. In short, at
-this stage, experience speaks for itself and, just as we were wrong about
-functors, we won't be advising you to use this or that solution.
-
-But based on our experience of `ocaml-tls` & `decompress` with LZO (which
-requires arbitrary access to the content) and the way Tar works, we decided to
-use a "value-passing" approach (to describe when we need to read/write) and a
-GADT to describe calculations such as:
-- iterating over the files/folders contained in a Tar document
-- producing a Tar file according to a "dispenser" of inputs
-
-```ocaml
-val decode : decode_state -> string ->
-  decode_state
-   * [ `Read of int
-     | `Skip of int
-     | `Header of Header.t ] option
-   * Header.Extended.t option
-(** [decode state buf] returns a new state and what the user should do next:
-    - [`Skip] skip bytes
-    - [`Read] read bytes
-    - [`Header hdr] do something according to the last header extracted
-      (like stream-out the contents of a file). *)
-
-type ('a, 'err) t =
-  | Really_read : int -> (string, 'err) t
-  | Read : int -> (string, 'err) t
-  | Seek : int -> (unit, 'err) t
-  | Bind : ('a, 'err) t * ('a -> ('b, 'err) t) -> ('b, 'err) t
-  | Return : ('a, 'err) result -> ('a, 'err) t
-  | Write : string -> (unit, 'err) t
-```
-
-However, and this is where we come back to OCaml's limitations and where
-functors could help us: higher kinded polymorphism!
-
-### Higher kinded Polymorphism
-
-If we return to our functor example above, there's one element that may be of
-interest: `T with type t = T.t succ`
-
-In other words, it adds a constraint on a type of the signature. A constraint
-often seen with Mirage (but deprecated now according to
-[this issue][mirage-lwt]) is the type `io` and its constraint: `type 'a io`,
-`with type 'a io = 'a Lwt.t`.
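-
-As a rough sketch of what such a constraint buys (the `IO`, `COUNTER`, `Make`
-and `Direct` names are hypothetical and only serve this illustration), the
-`with type 'a io = 'a IO.io` part is what lets the caller see through the
-functor's result. Here it is instantiated with a trivial direct-style `io`
-rather than `Lwt.t`, so the example only needs the standard library:
-
```ocaml
-module type IO = sig
-  type 'a io
-  val return : 'a -> 'a io
-  val bind : 'a io -> ('a -> 'b io) -> 'b io
-end
-
-module type COUNTER = sig
-  type 'a io
-  val count_headers : string list -> int io
-end
-
-(* Without the [with type] constraint, ['a io] would stay abstract
-   and the result of [count_headers] would be unusable. *)
-module Make (IO : IO) : COUNTER with type 'a io = 'a IO.io = struct
-  type 'a io = 'a IO.io
-
-  let count_headers names =
-    List.fold_left
-      (fun acc _name -> IO.bind acc (fun n -> IO.return (n + 1)))
-      (IO.return 0) names
-end
-
-(* A direct-style "scheduler": ['a io] is just ['a]. *)
-module Direct = struct
-  type 'a io = 'a
-  let return x = x
-  let bind x f = f x
-end
-
-module C = Make (Direct)
-
-(* Thanks to the constraint, [int C.io] is known to be [int]. *)
-let n : int = C.count_headers [ "a.txt"; "b.txt" ]
-let () = assert (n = 2)
```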
-
-So we had this type in Tar. The problem is that our GADT can't understand that
-sometimes it will have to manipulate _Lwt_ values, sometimes _Async_ or
-sometimes _Eio_ (or _Miou_!). In other words: how do we compose our `Bind` with
-the `Bind` of these three targets? The difficulty lies above all in history.
-Supporting this library requires us to assume a certain compatibility with
-applications over which we have no control. What's more, we need to maintain
-support for all three libraries without imposing one.
-
-<hr />
-
-A small digression at this stage seems important to us, as we've been working
-in this way for over 10 years. Of course, despite all the solutions mentioned
-above, not depending on a system (and/or a scheduler) also allows us to ensure
-the existence of libraries like Tar over more than a decade! The OCaml ecosystem
-is changing, and choosing this or that library to facilitate the development of
-an application has implications we might regret 10 years down the line (for
-example... `Cstruct.t`!). So, it can be challenging to ensure compatibility with
-all systems, but the result is libraries steeped in the experience and know-how
-of many developers!
-
-<hr />
-
-So, and this is why we talk about Higher Kinded Polymorphism, how do we abstract
-the `t` from `'a t` (to replace it with `Lwt.t` or even with a type such as
-`type 'a t = 'a`)? This is where we're going to use the trick explained in
-[this paper][hkt]. The trick is to consider a "new type" that will represent our
-monad (lwt or async) and inject/project a value from this monad to something
-understandable by our GADT: `High : ('a, 't) io -> ('a, 'err, 't) t`.
-
-```ocaml
-type ('a, 't) io
-
-type ('a, 'err, 't) t =
-  | Really_read : int -> (string, 'err, 't) t
-  | Read : int -> (string, 'err, 't) t
-  | Seek : int -> (unit, 'err, 't) t
-  | Bind : ('a, 'err, 't) t * ('a -> ('b, 'err, 't) t) -> ('b, 'err, 't) t
-  | Return : ('a, 'err) result -> ('a, 'err, 't) t
-  | Write : string -> (unit, 'err, 't) t
-  | High : ('a, 't) io -> ('a, 'err, 't) t
-```
-
-Next, we need to create this new type according to the chosen scheduler. Let's
-take _Lwt_ as an example:
-
-```ocaml
-open Lwt.Infix
-
-module Make (X : sig type 'a t end) = struct
-  type t (* our new type *)
-  type 'a s = 'a X.t
-
-  external inj : 'a s -> ('a, t) io = "%identity"
-  external prj : ('a, t) io -> 'a s = "%identity"
-end
-
-module L = Make(Lwt)
-
-let rec run
-  : type a err. (a, err, L.t) t -> (a, err) result Lwt.t
-  = function
-  | High v -> L.prj v >|= fun v -> Ok v (* back to a real ['a Lwt.t] *)
-  | Return v -> Lwt.return v
-  | Bind (x, f) ->
-    (* stop at the first error, otherwise continue with [f] *)
-    run x >>= (function
-      | Ok value -> run (f value)
-      | Error e -> Lwt.return (Error e))
-  | _ -> ...
-
-So, as you can see, this is a real trick that you shouldn't try at home
-unsupervised. Indeed, the use of `%identity` corresponds to an `Obj.magic`! So
-even if the `io` type is exposed (to let the user derive Tar for their own
-system), this trick is not exposed to other packages, and we instead suggest
-helpers such as:
-
-```ocaml
-val lwt : 'a Lwt.t -> ('a, 'err, lwt) t
-val miou : 'a -> ('a, 'err, miou) t
-```
-
-But this way, Tar can always be derived from another system, and the process for
-extracting entries from a Tar file is the same for **all** systems!
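-
-As a self-contained sketch of such a derivation, here is the same trick applied
-to a made-up direct-style "scheduler" where a computation is just a thunk. The
-GADT is reduced to three constructors, and `Thunk`, `D` and `direct` are
-hypothetical names for this illustration, not part of `ocaml-tar`:
-
```ocaml
-type ('a, 't) io
-
-type ('a, 'err, 't) t =
-  | Bind : ('a, 'err, 't) t * ('a -> ('b, 'err, 't) t) -> ('b, 'err, 't) t
-  | Return : ('a, 'err) result -> ('a, 'err, 't) t
-  | High : ('a, 't) io -> ('a, 'err, 't) t
-
-module Make (X : sig type 'a t end) = struct
-  type t (* the "brand" which stands for [X.t] *)
-
-  external inj : 'a X.t -> ('a, t) io = "%identity"
-  external prj : ('a, t) io -> 'a X.t = "%identity"
-end
-
-(* Hypothetical scheduler: a computation is a thunk. *)
-module Thunk = struct type 'a t = unit -> 'a end
-module D = Make (Thunk)
-
-(* The only helper a user needs to see. *)
-let direct (f : unit -> 'a) : ('a, 'err, D.t) t = High (D.inj f)
-
-let rec run : type a err. (a, err, D.t) t -> (a, err) result = function
-  | High v -> Ok ((D.prj v) ()) (* project back to a thunk and call it *)
-  | Return r -> r
-  | Bind (x, f) ->
-    (match run x with
-     | Ok v -> run (f v)
-     | Error e -> Error e)
-
-let counter = ref 0
-
-let prog : (int, string, D.t) t =
-  Bind (direct (fun () -> incr counter; !counter),
-        fun n -> Return (Ok (n * 2)))
-
-let () = assert (run prog = Ok 2)
```
-
-Since `inj` and `prj` are only applied to values branded with `D.t`, the
-`%identity` cast stays confined to this module.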
-
-## Conclusion
-
-This Tar release isn't as impressive as this article, but it does sum up all the
-work we've been able to do over the last few months and years. We hope that our
-work is appreciated and that this article, which sets out all the thoughts we've
-had (and still have), helps you to better understand our work!
-
-[opam-mirror]: https://hannes.robur.coop/Posts/OpamMirror
-[discuss-cstruct]: https://discuss.ocaml.org/t/buffered-io-bytes-vs-bigstring/8978/3
-[parmap]: https://github.com/rdicosmo/parmap
-[digestif]: https://github.com/mirage/digestif
-[decompress]: https://github.com/mirage/decompress
-[pr-utcp]: https://github.com/robur-coop/utcp/pull/29
-[utcp]: https://github.com/robur-coop/utcp
-[mmap]: https://ocaml.org/manual/5.2/api/Unix.html#1_Mappingfilesintomemory
-[minor-alloc]: https://github.com/ocaml/ocaml/blob/744006bfbfa045cc1ca442ff7b52c2650d2abe32/runtime/alloc.c#L175
-[bigarray-minor]: https://github.com/ocaml/ocaml/pull/92
-[http-lwt-client-bug]: https://github.com/robur-coop/http-lwt-client/pull/16
-[cstruct-cap]: https://github.com/mirage/ocaml-cstruct/pull/237
-[gc-bigarray-pressure]: https://github.com/ocaml/ocaml/issues/7750
-[better-bigarray-free]: https://github.com/ocaml/ocaml/pull/1738
-[ocaml-getters]: https://github.com/ocaml/ocaml/pull/1864
-[ocaml-solo5]: https://github.com/mirage/ocaml-solo5
-[nojb-response]: https://discuss.ocaml.org/t/best-practices-and-design-patterns-for-supporting-concurrent-io-in-libraries/15001/4?u=dinosaure
-[ocaml-tls]: https://github.com/mirleft/ocaml-tls
-[cohttp]: https://github.com/mirage/ocaml-cohttp
-[poor-man-functor]: https://github.com/mirage/colombe/blob/07cd4cf134168ecd841924ee7ddda1a9af8fbd5a/src/sigs.ml#L13-L16
-[mimic]: https://github.com/dinosaure/mimic
-[ptime]: https://github.com/dbuenzli/ptime
-[mtime]: https://github.com/dbuenzli/mtime
-[dune-variants]: https://github.com/ocaml/dune/pull/1207
-[decompress-lzo]: https://github.com/mirage/decompress/blob/c8301ba674e037b682338958d6d0bb5c42fd720e/lib/lzo.ml#L164-L175
-[ocaml-git]: https://github.com/mirage/ocaml-git
-[mirage-lwt]: https://github.com/mirage/mirage/issues/1004#issue-507517315
-[hkt]: https://www.cl.cam.ac.uk/~jdy22/papers/lightweight-higher-kinded-polymorphism.pdf
-