---
date: 2024-08-15
article.title: The new Tar release, a retrospective
article.description: A little retrospective on the new Tar release and its changes
tags:
- OCaml
- Cstruct
- functors
author:
  name: Romain Calascibetta
  email: romain.calascibetta@gmail.com
  link: https://blog.osau.re
---
We are delighted to announce the new release of `ocaml-tar`, a small library for
reading and writing tar archives in OCaml. Since this is a major release, we'll
take the time in this article to explain the work that's been done by the
cooperative on this project.
Tar is an **old** project. Originally written by David Scott as part of Mirage,
this project is particularly interesting for building bridges between the tools
we can offer and what already exists. Tar is, in fact, widely used. So we're
dealing with both a format that's older than I am (but email has gotten me used
to that) and a project that's been around since... 2012 (over 10 years!).
But we intend to maintain and improve it, since we're using it for the
[opam-mirror][opam-mirror] project among other things - this unikernel serves an
opam-repository "tarball" to opam when you run `opam update`.
## `Cstruct.t` & bytes
As some of you may have noticed, over the last few months we've begun a fairly
substantial change to the Mirage ecosystem, replacing the use of `Cstruct.t` in
key places with bytes/string.
This choice is based on 2 considerations:
- we came to realize that `Cstruct.t` could be very costly in terms of
performance
- `Cstruct.t` remains a "Mirage" structure; outside the Mirage ecosystem, the
use of `Cstruct.t` is not so "obvious".
The pull-request is available here: https://github.com/mirage/ocaml-tar/pull/137.
The discussion can be interesting in discovering common bugs (uninitialized
buffer, invalid access). There's also a small benchmark to support our initial
intuition<sup>[1](#fn1)</sup>.
But this PR can also be an opportunity to understand the existence of
`Cstruct.t` in the Mirage ecosystem and the reasons for this historic choice.
### `Cstruct.t` as non-movable data
I've already [made][discuss-cstruct] a list of pros/cons when it comes to
bigarrays. Indeed, `Cstruct.t` is based on a bigarray:
```ocaml
type buffer = (char, Bigarray.int8_unsigned_elt, Bigarray.c_layout) Bigarray.Array1.t
type t =
  { buffer : buffer
  ; off : int
  ; len : int }
```
The experienced reader may rightly wonder why `Cstruct.t` is a bigarray with `off`
and `len`. First, we need to clarify what a bigarray is for OCaml.
A bigarray is a somewhat special value in OCaml. This value is allocated in the
C heap. In other words, its contents are not managed by OCaml's garbage
collector, but exist outside the OCaml heap. The first (and very important)
implication of this feature is that the contents of a bigarray do not move (even
if the GC tries to defragment the memory). This feature has several advantages:
- in parallel programming, it can be very interesting to use a bigarray knowing
that, from the point of view of the 2 processes, the position of the bigarray
will never change - this is essentially what [parmap][parmap] does (before
OCaml 5).
- for calculations such as checksum or hash, it can be interesting to use a
bigarray. The calculation would not be interrupted by the GC since the
bigarray does not move. The calculation can therefore be continued at the same
point, which can help the CPU to better predict the next stage of the
calculation. This is what [digestif][digestif] offers and what
[decompress][decompress] requires.
- for one reason or another, particularly when interacting with something other
than OCaml, you need to offer a memory zone that cannot move. This is
particularly true for unikernels as Xen guests (where the _net device_
corresponds to a fixed memory zone with which we need to interact) or
[mmap][mmap].
- there are other subtleties more related to the way OCaml compiles. For
example, using bigarray layouts to manipulate "bigger words" can really have
an impact on performance, as [this PR][pr-utcp] on [utcp][utcp] shows.
- finally, it may be useful to store sensitive information in a bigarray so as
to have the opportunity to clean up this information as quickly as possible
(ensuring that the GC has not made a copy) in certain situations.
All these examples show that bigarrays can be of real interest as long as
**their uses are properly contextualized** - which ultimately remains very
specific. Our experience of using them in Mirage has shown us their advantages,
but also, and above all, their disadvantages:
- keep in mind that bigarray allocation uses either a system call like `mmap` or
`malloc()`. The latter, compared with what OCaml can offer, is slow. As soon
as you need to allocate bytes/strings smaller than
[`(256 * words)`][minor-alloc], these values are allocated in the minor heap,
which is incredibly fast to allocate (3 processor instructions which can be
predicted very well). So, preferring to allocate a 10-byte bigarray rather
than a 10-byte `bytes` penalizes you enormously.
- since the bigarray exists in the C heap, the GC needs a special mechanism to
know when to `free()` the zone once the value is no longer in use.
Reference counting serves this purpose: "small" values are allocated in the
OCaml heap and used to manipulate the bigarray _indirectly_.
#### Ownership, proxy and GC
This last point deserves a little clarification, particularly with regard to the
`Bigarray.Array1.sub` function (`Bigarray.sub` for short). This function does
not create a new, smaller bigarray and copy the contents of the old one into it
(as `Bytes.sub`/`String.sub` do). Instead, OCaml allocates a "proxy" of your
bigarray that represents a sub-range of it. This is where _reference-counting_
comes in. This proxy value needs the initial bigarray to be manipulated. So, as
long as proxies exist, the GC cannot `free()` the initial bigarray.
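The difference is easy to observe with the standard library alone. The following is a minimal sketch (the sizes and contents are arbitrary choices for illustration): writing through the result of `Bigarray.Array1.sub` modifies the initial bigarray, while writing to the result of `Bytes.sub` leaves the original untouched.

```ocaml
(* An 8-byte bigarray, allocated outside the OCaml heap. *)
let ba = Bigarray.(Array1.create char c_layout 8)
let () = Bigarray.Array1.fill ba 'a'

(* [view] is a proxy on bytes 2..5 of [ba]: nothing is copied. *)
let view = Bigarray.Array1.sub ba 2 4
let () = view.{0} <- 'X'

(* The write through the proxy is visible in the initial bigarray. *)
let shared = (ba.{2} = 'X')

(* [Bytes.sub], on the other hand, allocates a fresh copy. *)
let b = Bytes.make 8 'a'
let copy = Bytes.sub b 2 4
let () = Bytes.set copy 0 'X'

(* The original [bytes] has not changed. *)
let untouched = (Bytes.get b 2 = 'a')
```

Both `shared` and `untouched` are `true`: the proxy avoids a copy, at the price of keeping the whole initial bigarray alive.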
This poses several problems:
- the first is the allocation of these proxies. They can help us to manipulate
the initial bigarray in several places without copying it, but as time goes
by, these proxies could be very expensive
- the second is GC intervention. You still need to scan the bigarray, in a
particular way, to know whether or not to keep it. Once again, a long time
ago, this particular scan was not all that common.
- the third concerns bigarray ownership. Since we're talking about proxies, we
can imagine 2 competing tasks having access to the same bigarray.
As far as the first point is concerned, `Bigarray.sub` could still be "slow" for
small data since the proxy was, _de facto_ (since a bigarray always has a
finalizer - don't forget reference counting!), allocated in the major heap. And,
in truth, this is perhaps the main reason for the existence of Cstruct: to have
a "proxy" to a bigarray that is allocated in the minor heap (and is therefore
fast). But since [Pierre Chambart's PR#92][bigarray-minor], the problem is no
more.
The second point, on the other hand, is still topical, even if we can see that
[considerable efforts][better-bigarray-free] have been made. What we see every
day on our unikernels is [the pressure][gc-bigarray-pressure] that can be put on
the GC when it comes to bigarrays. Indeed, bigarrays use memory and making the C
heap cohabit with the OCaml heap inevitably comes at a cost. As far as
unikernels are concerned, which have a more limited memory than an OCaml
application, we reach this limit rather quickly and we therefore ask the GC to
work more specifically on our 10 or 20 byte bigarrays...
Finally, the third point can be the toughest. On several occasions, we've
noticed competing accesses on our bigarrays that we didn't want (for example,
`http-lwt-client` had [this problem][http-lwt-client-bug]). In our experience,
it's very difficult to observe and know that there is indeed an unauthorized
concurrent access changing the contents of our buffer. In this respect, the
question remains open as regards `Cstruct.t` and the possibility of encoding
ownership of a `Cstruct.t` in the type to prevent unauthorized access.
[This PR][cstruct-cap] is worth reading for all the discussions that have taken
place on this subject<sup>[2](#fn2)</sup>.
It should be noted that, with regard to the third point, the problem also
applies to bytes and the use of `Bytes.unsafe_to_string`!
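As a minimal sketch of that last remark (the values are made up for illustration): `Bytes.unsafe_to_string` does not copy, so anyone who still holds the original `bytes` can change a string that is supposed to be immutable.

```ocaml
let buf = Bytes.of_string "hello"

(* [Bytes.to_string] copies: [safe] is independent from [buf]. *)
let safe = Bytes.to_string buf

(* No copy here: [str] and [buf] are the same memory block. *)
let str = Bytes.unsafe_to_string buf

(* Someone who still owns [buf] writes into it... *)
let () = Bytes.set buf 0 'j'

(* ...and the "immutable" [str] has silently changed, unlike [safe]. *)
let changed = (str = "jello")
let unchanged = (safe = "hello")
```

Nothing in the types tells the two strings apart, which is exactly why this kind of bug is hard to observe.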
### Conclusion about Cstruct
We hope we've been thorough enough in describing our experience with Cstruct. If we go back
to the initial definition of our `Cstruct.t` shown above and take all the
history into account, it becomes increasingly difficult to argue for a
**systematic** use of Cstruct in our unikernels. In fact, the question of
`Cstruct.t` versus bytes/string remains completely open.
It's worth noting that the original reasons for `Cstruct.t` are no longer really
relevant if we consider how OCaml has evolved. It should also be noted that this
systematic approach to using `Cstruct.t` rather than bytes/string has cost us.
This is not to say that `Cstruct.t` is obsolete. The library is very good and
offers an API where manipulating bytes to extract information such as a TCP/IP
packet remains more pleasant than directly using bytes (even if, here too,
[efforts][ocaml-getters] have been made).
As far as `ocaml-tar` is concerned, what really counts is the possibility for
other projects to use this library without requiring `Cstruct.t` - thus
facilitating its adoption. In other words, given the advantages/disadvantages of
`Cstruct.t`, we felt it would be a good idea to remove this dependency.
<hr />
<tag id="fn1">**1**</tag>: It should be noted that the benchmark also concerns
compression. In this case, we use `decompress`, which uses bigarrays. So there's
some copying involved (from bytes to bigarrays)! But despite this copying, it
seems that the change is worthwhile.
<tag id="fn2">**2**</tag>: It reminds me that we've been experimenting with
capabilities and using the type system to enforce certain characteristics. To
date, `Cstruct_cap` has not been used anywhere, which raises a real question
about the advantages/disadvantages in everyday use.
## Functors
This is perhaps the other point of the Mirage ecosystem that is also the subject
of debate. Functors! Before we talk about functors, we need to understand their
relevance in the context of Mirage.
Mirage transforms an application into an operating system. What's the difference
between a "normal" application and a unikernel? The "subsystem" with which you
interact. A normal application interacts with the host system, whereas a
unikernel has to interact with the Solo5 _mini-system_.
What Mirage is trying to offer is the ability for an application to transform
itself into either without changing a thing! Mirage's aim is to **inject** the
subsystem into your application. In this case:
- inject `unix.cmxa` when you want a Mirage application to become a simple
executable
- inject [ocaml-solo5][ocaml-solo5] when you want to produce a unikernel
So we're not going to talk about the pros and cons of this approach here, but
consider this feature as one that requires us to use functors.
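To fix ideas, here is a minimal sketch of such an injection with a functor. All the names (`SYSTEM`, `App`, `Stdout`, `In_memory`) are hypothetical and only stand in for what Mirage really generates: the application is written once against a signature, and the subsystem is chosen when the functor is applied.

```ocaml
(* What the application expects from its subsystem (hypothetical signature). *)
module type SYSTEM = sig
  val write : string -> unit
end

(* The application only knows [SYSTEM], not who implements it. *)
module App (S : SYSTEM) = struct
  let main () = S.write "Hello from the application!"
end

(* A subsystem that prints on the standard output, as an executable would. *)
module Stdout : SYSTEM = struct
  let write = print_endline
end

(* A subsystem that records everything in memory, handy for tests. *)
module In_memory = struct
  let log = Buffer.create 16
  let write str = Buffer.add_string log str
end

(* Same application, two different "targets". *)
module As_executable = App (Stdout)
module As_test = App (In_memory)

let () = As_test.main ()
```

Here `As_test.main` writes into `In_memory.log`, while `As_executable.main` would print on the standard output, without a single change to `App`.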
Indeed, what's the best way in OCaml to inject one implementation into another?
Functors. There are definite advantages here too, but we're going to concentrate
on one in particular: the expressiveness of types at module level (which can be
used as arguments to our functors).
For example, did you know that OCaml has a dependent type system?
```ocaml
type 'a nat = Zero : zero nat | Succ : 'a nat -> 'a succ nat
and zero = |
and 'a succ = S
module type T = sig type t val v : t nat end
module type Rec = functor (T:T) -> T
module type Nat = functor (S:Rec) -> functor (Z:T) -> T
module Zero = functor (S:Rec) -> functor (Z:T) -> Z
module Succ = functor (N:Nat) -> functor (S:Rec) -> functor (Z:T) -> S(N(S)(Z))
module Add = functor (X:Nat) -> functor (Y:Nat) -> functor (S:Rec) -> functor (Z:T) -> X(S)(Y(S)(Z))
module One = Succ(Zero)
module Two_a = Add(One)(One)
module Two_b = Succ(One)
module Z : T with type t = zero = struct
  type t = zero
  let v = Zero
end
module S (T:T) : T with type t = T.t succ = struct
  type t = T.t succ
  let v = Succ T.v
end
module A = Two_a(S)(Z)
module B = Two_b(S)(Z)
type ('a, 'b) refl = Refl : ('a, 'a) refl
let _ : (A.t, B.t) refl = Refl (* 1+1 == succ 1 *)
```
The code is ... magical, but it shows that two differently constructed modules
(`Two_a` & `Two_b`) ultimately produce the same type, and OCaml is able to prove
this equality. Above all, the example shows just how powerful functors can be.
But it also shows just how difficult functors can be to understand and use.
In fact, this is one of Mirage's biggest drawbacks: the overuse of functors
makes the code difficult to read and understand. It can be difficult to deduce
in your head the type that results from an application of functors, and the
constraints associated with it... (yes, I don't use `merlin`).
But back to our initial problem: injection! In truth, the functor is a
sledgehammer to kill a fly in most cases. There are many other ways of injecting
what the system would be (and how to do a `read` or `write`) into an
implementation. The best example, as [@nojb pointed out][nojb-response], is of
course [ocaml-tls][ocaml-tls] - this answer also shows a contrast between the
functor approach (with [CoHTTP][cohttp] for example) and the "pure value-passing
interface" of `ocaml-tls`.
What's more, we've been trying to find other approaches for injecting the system
we want for several years now. We can already list several:
- `ocaml-tls`' "value-passing" approach, of course, but also `decompress`
- of course, there's the passing of [a record][poor-man-functor] (a sort of
mini-module with fewer possibilities with types, but which does the job - a
poor man's functor, in short) which would have the functions to perform the
system's operations
- [mimic][mimic] can be used to inject a module as an implementation of a
flow/stream according to a resolution mechanism (DNS, `/etc/services`, etc.) -
a little closer to the idea of _runtime-resolved implicit implementations_
- there are, of course, the variants (but if we go back to 2010, this solution
wasn't so obvious) popularized by [ptime][ptime]/[mtime][mtime], `digestif` &
[dune][dune-variants]
- and finally, [GADTs][decompress-lzo], which describe what the process should
do, then let the user implement the `run` function according to the system.
In short, based on this list and the various experiments we've carried out on a
number of projects, we've decided to remove the functors from `ocaml-tar`! The
crucial question now is: which method to choose?
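To make the record-passing option from the list above a little more concrete, here is a minimal sketch. The record type `sys`, the `copy` function and the in-memory implementation are all hypothetical names made up for this example, they are not taken from an existing library.

```ocaml
(* Hypothetical record describing the system operations we need. *)
type ('src, 'dst) sys =
  { read : 'src -> bytes -> int -> int -> int (* returns 0 at end of input *)
  ; write : 'dst -> string -> unit }

(* Copy everything from [src] to [dst], whatever the system is. *)
let copy sys src dst =
  (* A deliberately small buffer, to force several [read]s. *)
  let buf = Bytes.create 4 in
  let rec go () =
    match sys.read src buf 0 (Bytes.length buf) with
    | 0 -> ()
    | len ->
        sys.write dst (Bytes.sub_string buf 0 len);
        go ()
  in
  go ()

(* An in-memory "system": the source is a string with a cursor,
   the destination is a [Buffer.t]. *)
let in_memory =
  let read (str, pos) buf off len =
    let len = min len (String.length str - !pos) in
    Bytes.blit_string str !pos buf off len;
    pos := !pos + len;
    len
  in
  { read; write = Buffer.add_string }

let copied =
  let dst = Buffer.create 16 in
  copy in_memory ("hello tar!", ref 0) dst;
  Buffer.contents dst
```

`copy` never mentions Unix, Lwt or Solo5: a Unix version would only need another record built from `Unix.read` and `Unix.write`-like functions. The price is that a record cannot carry type definitions the way a module does.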
### The best answers
There's no real answer to that, and in truth it depends on what level of
abstraction you're at. In fact, you'd like to have a fairly simple method of
abstraction from the system at the start and at the lowest level, to end up
proposing a functor that does all the _ceremony_ (the glue between your
implementation and the system) at the end - that's what [ocaml-git][ocaml-git]
does, for example.
The abstraction you choose also depends on how the process is going to work. As
far as streams/protocols are concerned, the `ocaml-tls`/`decompress` approach
still seems the best. But when it comes to introspecting a file/block-device, it
may be preferable to use a GADT that will force the user to implement an
arbitrary memory access rather than consume a sequence of bytes. In short, at
this stage, experience speaks for itself and, just as we were wrong about
functors, we won't be advising you to use this or that solution.
But based on our experience of `ocaml-tls` & `decompress` with LZO (which
requires arbitrary access to the content) and the way Tar works, we decided to
use a "value-passing" approach (to describe when we need to read/write) and a
GADT to describe calculations such as:
- iterating over the files/folders contained in a Tar document
- producing a Tar file according to a "dispenser" of inputs
```ocaml
val decode : decode_state -> string ->
  decode_state
  * [ `Read of int
    | `Skip of int
    | `Header of Header.t ] option
  * Header.Extended.t option
(** [decode state data] returns a new state and what the user should do next:
    - [`Skip] skip bytes
    - [`Read] read bytes
    - [`Header hdr] do something according to the last header extracted
      (like stream-out the contents of a file). *)

type ('a, 'err) t =
  | Really_read : int -> (string, 'err) t
  | Read : int -> (string, 'err) t
  | Seek : int -> (unit, 'err) t
  | Bind : ('a, 'err) t * ('a -> ('b, 'err) t) -> ('b, 'err) t
  | Return : ('a, 'err) result -> ('a, 'err) t
  | Write : string -> (unit, 'err) t
```
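To illustrate how such a GADT is meant to be consumed, here is a minimal sketch of a `run` function for a toy "system" where the archive is simply a string in memory. The `error` type, the semantics given to `Seek` (skip forward) and the small program are assumptions made for this example, not the behaviour of `ocaml-tar`.

```ocaml
type ('a, 'err) t =
  | Really_read : int -> (string, 'err) t
  | Read : int -> (string, 'err) t
  | Seek : int -> (unit, 'err) t
  | Bind : ('a, 'err) t * ('a -> ('b, 'err) t) -> ('b, 'err) t
  | Return : ('a, 'err) result -> ('a, 'err) t
  | Write : string -> (unit, 'err) t

(* Errors of our toy system (hypothetical, not the ones from Tar). *)
type error = [ `Unexpected_end_of_input | `Read_only ]

(* Interpret a description against an in-memory string. *)
let run (input : string) prog =
  let pos = ref 0 in
  let take len =
    let str = String.sub input !pos len in
    pos := !pos + len;
    str
  in
  let rec go : type a. (a, error) t -> (a, error) result = function
    | Return res -> res
    | Bind (x, f) -> (
        match go x with Ok v -> go (f v) | Error err -> Error err)
    | Seek len ->
        (* Assumption: [Seek] skips [len] bytes forward. *)
        pos := min (String.length input) (!pos + len);
        Ok ()
    | Read len ->
        (* [Read] may return fewer bytes than requested. *)
        Ok (take (min len (String.length input - !pos)))
    | Really_read len ->
        (* [Really_read] returns exactly [len] bytes or fails. *)
        if !pos + len > String.length input
        then Error `Unexpected_end_of_input
        else Ok (take len)
    | Write _ -> Error `Read_only
  in
  go prog

let ( let* ) x f = Bind (x, f)

(* A description: it performs no I/O until it is given to [run]. *)
let prog : (string * string, error) t =
  let* magic = Really_read 5 in
  let* () = Seek 1 in
  let* rest = Read 100 in
  Return (Ok (magic, rest))
```

With this interpreter, `run "ustar rest" prog` evaluates to `Ok ("ustar", "rest")`. The same `prog` could be given to another `run` written for Unix or for a block device, which is the whole point of describing the computation as a value.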
However, and this is where we come back to OCaml's limitations and where
functors could help us: higher kinded polymorphism!
### Higher kinded polymorphism
If we return to our functor example above, there's one element that may be of
interest: `T with type t = T.t succ`
In other words, we add a constraint to a type of the signature. A constraint often seen
with Mirage (but deprecated now according to [this issue][mirage-lwt]) is the
type `io` and its constraint: `type 'a io`, `with type 'a io = 'a Lwt.t`.
So we had this type in Tar. The problem is that our GADT can't understand that
sometimes it will have to manipulate _Lwt_ values, sometimes _Async_ or
sometimes _Eio_ (or _Miou_!). In other words: how do we compose our `Bind` with
the `Bind` of these three targets? The difficulty lies above all in history:
supporting this library requires us to assume a certain compatibility with
applications over which we have no control. What's more, we need to maintain
support for all three libraries without imposing one.
<hr />
A small digression at this stage seems important to us, as we've been working
in this way for over 10 years. Of course, despite all the solutions mentioned
above, not depending on a system (and/or a scheduler) also allows us to ensure
the existence of libraries like Tar over more than a decade! The OCaml ecosystem
is changing, and choosing this or that library to facilitate the development of
an application has implications we might regret 10 years down the line (for
example... `Cstruct.t`!). So, it can be challenging to ensure compatibility with
all systems, but the result is libraries steeped in the experience and know-how
of many developers!
<hr />
So, and this is why we talk about Higher Kinded Polymorphism, how do we abstract
the `t` from `'a t` (to replace it with `Lwt.t` or even with a type such as
`type 'a t = 'a`)? This is where we're going to use the trick explained in
[this paper][hkt]. The trick is to consider a "new type" that will represent our
monad (lwt or async) and inject/project a value from this monad to something
understandable by our GADT: `High : ('a, 't) io -> ('a, 'err, 't) t`.
```ocaml
type ('a, 't) io

type ('a, 'err, 't) t =
  | Really_read : int -> (string, 'err, 't) t
  | Read : int -> (string, 'err, 't) t
  | Seek : int -> (unit, 'err, 't) t
  | Bind : ('a, 'err, 't) t * ('a -> ('b, 'err, 't) t) -> ('b, 'err, 't) t
  | Return : ('a, 'err) result -> ('a, 'err, 't) t
  | Write : string -> (unit, 'err, 't) t
  | High : ('a, 't) io -> ('a, 'err, 't) t
```
Next, we need to create this new type according to the chosen scheduler. Let's
take _Lwt_ as an example:
```ocaml
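module Make (X : sig type 'a t end) = struct
  type t (* our new type *)
  type 'a s = 'a X.t

  external inj : 'a s -> ('a, t) io = "%identity"
  external prj : ('a, t) io -> 'a s = "%identity"
end

module L = Make(Lwt)

open Lwt.Infix

let rec run
  : type a err. (a, err, L.t) t -> (a, err) result Lwt.t
  = function
  | High v -> L.prj v >|= fun v -> Ok v
  | Return v -> Lwt.return v
  | Bind (x, f) ->
    (* stop at the first error, otherwise continue with the value *)
    run x >>= (function
      | Ok value -> run (f value)
      | Error err -> Lwt.return (Error err))
  | _ -> ... (* [Really_read], [Read], [Seek] and [Write] depend on the system *)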
```
So, as you can see, it's a real trick, one not to try at home unsupervised.
Indeed, the use of `%identity` corresponds to an `Obj.magic`! So even
if the `io` type is exposed (to let the user derive Tar for their own system),
this trick is not exposed to other packages, and we instead provide helpers
such as:
```ocaml
val lwt : 'a Lwt.t -> ('a, 'err, lwt) t
val miou : 'a -> ('a, 'err, miou) t
```
But this way, Tar can always be derived from another system, and the process for
extracting entries from a Tar file is the same for **all** systems!
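To show that the trick does not depend on Lwt, here is a self-contained sketch with a deliberately trivial "scheduler" where a computation is just a thunk (`unit -> 'a`). The `Thunk_monad` and `Thunk` modules, the `thunk` helper and the reduced GADT (only `Bind`, `Return` and `High`) are assumptions made for this example.

```ocaml
type ('a, 't) io

(* A reduced version of the GADT: only what we need for the example. *)
type ('a, 'err, 't) t =
  | Bind : ('a, 'err, 't) t * ('a -> ('b, 'err, 't) t) -> ('b, 'err, 't) t
  | Return : ('a, 'err) result -> ('a, 'err, 't) t
  | High : ('a, 't) io -> ('a, 'err, 't) t

module Make (X : sig type 'a t end) = struct
  type t (* our new type *)
  type 'a s = 'a X.t

  external inj : 'a s -> ('a, t) io = "%identity"
  external prj : ('a, t) io -> 'a s = "%identity"
end

(* Hypothetical scheduler: a computation is a function waiting to be called. *)
module Thunk_monad = struct type 'a t = unit -> 'a end
module Thunk = Make (Thunk_monad)

(* The helper exposed to the user, in the spirit of [lwt] and [miou]. *)
let thunk (fn : unit -> 'a) : ('a, 'err, Thunk.t) t = High (Thunk.inj fn)

let rec run : type a err. (a, err, Thunk.t) t -> (a, err) result = function
  | High v -> Ok ((Thunk.prj v) ())
  | Return res -> res
  | Bind (x, f) -> (
      match run x with Ok v -> run (f v) | Error err -> Error err)

let prog : (int, string, Thunk.t) t =
  Bind (thunk (fun () -> 20), fun x ->
  Bind (thunk (fun () -> x + 1), fun y ->
  Return (Ok (x + y))))
```

Here `run prog` evaluates to `Ok 41`. The phantom type `Thunk.t` is what prevents a value injected for one scheduler from being projected by the `run` of another one, which keeps the `%identity` casts behind a safe interface.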
## Conclusion
This Tar release isn't as impressive as this article, but it does sum up all the
work we've been able to do over the last few months and years. We hope that our
work is appreciated and that this article, which sets out all the thoughts we've
had (and still have), helps you to better understand our work!
[opam-mirror]: https://hannes.robur.coop/Posts/OpamMirror
[discuss-cstruct]: https://discuss.ocaml.org/t/buffered-io-bytes-vs-bigstring/8978/3
[parmap]: https://github.com/rdicosmo/parmap
[digestif]: https://github.com/mirage/digestif
[decompress]: https://github.com/mirage/decompress
[pr-utcp]: https://github.com/robur-coop/utcp/pull/29
[utcp]: https://github.com/robur-coop/utcp
[mmap]: https://ocaml.org/manual/5.2/api/Unix.html#1_Mappingfilesintomemory
[minor-alloc]: https://github.com/ocaml/ocaml/blob/744006bfbfa045cc1ca442ff7b52c2650d2abe32/runtime/alloc.c#L175
[bigarray-minor]: https://github.com/ocaml/ocaml/pull/92
[http-lwt-client-bug]: https://github.com/robur-coop/http-lwt-client/pull/16
[cstruct-cap]: https://github.com/mirage/ocaml-cstruct/pull/237
[gc-bigarray-pressure]: https://github.com/ocaml/ocaml/issues/7750
[better-bigarray-free]: https://github.com/ocaml/ocaml/pull/1738
[ocaml-getters]: https://github.com/ocaml/ocaml/pull/1864
[ocaml-solo5]: https://github.com/mirage/ocaml-solo5
[nojb-response]: https://discuss.ocaml.org/t/best-practices-and-design-patterns-for-supporting-concurrent-io-in-libraries/15001/4?u=dinosaure
[ocaml-tls]: https://github.com/mirleft/ocaml-tls
[cohttp]: https://github.com/mirage/ocaml-cohttp
[poor-man-functor]: https://github.com/mirage/colombe/blob/07cd4cf134168ecd841924ee7ddda1a9af8fbd5a/src/sigs.ml#L13-L16
[mimic]: https://github.com/dinosaure/mimic
[ptime]: https://github.com/dbuenzli/ptime
[mtime]: https://github.com/dbuenzli/mtime
[dune-variants]: https://github.com/ocaml/dune/pull/1207
[decompress-lzo]: https://github.com/mirage/decompress/blob/c8301ba674e037b682338958d6d0bb5c42fd720e/lib/lzo.ml#L164-L175
[ocaml-git]: https://github.com/mirage/ocaml-git
[mirage-lwt]: https://github.com/mirage/mirage/issues/1004#issue-507517315
[hkt]: https://www.cl.cam.ac.uk/~jdy22/papers/lightweight-higher-kinded-polymorphism.pdf