The new Tar release, a retrospective
We are delighted to announce the new release of ocaml-tar, a small library for reading and writing tar archives in OCaml. Since this is a major release, we'll take the time in this article to explain the work that's been done by the cooperative on this project.
Tar is an old project. Originally written by David Scott as part of Mirage, this project is particularly interesting for building bridges between the tools we can offer and what already exists. Tar is, in fact, widely used. So we're both dealing with a format that's older than I am (but I'm used to it by email) and a project that's been around since... 2012 (over 10 years!).
But we intend to maintain and improve it, since we're using it for the opam-mirror project among other things. This unikernel provides an opam-repository "tarball" for opam when you run opam update.
Cstruct.t & bytes
As some of you may have noticed, over the last few months we've begun a fairly substantial change to the Mirage ecosystem, replacing the use of Cstruct.t in key places with bytes/string.
This choice is based on 2 considerations:
- we came to realize that Cstruct.t could be very costly in terms of performance
- Cstruct.t remains a "Mirage" structure; outside the Mirage ecosystem, the use of Cstruct.t is not so "obvious".
The pull-request is available here: https://github.com/mirage/ocaml-tar/pull/137. The discussion is an interesting way to discover common bugs (uninitialized buffer, invalid access). There's also a small benchmark to support our initial intuition [1].
But this PR can also be an opportunity to understand the existence of Cstruct.t in the Mirage ecosystem and the reasons for this historic choice.
Cstruct.t as non-moveable data
I've already made a list of pros/cons when it comes to bigarrays. Indeed, Cstruct.t is based on a bigarray:
type buffer = (char, Bigarray.int8_unsigned_elt, Bigarray.c_layout) Bigarray.Array1.t

type t =
  { buffer : buffer
  ; off : int
  ; len : int }

The experienced reader may rightly wonder why Cstruct.t is a bigarray with off and len. First, we need to clarify what a bigarray is for OCaml.
A bigarray is a somewhat special value in OCaml. This value is allocated in the C heap. In other words, its contents are not in the heap managed by OCaml's garbage collector, but exist outside it. The first (and very important) implication of this feature is that the contents of a bigarray do not move (even if the GC tries to defragment the memory). This feature has several advantages:
- in parallel programming, it can be very interesting to use a bigarray knowing that, from the point of view of the 2 processes, the position of the bigarray will never change. This is essentially what parmap does (before OCaml 5).
- for calculations such as checksum or hash, it can be interesting to use a bigarray. The calculation would not be interrupted by the GC since the bigarray does not move. The calculation can therefore be continued at the same point, which can help the CPU to better predict the next stage of the calculation. This is what digestif offers and what decompress requires.
- for one reason or another, particularly when interacting with something other than OCaml, you need to offer a memory zone that cannot move. This is particularly true for unikernels as Xen guests (where the net device corresponds to a fixed memory zone with which we need to interact) or mmap.
- there are other subtleties more related to the way OCaml compiles. For example, using bigarray layouts to manipulate "bigger words" can really have an impact on performance, as this PR on utcp shows.
- finally, it may be useful to store sensitive information in a bigarray so as to have the opportunity to clean up this information as quickly as possible (ensuring that the GC has not made a copy) in certain situations.
All these examples show that bigarrays can be of real interest as long as their uses are properly contextualized, which ultimately remains very specific. Our experience of using them in Mirage has shown us their advantages, but also, and above all, their disadvantages:
- keep in mind that bigarray allocation uses either a system call like mmap or malloc(). The latter, compared with what OCaml can offer, is slow. As soon as you need to allocate bytes/strings smaller than (256 * words), these values are allocated in the minor heap, which is incredibly fast to allocate (3 processor instructions which can be predicted very well). So, preferring to allocate a 10-byte bigarray rather than a 10-byte bytes penalizes you enormously.
- since the bigarray exists in the C heap, the GC has a special mechanism for knowing when to free() the zone as soon as the value is no longer in use. Reference-counting is used to then allocate "small" values in the OCaml heap and use them to manipulate indirectly the bigarray.
Ownership, proxy and GC
This last point deserves a little clarification, particularly with regard to the Bigarray.sub function. This function will not create a new, smaller bigarray and copy what was in the old one to the new one (as Bytes.sub/String.sub does). In fact, OCaml will allocate a "proxy" of your bigarray that represents a subfield. This is where reference-counting comes in. This proxy value needs the initial bigarray to be manipulated. So, as long as proxies exist, the GC cannot free() the initial bigarray.
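To make the difference concrete, here is a minimal sketch using only the standard library (where the function is spelled Bigarray.Array1.sub for one-dimensional bigarrays). The two function names are ours and purely illustrative: the first checks that a write through the proxy is visible in the original bigarray, the second checks that Bytes.sub really produced an independent copy.

```ocaml
(* A proxy returned by Bigarray.Array1.sub shares memory with its parent. *)
let sub_shares_memory () =
  let open Bigarray in
  let big = Array1.create char c_layout 8 in
  Array1.fill big 'a';
  let view = Array1.sub big 2 4 in (* no copy: a view on [big] *)
  Array1.set view 0 'z';
  Array1.get big 2 = 'z'

(* Bytes.sub allocates a fresh buffer and copies the requested range. *)
let bytes_sub_copies () =
  let buf = Bytes.make 8 'a' in
  let copy = Bytes.sub buf 2 4 in
  Bytes.set copy 0 'z';
  Bytes.get buf 2 = 'a'
```

Both functions return true: the bigarray proxy aliases the original memory, while the bytes copy leaves the original untouched.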
This poses several problems:
- the first is the allocation of these proxies. They can help us to manipulate the initial bigarray in several places without copying it, but as time goes by, these proxies could be very expensive
- the second is GC intervention. You still need to scan the bigarray, in a particular way, to know whether or not to keep it. This particular scan, once again in time immemorial, was not all that common.
- the third concerns bigarray ownership. Since we're talking about proxies, we can imagine 2 competing tasks having access to the same bigarray.
As far as the first point is concerned, Bigarray.sub could still be "slow" for small data since the proxy was, de facto (since a bigarray always has a finalizer, don't forget reference counting!), allocated in the major heap. And, in truth, this is perhaps the main reason for the existence of Cstruct: to have a "proxy" to a bigarray allocated in the minor heap (and to be fast). But since Pierre Chambart's PR#92, the problem is no more.
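The idea of a cheap, minor-heap proxy can be sketched with the record shown earlier. This is an illustration under our own assumptions, not Cstruct's actual implementation: of_string, sub and get are hypothetical helpers written for this example. The point is that sub only allocates a small three-field record and never calls Bigarray.Array1.sub.

```ocaml
type buffer =
  (char, Bigarray.int8_unsigned_elt, Bigarray.c_layout) Bigarray.Array1.t

type t = { buffer : buffer; off : int; len : int }

(* Hypothetical helper: copy a string into a fresh bigarray. *)
let of_string s =
  let len = String.length s in
  let buffer = Bigarray.Array1.create Bigarray.char Bigarray.c_layout len in
  String.iteri (fun i c -> Bigarray.Array1.set buffer i c) s;
  { buffer; off = 0; len }

(* [sub] only allocates a small record: the bigarray itself is shared. *)
let sub t off len =
  if off < 0 || len < 0 || off + len > t.len then invalid_arg "sub";
  { t with off = t.off + off; len }

(* Accesses are translated by the offset stored in the record. *)
let get t i =
  if i < 0 || i >= t.len then invalid_arg "get";
  Bigarray.Array1.get t.buffer (t.off + i)
```

With this representation, taking a slice of a slice never creates a new bigarray proxy; only the off and len fields change.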
The second point, on the other hand, is still topical, even if we can see that considerable efforts have been made. What we see every day on our unikernels is the pressure that can be put on the GC when it comes to bigarrays. Indeed, bigarrays use memory and making the C heap cohabit with the OCaml heap inevitably comes at a cost. As far as unikernels are concerned, which have a more limited memory than an OCaml application, we reach this limit rather quickly and we therefore ask the GC to work more specifically on our 10 or 20 byte bigarrays...
Finally, the third point can be the toughest. On several occasions, we've noticed concurrent accesses on our bigarrays that we didn't want (for example, http-lwt-client had this problem). In our experience, it's very difficult to observe and know that there is indeed an unauthorized concurrent access changing the contents of our buffer. In this respect, the question remains open as regards Cstruct.t and the possibility of encoding ownership of a Cstruct.t in the type to prevent unauthorized access. This PR is interesting to see all the discussions that have taken place on this subject [2].
It should be noted that, with regard to the third point, the problem also applies to bytes and the use of Bytes.unsafe_to_string!
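A small standard-library sketch of that pitfall (the function names aliased and copied are ours): Bytes.unsafe_to_string does not copy, so a later write to the bytes silently changes the supposedly immutable string, whereas Bytes.to_string pays for a copy and stays safe.

```ocaml
(* The string returned by unsafe_to_string aliases the mutable buffer. *)
let aliased () =
  let buf = Bytes.of_string "tar" in
  let s = Bytes.unsafe_to_string buf in
  Bytes.set buf 0 'j'; (* also changes [s] *)
  s

(* Bytes.to_string copies, so later writes do not affect the string. *)
let copied () =
  let buf = Bytes.of_string "tar" in
  let s = Bytes.to_string buf in
  Bytes.set buf 0 'j';
  s
```

Here aliased () returns "jar" while copied () returns "tar".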
Conclusion about Cstruct
We hope we've been thorough enough in our experience with Cstruct. If we go back to the initial definition of our Cstruct.t shown above and take all the history into account, it becomes increasingly difficult to argue for a systematic use of Cstruct in our unikernels. In fact, the question of Cstruct.t versus bytes/string remains completely open.
It's worth noting that the original reasons for Cstruct.t are no longer really relevant if we consider how OCaml has evolved. It should also be noted that this systematic approach to using Cstruct.t rather than bytes/string has cost us.
This is not to say that Cstruct.t is obsolete. The library is very good and offers an API where manipulating bytes to extract information such as a TCP/IP packet remains more pleasant than directly using bytes (even if, here too, efforts have been made).
As far as ocaml-tar is concerned, what really counts is the possibility for other projects to use this library without requiring Cstruct.t, thus facilitating its adoption. In other words, given the advantages/disadvantages of Cstruct.t, we felt it would be a good idea to remove this dependency.
[1] decompress, which uses bigarrays. So there's some copying involved (from bytes to bigarrays)! But despite this copying, it seems that the change is worthwhile.
[2] Cstruct_cap has not been used anywhere, which raises a real question about the advantages/disadvantages in everyday use.
Functors
This is perhaps the other point of the Mirage ecosystem that is also the subject of debate. Functors! Before we talk about functors, we need to understand their relevance in the context of Mirage.
Mirage transforms an application into an operating system. What's the difference between a "normal" application and a unikernel? The "subsystem" with which you interact. In this case, a normal application will interact with the host system, whereas a unikernel will have to interact with the Solo5 mini-system.
What Mirage is trying to offer is the ability for an application to transform itself into either without changing a thing! Mirage's aim is to inject the subsystem into your application. In this case:
- inject unix.cmxa when you want a Mirage application to become a simple executable
- inject ocaml-solo5 when you want to produce a unikernel
So we're not going to talk about the pros and cons of this approach here, but consider this feature as one that requires us to use functors.
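As a rough illustration of this kind of injection, here is a toy sketch. The SYSTEM signature, the App functor and the two fake subsystems are all hypothetical and much simpler than what Mirage actually generates; output goes to buffers so that the example stays self-contained.

```ocaml
(* What the application expects from its subsystem. *)
module type SYSTEM = sig
  val name : string
  val write : string -> unit
end

(* The application is written once, against the signature only. *)
module App (S : SYSTEM) = struct
  let greet () = S.write ("hello from " ^ S.name)
end

let host_log = Buffer.create 16
let solo5_log = Buffer.create 16

(* Two stand-in subsystems; real ones would talk to Unix or Solo5. *)
module Host : SYSTEM = struct
  let name = "unix"
  let write = Buffer.add_string host_log
end

module Solo5 : SYSTEM = struct
  let name = "solo5"
  let write = Buffer.add_string solo5_log
end

(* Choosing the target is just a functor application. *)
module On_host = App (Host)
module On_solo5 = App (Solo5)
```

The body of App never changes; only the module passed to it does.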
Indeed, what's the best way in OCaml to inject one implementation into another? Functors. There are definite advantages here too, but we're going to concentrate on one in particular: the expressiveness of types at module level (which can be used as arguments to our functors).
For example, did you know that OCaml has a dependent type system?
type 'a nat = Zero : zero nat | Succ : 'a nat -> 'a succ nat
and zero = |
and 'a succ = S

module type T = sig type t val v : t nat end
module type Rec = functor (T:T) -> T
module type Nat = functor (S:Rec) -> functor (Z:T) -> T

module Zero = functor (S:Rec) -> functor (Z:T) -> Z
module Succ = functor (N:Nat) -> functor (S:Rec) -> functor (Z:T) -> S(N(S)(Z))
module Add = functor (X:Nat) -> functor (Y:Nat) -> functor (S:Rec) -> functor (Z:T) -> X(S)(Y(S)(Z))

module One = Succ(Zero)
module Two_a = Add(One)(One)
module Two_b = Succ(One)

module Z : T with type t = zero = struct
  type t = zero
  let v = Zero
end

module S (T:T) : T with type t = T.t succ = struct
  type t = T.t succ
  let v = Succ T.v
end

module A = Two_a(S)(Z)
module B = Two_b(S)(Z)

type ('a, 'b) refl = Refl : ('a, 'a) refl

let _ : (A.t, B.t) refl = Refl (* 1+1 == succ 1 *)

The code is ... magical, but it shows that two differently constructed modules (Two_a & Two_b) ultimately produce the same type, and OCaml is able to prove this equality. Above all, the example shows just how powerful functors can be. But it also shows just how difficult functors can be to understand and use.
In fact, this is one of Mirage's biggest drawbacks: the overuse of functors makes the code difficult to read and understand. It can be difficult to deduce in your head the type that results from an application of functors, and the constraints associated with it... (yes, I don't use merlin).
But back to our initial problem: injection! In truth, the functor is a sledgehammer used to kill a fly in most cases. There are many other ways of injecting what the system would be (and how to do a read or write) into an implementation. The best example, as @nojb pointed out, is of course ocaml-tls. This answer also shows a contrast between the functor approach (with CoHTTP for example) and the "pure value-passing interface" of ocaml-tls.
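To give an idea of what a value-passing interface looks like, here is a toy decoder for a made-up length-prefixed format (one length byte followed by that many payload bytes). It is unrelated to the real API of ocaml-tls or decompress; the state type, decode and the entries driver are hypothetical names chosen for this sketch. The decoder never performs I/O: it takes the bytes it asked for and answers with the next thing the caller should do.

```ocaml
type state = Length | Payload

(* Pure step function: no system call, no scheduler, only values. *)
let decode state data =
  match state with
  | Length -> (Payload, `Read (Char.code data.[0]))
  | Payload -> (Length, `Entry data)

(* One possible driver: here the "system" is just an in-memory string.
   A Unix, Lwt or unikernel driver would have the same shape. *)
let entries input =
  let rec go state pos want acc =
    if pos + want > String.length input then List.rev acc
    else
      let data = String.sub input pos want in
      match decode state data with
      | state, `Read n -> go state (pos + want) n acc
      | state, `Entry e -> go state (pos + want) 1 (e :: acc)
  in
  go Length 0 1 []
```

Only the driver knows how bytes are obtained, so the same decode function can be reused with any system.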
What's more, we've been trying to find other approaches for injecting the system we want for several years now. We can already list several:
- ocaml-tls' "value-passing" approach, of course, but also decompress
- of course, there's the passing of a record (a sort of mini-module with fewer possibilities with types, but which does the job, a poor man's functor, in short) which would have the functions to perform the system's operations
- mimic can be used to inject a module as an implementation of a flow/stream according to a resolution mechanism (DNS, /etc/services, etc.), a little closer to the idea of runtime-resolved implicit implementations
- there are, of course, the variants (but if we go back to 2010, this solution wasn't so obvious) popularized by ptime/mtime, digestif & dune
- and finally, GADTs, which describe what the process should do, then let the user implement the run function according to the system.
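The "poor man's functor" from the list above can be sketched as follows. The system record, copy and in_memory are hypothetical names for this illustration only; a real project would put whatever operations it needs in the record.

```ocaml
(* The system's operations, passed as an ordinary value. *)
type system = {
  read : int -> string;   (* at most [n] bytes, "" at end of input *)
  write : string -> unit;
}

(* Generic code only depends on the record, not on any system. *)
let rec copy sys chunk =
  match sys.read chunk with
  | "" -> ()
  | data ->
    sys.write data;
    copy sys chunk

(* An in-memory implementation, handy for tests. *)
let in_memory input =
  let pos = ref 0 and out = Buffer.create 16 in
  let read n =
    let len = min n (String.length input - !pos) in
    let data = String.sub input !pos len in
    pos := !pos + len;
    data
  in
  ({ read; write = Buffer.add_string out }, out)
```

Compared with a functor, the record cannot carry abstract types, but it can be built and swapped at runtime with no ceremony.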
In short, based on this list and the various experiments we've carried out on a number of projects, we've decided to remove the functors from ocaml-tar! The crucial question now is: which method to choose?
The best answers
There's no real answer to that, and in truth it depends on what level of abstraction you're at. In fact, you'd like to have a fairly simple method of abstraction from the system at the start and at the lowest level, to end up proposing a functor that does all the ceremony (the glue between your implementation and the system) at the end. That's what ocaml-git does, for example.
The abstraction you choose also depends on how the process is going to work. As far as streams/protocols are concerned, the ocaml-tls/decompress approach still seems the best. But when it comes to introspecting a file/block-device, it may be preferable to use a GADT that will force the user to implement an arbitrary memory access rather than consume a sequence of bytes. In short, at this stage, experience speaks for itself and, just as we were wrong about functors, we won't be advising you to use this or that solution.
But based on our experience of ocaml-tls & decompress with LZO (which requires arbitrary access to the content) and the way Tar works, we decided to use a "value-passing" approach (to describe when we need to read/write) and a GADT to describe calculations such as:
- iterating over the files/folders contained in a Tar document
- producing a Tar file according to a "dispenser" of inputs
val decode : decode_state -> string ->
  decode_state
  * [ `Read of int
    | `Skip of int
    | `Header of Header.t ] option
  * Header.Extended.t option
(** [decode state] returns a new state and what the user should do next:
    - [`Skip] skip bytes
    - [`Read] read bytes
    - [`Header hdr] do something according to the last header extracted
      (like stream-out the contents of a file). *)

type ('a, 'err) t =
  | Really_read : int -> (string, 'err) t
  | Read : int -> (string, 'err) t
  | Seek : int -> (unit, 'err) t
  | Bind : ('a, 'err) t * ('a -> ('b, 'err) t) -> ('b, 'err) t
  | Return : ('a, 'err) result -> ('a, 'err) t
  | Write : string -> (unit, 'err) t

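To show how such a GADT is meant to be used, here is a minimal interpreter that runs a program against an in-memory string. It is a sketch under our own assumptions, not the run function shipped with ocaml-tar: the error type, the let* operator, the program value, and the reading of Seek n as "skip n bytes forward" are all choices made for this example.

```ocaml
type ('a, 'err) t =
  | Really_read : int -> (string, 'err) t
  | Read : int -> (string, 'err) t
  | Seek : int -> (unit, 'err) t
  | Bind : ('a, 'err) t * ('a -> ('b, 'err) t) -> ('b, 'err) t
  | Return : ('a, 'err) result -> ('a, 'err) t
  | Write : string -> (unit, 'err) t

type error = [ `Eof ]

let ( let* ) x f = Bind (x, f)

(* Interpret a program: [input] is the source, [output] collects writes. *)
let run input output program =
  let pos = ref 0 in
  let take n =
    let len = min n (String.length input - !pos) in
    let s = String.sub input !pos len in
    pos := !pos + len;
    s
  in
  let rec go : type a. (a, error) t -> (a, error) result = function
    | Read n -> Ok (take n)
    | Really_read n ->
      if String.length input - !pos < n then Error `Eof else Ok (take n)
    | Seek n ->
      (* assumption: skip [n] bytes forward, clamped to the end *)
      pos := min (String.length input) (!pos + n);
      Ok ()
    | Write s ->
      Buffer.add_string output s;
      Ok ()
    | Return r -> r
    | Bind (x, f) ->
      (match go x with
       | Ok v -> go (f v)
       | Error e -> Error e)
  in
  go program

(* A description of what to do, independent of any system. *)
let program : (int, error) t =
  let* name = Really_read 3 in
  let* () = Seek 1 in
  let* body = Read 16 in
  let* () = Write (name ^ ":" ^ body) in
  Return (Ok (String.length body))
```

The program value never touches a file descriptor; a Unix or unikernel backend would only need its own version of run.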
However, and this is where we come back to OCaml's limitations and where functors could help us: higher kinded polymorphism!
Higher kinded Polymorphism
If we return to our functor example above, there's one element that may be of interest: T with type t = T.t succ
In other words, add a constraint to a signature type. A constraint often seen with Mirage (but deprecated now according to this issue) is the type io and its constraint: type 'a io, with type 'a io = 'a Lwt.t.
So we had this type in Tar. The problem is that our GADT can't understand that sometimes it will have to manipulate Lwt values, sometimes Async or sometimes Eio (or Miou!). In other words: how do we compose our Bind with the Bind of these three targets? The difficulty lies above all in history. Supporting this library requires us to assume a certain compatibility with applications over which we have no control. What's more, we need to maintain support for all three libraries without imposing one.
A small digression at this stage seems important to us, as we've been working in this way for over 10 years. Of course, despite all the solutions mentioned above, not depending on a system (and/or a scheduler) also allows us to ensure the existence of libraries like Tar over more than a decade! The OCaml ecosystem is changing, and choosing this or that library to facilitate the development of an application has implications we might regret 10 years down the line (for example... Cstruct.t!). So, it can be challenging to ensure compatibility with all systems, but the result is libraries steeped in the experience and know-how of many developers!
So, and this is why we talk about Higher Kinded Polymorphism, how do we abstract the t from 'a t (to replace it with Lwt.t or even with a type such as type 'a t = 'a)? This is where we're going to use the trick explained in this paper. The trick is to consider a "new type" that will represent our monad (lwt or async) and inject/project a value from this monad to something understandable by our GADT: High : ('a, 't) io -> ('a, 'err, 't) t.
type ('a, 't) io

type ('a, 'err, 't) t =
  | Really_read : int -> (string, 'err, 't) t
  | Read : int -> (string, 'err, 't) t
  | Seek : int -> (unit, 'err, 't) t
  | Bind : ('a, 'err, 't) t * ('a -> ('b, 'err, 't) t) -> ('b, 'err, 't) t
  | Return : ('a, 'err) result -> ('a, 'err, 't) t
  | Write : string -> (unit, 'err, 't) t
  | High : ('a, 't) io -> ('a, 'err, 't) t

Next, we need to create this new type according to the chosen scheduler. Let's take Lwt as an example:
open Lwt.Infix (* for >>= and >|= *)

module Make (X : sig type 'a t end) = struct
  type t (* our new type *)
  type 'a s = 'a X.t

  external inj : 'a s -> ('a, t) io = "%identity"
  external prj : ('a, t) io -> 'a s = "%identity"
end

module L = Make(Lwt)

let rec run
  : type a err. (a, err, L.t) t -> (a, err) result Lwt.t
  = function
  | High v -> L.prj v >|= fun v -> Ok v (* wrap the Lwt value in a result *)
  | Return v -> Lwt.return v
  | Bind (x, f) ->
    (* continue on success, propagate the error otherwise *)
    run x >>= (function
      | Ok value -> run (f value)
      | Error err -> Lwt.return (Error err))
  | _ -> ...

So, as you can see, this is a real trick, and not one to try at home unsupervised. Indeed, the use of %identity corresponds to an Obj.magic! So even if the io type is exposed (to let the user derive Tar for their own system), this trick is not exposed for other packages, and we instead suggest helpers such as:
val lwt : 'a Lwt.t -> ('a, 'err, lwt) t
val miou : 'a -> ('a, 'err, miou) t

But this way, Tar can always be derived from another system, and the process for extracting entries from a Tar file is the same for all systems!
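For readers who want to play with the trick without installing a scheduler, here is a reduced, dependency-free sketch. It keeps only three constructors of the GADT and replaces Lwt with two toy "schedulers" of our own invention: Direct (type 'a t = 'a, as mentioned above) and Thunk (a delayed computation). The names Direct, Thunk, run_direct, run_thunk and double are hypothetical.

```ocaml
type ('a, 't) io

type ('a, 'err, 't) t =
  | Return : ('a, 'err) result -> ('a, 'err, 't) t
  | Bind : ('a, 'err, 't) t * ('a -> ('b, 'err, 't) t) -> ('b, 'err, 't) t
  | High : ('a, 't) io -> ('a, 'err, 't) t

module Make (X : sig type 'a t end) = struct
  type t (* the "brand" standing for the type constructor X.t *)
  type 'a s = 'a X.t

  external inj : 'a s -> ('a, t) io = "%identity"
  external prj : ('a, t) io -> 'a s = "%identity"
end

(* Toy stand-ins for real schedulers such as Lwt or Miou. *)
module Direct = struct type 'a t = 'a end
module Thunk = struct type 'a t = unit -> 'a end

module D = Make (Direct)
module K = Make (Thunk)

let rec run_direct : type a err. (a, err, D.t) t -> (a, err) result D.s =
  function
  | Return r -> r
  | High v -> Ok (D.prj v)
  | Bind (x, f) ->
    (match run_direct x with
     | Ok v -> run_direct (f v)
     | Error e -> Error e)

let rec run_thunk : type a err. (a, err, K.t) t -> (a, err) result K.s =
  function
  | Return r -> (fun () -> r)
  | High v -> (fun () -> Ok (K.prj v ()))
  | Bind (x, f) ->
    (fun () ->
       match run_thunk x () with
       | Ok v -> run_thunk (f v) ()
       | Error e -> Error e)

(* One program, written once, usable with both schedulers. *)
let double high = Bind (high, fun n -> Return (Ok (n * 2)))
```

The brand type ('t) is the only thing that ties a High value to the interpreter able to project it back, which is why inj and prj should stay hidden behind helpers.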
Conclusion
This Tar release isn't as impressive as this article, but it does sum up all the work we've been able to do over the last few months and years. We hope that our work is appreciated and that this article, which sets out all the thoughts we've had (and still have), helps you to better understand our work!