diff --git a/articles/tar-release.html b/articles/tar-release.html new file mode 100644 index 0000000..91ed602 --- /dev/null +++ b/articles/tar-release.html @@ -0,0 +1,406 @@ + + + + + + + + + Robur's blog - The new Tar release, a retrospective + + + + + + + + +
+

blog.robur.coop

+
+ The Robur cooperative blog. +
+
+
Back to index + +
+

The new Tar release, a retrospective

+

We are delighted to announce the new release of ocaml-tar, a small library for +reading and writing tar archives in OCaml. Since this is a major release, we'll +take the time in this article to explain the work that's been done by the +cooperative on this project.

+

Tar is an old project. Originally written by David Scott as part of Mirage, +this project is particularly interesting for building bridges between the tools +we can offer and what already exists. Tar is, in fact, widely used. So we're +both dealing with a format that's older than I am (but I'm used to it by email) +and a project that's been around since... 2012 (over 10 years!).

+

But we intend to maintain and improve it, since we're using it for the +opam-mirror project among other things - this unikernel +provides an opam-repository "tarball" to opam when you run opam update.

+

Cstruct.t & bytes

+

As some of you may have noticed, over the last few months we've begun a fairly +substantial change to the Mirage ecosystem, replacing the use of Cstruct.t in +key places with bytes/string.

+

This choice is based on 2 considerations:

+ +

The pull-request is available here: https://github.com/mirage/ocaml-tar/pull/137. +The discussion is worth reading to discover common bugs (uninitialized +buffer, invalid access). There's also a small benchmark to support our initial +intuition[1].

+

But this PR can also be an opportunity to understand the existence of +Cstruct.t in the Mirage ecosystem and the reasons for this historic choice.

+

Cstruct.t as non-movable data

+

I've already made a list of pros/cons when it comes to +bigarrays. Indeed, Cstruct.t is based on a bigarray:

+
type buffer = (char, Bigarray.int8_unsigned_elt, Bigarray.c_layout) Bigarray.Array1.t
+
+type t =
+  { buffer : buffer
+  ; off : int
+  ; len : int }
+
+

The experienced reader may rightly wonder why Cstruct.t is a bigarray with off +and len. First, we need to clarify what a bigarray is for OCaml.

+

A bigarray is a somewhat special value in OCaml. This value is allocated in the +C heap. In other words, its contents are not in the heap managed by OCaml's garbage collector, but +exist outside it. The first (and very important) implication of this feature is +that the contents of a bigarray do not move (even if the GC compacts +the memory). This feature has several advantages:

+ +

All these examples show that bigarrays can be of real interest as long as +their uses are properly contextualized - which ultimately remains very +specific. Our experience of using them in Mirage has shown us their advantages, +but also, and above all, their disadvantages:

+ +

Ownership, proxy and GC

+

This last point deserves a little clarification, particularly with regard to the +Bigarray.Array1.sub function. This function will not create a new, smaller bigarray +and copy what was in the old one to the new one (as Bytes.sub/String.sub +do). In fact, OCaml will allocate a "proxy" of your bigarray that represents a +sub-range of it. This is where reference-counting comes in. This proxy value still needs +the memory of the initial bigarray in order to be used. So, as long as proxies exist, the GC +cannot free() the memory of the initial bigarray.
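
The difference between a proxy and a copy can be observed directly. The following is a minimal sketch using only the standard library; the names `ba`, `view`, `b` and `copy` are ours, chosen for illustration:

```ocaml
open Bigarray

(* A bigarray of 8 chars, and a sub-view over indices 2..5 of it. *)
let ba = Array1.create char c_layout 8
let () = Array1.fill ba 'a'
let view = Array1.sub ba 2 4          (* no copy: a proxy onto [ba] *)
let () = Array1.set view 0 'X'
let parent_sees_write = Array1.get ba 2 = 'X'

(* The same operations on bytes: [Bytes.sub] allocates and copies. *)
let b = Bytes.make 8 'a'
let copy = Bytes.sub b 2 4
let () = Bytes.set copy 0 'X'
let original_untouched = Bytes.get b 2 = 'a'
```

Writing through `view` changes `ba`, whereas writing to `copy` leaves `b` alone: the view and its parent share the same memory.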

+

This poses several problems:

+ +

As far as the first point is concerned, Bigarray.Array1.sub could still be "slow" for +small data since the proxy was, de facto (since a bigarray always has a finalizer - +don't forget reference counting!), allocated in the major heap. And, in truth, +this is perhaps the main reason for the existence of Cstruct! To have a "proxy" +to a bigarray allocated in the minor heap (and to be fast). But since +Pierre Chambart's PR#92, the problem is no more.
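
To illustrate why a record with off and len makes a cheap proxy, here is a simplified sketch. It is not the actual Cstruct implementation: `of_string`, `sub` and `get` are hypothetical helpers written for this example, and only the record type comes from the definition shown earlier.

```ocaml
type buffer =
  (char, Bigarray.int8_unsigned_elt, Bigarray.c_layout) Bigarray.Array1.t

type t = { buffer : buffer; off : int; len : int }

(* Copy a string into a fresh bigarray (illustration helper). *)
let of_string s =
  let len = String.length s in
  let buffer = Bigarray.Array1.create Bigarray.char Bigarray.c_layout len in
  String.iteri (fun i c -> Bigarray.Array1.set buffer i c) s;
  { buffer; off = 0; len }

(* A sub-view is just a new small record: the bigarray itself is
   shared and no bigarray proxy with a finalizer is created. *)
let sub t off len =
  if off < 0 || len < 0 || off + len > t.len then invalid_arg "sub";
  { t with off = t.off + off; len }

let get t i =
  if i < 0 || i >= t.len then invalid_arg "get";
  Bigarray.Array1.get t.buffer (t.off + i)

let whole = of_string "hello world"
let word = sub whole 6 5
```

Here `word` is an ordinary three-field OCaml record that points at the same `buffer` as `whole`; only the offset and length differ.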

+

The second point, on the other hand, is still topical, even if we can see that +considerable efforts have been made. What we see every +day on our unikernels is the pressure that can be put on +the GC when it comes to bigarrays. Indeed, bigarrays use memory and making the C +heap cohabit with the OCaml heap inevitably comes at a cost. As far as +unikernels are concerned, which have a more limited memory than an OCaml +application, we reach this limit rather quickly and we therefore ask the GC to +work more specifically on our 10 or 20 byte bigarrays...

+

Finally, the third point can be the toughest. On several occasions, we've +noticed concurrent accesses on our bigarrays that we didn't want (for example, +http-lwt-client had this problem). In our experience, +it's very difficult to observe and know that there is indeed an unauthorized +concurrent access changing the contents of our buffer. In this respect, the +question remains open as regards Cstruct.t and the possibility of encoding +ownership of a Cstruct.t in the type to prevent unauthorized access. +This PR is interesting to see all the discussions that have taken +place on this subject[2].

+

It should be noted that, with regard to the third point, the problem also +applies to bytes and the use of Bytes.unsafe_to_string!
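
The aliasing hazard with Bytes.unsafe_to_string can be sketched in a few lines. This is deliberately an example of what not to do: the standard library documentation requires that the bytes are no longer mutated after the conversion, and the names `b` and `s` are ours.

```ocaml
(* [s] is supposed to be an immutable string, but it shares its
   memory with [b]: no copy is made by [unsafe_to_string]. *)
let b = Bytes.of_string "tar"
let s = Bytes.unsafe_to_string b

(* Mutating [b] afterwards silently changes the "immutable" [s]. *)
let () = Bytes.set b 0 'j'
let changed = String.equal s "jar"
```

Nothing in the type of `s` records that someone else still holds a mutable handle on its contents, which is exactly the ownership problem described for Cstruct.t.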

+

Conclusion about Cstruct

+

We hope we've been thorough enough in our experience with Cstruct. If we go back +to the initial definition of our Cstruct.t shown above and take all the +history into account, it becomes increasingly difficult to argue for a +systematic use of Cstruct in our unikernels. In fact, the question of +Cstruct.t versus bytes/string remains completely open.

+

It's worth noting that the original reasons for Cstruct.t are no longer really +relevant if we consider how OCaml has evolved. It should also be noted that this +systematic approach to using Cstruct.t rather than bytes/string has cost us.

+

This is not to say that Cstruct.t is obsolete. The library is very good and +offers an API where manipulating bytes to extract information such as a TCP/IP +packet remains more pleasant than directly using bytes (even if, here too, +efforts have been made).

+

As far as ocaml-tar is concerned, what really counts is the possibility for +other projects to use this library without requiring Cstruct.t - thus +facilitating its adoption. In other words, given the advantages/disadvantages of +Cstruct.t, we felt it would be a good idea to remove this dependency.

+
+

1: It should be noted that the benchmark also concerns +compression. In this case, we use decompress, which uses bigarrays. So there's +some copying involved (from bytes to bigarrays)! But despite this copying, it +seems that the change is worthwhile.

+

2: It reminds me that we've been experimenting with +capabilities and using the type system to enforce certain characteristics. To +date, Cstruct_cap has not been used anywhere, which raises a real question +about the advantages/disadvantages in everyday use.

+

Functors

+

This is perhaps the other point of the Mirage ecosystem that is also the subject +of debate. Functors! Before we talk about functors, we need to understand their +relevance in the context of Mirage.

+

Mirage transforms an application into an operating system. What's the difference +between a "normal" application and a unikernel? The "subsystem" with which you +interact. In this case, a normal application will interact with the host system, +whereas a unikernel will have to interact with the Solo5 mini-system.

+

What Mirage is trying to offer is the ability for an application to transform +itself into either without changing a thing! Mirage's aim is to inject the +subsystem into your application. In this case:

+ +

So we're not going to talk about the pros and cons of this approach here, but +consider this feature as one that requires us to use functors.

+

Indeed, what's the best way in OCaml to inject one implementation into another? +Functors. There are definite advantages here too, but we're going to concentrate +on one in particular: the expressiveness of types at module level (which can be +used as arguments to our functors).
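
As a reminder of what injection through a functor looks like, here is a small self-contained sketch. The names `READER`, `Count` and `String_reader` are hypothetical and are not part of Mirage or ocaml-tar; the "system" is reduced to a single read function.

```ocaml
(* What the library expects from the system. *)
module type READER = sig
  type t
  val read : t -> bytes -> int   (* returns 0 at end of input *)
end

(* The library, parameterised over the system. *)
module Count (R : READER) = struct
  let total r =
    let buf = Bytes.create 4 in
    let rec go acc =
      match R.read r buf with
      | 0 -> acc
      | n -> go (acc + n)
    in
    go 0
end

(* One possible system: an in-memory string. *)
module String_reader = struct
  type t = { src : string; mutable pos : int }
  let of_string src = { src; pos = 0 }
  let read t buf =
    let n = min (Bytes.length buf) (String.length t.src - t.pos) in
    Bytes.blit_string t.src t.pos buf 0 n;
    t.pos <- t.pos + n;
    n
end

module C = Count (String_reader)
let total = C.total (String_reader.of_string "hello world")
```

Swapping `String_reader` for a module backed by a file descriptor or a block device would not require any change to `Count`.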

+

For example, did you know that OCaml's module language is expressive enough to compute on types, in the spirit of a dependent type system?

+
type 'a nat = Zero : zero nat | Succ : 'a nat -> 'a succ nat
+and zero = |
+and 'a succ = S
+
+module type T = sig type t val v : t nat end
+module type Rec = functor (T:T) -> T
+module type Nat = functor (S:Rec) -> functor (Z:T) -> T
+
+module Zero = functor (S:Rec) -> functor (Z:T) -> Z
+module Succ = functor (N:Nat) -> functor (S:Rec) -> functor (Z:T) -> S(N(S)(Z))
+module Add = functor (X:Nat) -> functor (Y:Nat) -> functor (S:Rec) -> functor (Z:T) -> X(S)(Y(S)(Z))
+
+module One = Succ(Zero)
+module Two_a = Add(One)(One)
+module Two_b = Succ(One)
+
+module Z : T with type t = zero = struct
+  type t = zero
+  let v = Zero
+end
+
+module S (T:T) : T with type t = T.t succ = struct
+  type t = T.t succ
+  let v = Succ T.v
+end
+
+module A = Two_a(S)(Z)
+module B = Two_b(S)(Z)
+
+type ('a, 'b) refl = Refl : ('a, 'a) refl
+
+let _ : (A.t, B.t) refl = Refl (* 1+1 == succ 1 *)
+
+

The code is ... magical, but it shows that two differently constructed modules +(Two_a & Two_b) ultimately produce the same type, and OCaml is able to prove +this equality. Above all, the example shows just how powerful functors can be. +But it also shows just how difficult functors can be to understand and use.

+

In fact, this is one of Mirage's biggest drawbacks: the overuse of functors +makes the code difficult to read and understand. It can be difficult to deduce +in your head the type that results from an application of functors, and the +constraints associated with it... (yes, I don't use merlin).

+

But back to our initial problem: injection! In truth, the functor is a +fly-killing sledgehammer in most cases. There are many other ways of injecting +what the system would be (and how to do a read or write) into an +implementation. The best example, as @nojb pointed out, is of +course ocaml-tls - this answer also shows a contrast between the +functor approach (with CoHTTP for example) and the "pure value-passing +interface" of ocaml-tls.

+

What's more, we've been trying to find other approaches for injecting the system +we want for several years now. We can already list several:

+ +

In short, based on this list and the various experiments we've carried out on a +number of projects, we've decided to remove the functors from ocaml-tar! The +crucial question now is: which method to choose?

+

The best answers

+

There's no real answer to that, and in truth it depends on what level of +abstraction you're at. In fact, you'd like to have a fairly simple method of +abstraction from the system at the start and at the lowest level, to end up +proposing a functor that does all the ceremony (the glue between your +implementation and the system) at the end - that's what ocaml-git +does, for example.

+

The abstraction you choose also depends on how the process is going to work. As +far as streams/protocols are concerned, the ocaml-tls/decompress approach +still seems the best. But when it comes to introspecting a file/block-device, it +may be preferable to use a GADT that will force the user to implement an +arbitrary memory access rather than consume a sequence of bytes. In short, at +this stage, experience speaks for itself and, just as we were wrong about +functors, we won't be advising you to use this or that solution.

+

But based on our experience of ocaml-tls & decompress with LZO (which +requires arbitrary access to the content) and the way Tar works, we decided to +use a "value-passing" approach (to describe when we need to read/write) and a +GADT to describe calculations such as:

+ +
val decode : decode_state -> string ->
+  decode_state
+   * [ `Read of int
+     | `Skip of int
+     | `Header of Header.t ] option
+   * Header.Extended.t option
+(** [decode state data] returns a new state and what the user should do next:
+    - [`Skip n] skip [n] bytes
+    - [`Read n] read [n] bytes
+    - [`Header hdr] do something according to the last header extracted
+      (like stream-out the contents of a file). *)
+
+type ('a, 'err) t =
+  | Really_read : int -> (string, 'err) t
+  | Read : int -> (string, 'err) t
+  | Seek : int -> (unit, 'err) t
+  | Bind : ('a, 'err) t * ('a -> ('b, 'err) t) -> ('b, 'err) t
+  | Return : ('a, 'err) result -> ('a, 'err) t
+  | Write : string -> (unit, 'err) t
+
+
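
To make the GADT more concrete, here is a sketch of an interpreter for it over an in-memory string. It is not the interpreter shipped with ocaml-tar: the `mem` record, the `error` type and the reading of Seek as a relative, forward skip are all assumptions made for this example, and sizes are assumed to be non-negative.

```ocaml
type ('a, 'err) t =
  | Really_read : int -> (string, 'err) t
  | Read : int -> (string, 'err) t
  | Seek : int -> (unit, 'err) t
  | Bind : ('a, 'err) t * ('a -> ('b, 'err) t) -> ('b, 'err) t
  | Return : ('a, 'err) result -> ('a, 'err) t
  | Write : string -> (unit, 'err) t

(* Hypothetical "system": a source string, a cursor, an output buffer. *)
type mem = { src : string; mutable pos : int; out : Buffer.t }
type error = Unexpected_eof

let take m n =
  let s = String.sub m.src m.pos n in
  m.pos <- m.pos + n;
  s

let rec run : type a. mem -> (a, error) t -> (a, error) result =
 fun m -> function
  | Return r -> r
  | Read n ->
    (* May return fewer than [n] bytes. *)
    Ok (take m (max 0 (min n (String.length m.src - m.pos))))
  | Really_read n ->
    (* Exactly [n] bytes, or an error. *)
    if m.pos + n > String.length m.src then Error Unexpected_eof
    else Ok (take m n)
  | Seek n ->
    (* Assumption: relative skip, clamped to the end of the source. *)
    m.pos <- min (String.length m.src) (m.pos + n);
    Ok ()
  | Write s -> Buffer.add_string m.out s; Ok ()
  | Bind (x, f) ->
    (match run m x with
     | Ok v -> run m (f v)
     | Error e -> Error e)

let ( let* ) x f = Bind (x, f)

(* A description of a computation: nothing happens until [run]. *)
let program =
  let* magic = Really_read 5 in
  let* () = Seek 1 in
  let* rest = Read 100 in
  let* () = Write (magic ^ "/" ^ rest) in
  Return (Ok (String.length rest))

let m = { src = "ustar tail"; pos = 0; out = Buffer.create 16 }
let result = run m program
```

The value `program` only describes what to read, skip and write; a different `run` could execute the same description against a file, a block device or a network flow.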

However, and this is where we come back to OCaml's limitations and where +functors could help us: higher kinded polymorphism!

+

Higher kinded Polymorphism

+

If we return to our functor example above, there's one element that may be of +interest: T with type t = T.t succ

+

In other words, we add a constraint to a type in a signature. A constraint often seen +with Mirage (but deprecated now according to this issue) is the +type io and its constraint: type 'a io, with type 'a io = 'a Lwt.t.
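
As an illustration of such a constraint, here is a sketch that avoids any dependency on Lwt by using the identity type instead. The names `FLOW` and `Direct` are hypothetical and exist only for this example.

```ocaml
(* A signature that abstracts over the "io" type of the system. *)
module type FLOW = sig
  type 'a io
  type t
  val read : t -> string io
end

(* The constraint [with type 'a io = 'a] reveals that, for this
   implementation, an io value is just the value itself. With Lwt
   one would write [with type 'a io = 'a Lwt.t] instead. *)
module Direct : FLOW with type 'a io = 'a and type t = string = struct
  type 'a io = 'a
  type t = string
  let read s = s
end

(* Thanks to the constraint, the result can be used as a plain string. *)
let s : string = Direct.read "tar"
```

Without the constraint, `Direct.read "tar"` would have the abstract type `string Direct.io` and could not be used as a string.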

+

So we had this type in Tar. The problem is that our GADT can't understand that +sometimes it will have to manipulate Lwt values, sometimes Async or +sometimes Eio (or Miou!). In other words: how do we compose our Bind with +the Bind of these targets? The difficulty lies above all in history. +Supporting this library requires us to assume a certain compatibility with +applications over which we have no control. What's more, we need to maintain +support for all of these libraries without imposing one.

+
+

A small digression at this stage seems important to us, as we've been working +in this way for over 10 years. Of course, despite all the solutions mentioned +above, not depending on a system (and/or a scheduler) also allows us to ensure +the existence of libraries like Tar over more than a decade! The OCaml ecosystem +is changing, and choosing this or that library to facilitate the development of +an application has implications we might regret 10 years down the line (for +example... Cstruct.t!). So, it can be challenging to ensure compatibility with +all systems, but the result is libraries steeped in the experience and know-how +of many developers!

+
+

So, and this is why we talk about Higher Kinded Polymorphism, how do we abstract +the t from 'a t (to replace it with Lwt.t or even with a type such as +type 'a t = 'a)? This is where we're going to use the trick explained in +this paper. The trick is to consider a "new type" that will represent our +monad (Lwt or Async) and inject/project a value from this monad to something +understandable by our GADT: High : ('a, 't) io -> ('a, 'err, 't) t.

+
type ('a, 't) io
+
+type ('a, 'err, 't) t =
+  | Really_read : int -> (string, 'err, 't) t
+  | Read : int -> (string, 'err, 't) t
+  | Seek : int -> (unit, 'err, 't) t
+  | Bind : ('a, 'err, 't) t * ('a -> ('b, 'err, 't) t) -> ('b, 'err, 't) t
+  | Return : ('a, 'err) result -> ('a, 'err, 't) t
+  | Write : string -> (unit, 'err, 't) t
+  | High : ('a, 't) io -> ('a, 'err, 't) t
+
+

Next, we need to create this new type according to the chosen scheduler. Let's +take Lwt as an example:

+
module Make (X : sig type 'a t end) = struct
+  type t (* our new type *)
+  type 'a s = 'a X.t
+  
+  external inj : 'a s -> ('a, t) io = "%identity"
+  external prj : ('a, t) io -> 'a s = "%identity"
+end
+
+module L = Make(Lwt)
+
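+open Lwt.Infix
+
+let rec run
+  : type a err. (a, err, L.t) t -> (a, err) result Lwt.t
+  = function
+  | High v -> L.prj v >|= fun v -> Ok v (* project back to Lwt, then wrap in Ok *)
+  | Return v -> Lwt.return v
+  | Bind (x, f) ->
+    (run x >>= function
+     | Ok value -> run (f value)
+     | Error err -> Lwt.return (Error err)) (* stop at the first error *)
+  | _ -> ...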
+
+

So, as you can see, it's a real trick, and not one to try at home without +supervision. Indeed, the use of %identity corresponds to an Obj.magic! So even +if the io type is exposed (to let the user derive Tar for their own system), +this trick is not exposed to other packages, and we instead suggest helpers +such as:

+
val lwt : 'a Lwt.t -> ('a, 'err, lwt) t
+val miou : 'a -> ('a, 'err, miou) t
+
+

But this way, Tar can always be derived from another system, and the process for +extracting entries from a Tar file is the same for all systems!
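
To see the whole trick work without installing a scheduler, here is a self-contained sketch where the "scheduler" is a toy one: a computation is a thunk of type unit -> 'a. The `Thunk` module, the `thunk` helper and the reduced GADT (only Bind, Return and High) are assumptions made for this example; only `io`, `Make`, `inj` and `prj` mirror the code above.

```ocaml
type ('a, 't) io

type ('a, 'err, 't) t =
  | Bind : ('a, 'err, 't) t * ('a -> ('b, 'err, 't) t) -> ('b, 'err, 't) t
  | Return : ('a, 'err) result -> ('a, 'err, 't) t
  | High : ('a, 't) io -> ('a, 'err, 't) t

module Make (X : sig type 'a t end) = struct
  type t (* the "brand" standing for the type constructor X.t *)
  type 'a s = 'a X.t

  (* Both are identity at runtime; the brand [t] keeps them sound
     as long as inj/prj are only used together. *)
  external inj : 'a s -> ('a, t) io = "%identity"
  external prj : ('a, t) io -> 'a s = "%identity"
end

(* Toy scheduler: a computation is a suspended function. *)
module Thunk = struct type 'a t = unit -> 'a end
module T = Make (Thunk)

(* Helper in the spirit of [lwt] and [miou] above. *)
let thunk : (unit -> 'a) -> ('a, 'err, T.t) t = fun f -> High (T.inj f)

let rec run : type a err. (a, err, T.t) t -> (a, err) result = function
  | High v -> Ok ((T.prj v) ())   (* project back to a thunk, run it *)
  | Return r -> r
  | Bind (x, f) ->
    (match run x with
     | Ok v -> run (f v)
     | Error e -> Error e)

let program =
  Bind (thunk (fun () -> 20), fun x ->
  Bind (thunk (fun () -> x + 1), fun y ->
  Return (Ok (2 * y))))

let result : (int, string) result = run program
```

The GADT never mentions `Thunk.t`: it only carries the brand `T.t`, and `run` is the single place where the brand is turned back into the concrete type.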

+

Conclusion

+

This Tar release isn't as impressive as this article, but it does sum up all the +work we've been able to do over the last few months and years. We hope that our +work is appreciated and that this article, which sets out all the thoughts we've +had (and still have), helps you to better understand our work!

+ +
+ +
+ + + + diff --git a/feed.xml b/feed.xml index ebc72e4..e32537e 100644 --- a/feed.xml +++ b/feed.xml @@ -1 +1 @@ -Robur's bloghttps://blog.robur.coopThe Robur cooperative blogyocamlteam@robur.coopTesting MirageVPN against OpenVPN™https://blog.robur.coop/articles/miragevpn-testing.htmlWed, 26 Jun 2024 10:00:00 GMTSome notes about how we test MirageVPN against OpenVPN™https://blog.robur.coop/articles/miragevpn-testing.htmlqubes-miragevpn, a MirageVPN client for QubesOShttps://blog.robur.coop/articles/qubes-miragevpn.htmlMon, 24 Jun 2024 10:00:00 GMTA new OpenVPN client for QubesOShttps://blog.robur.coop/articles/qubes-miragevpn.htmlMirageVPN serverhttps://blog.robur.coop/articles/miragevpn-server.htmlMon, 17 Jun 2024 10:00:00 GMTAnnouncement of our MirageVPN server.https://blog.robur.coop/articles/miragevpn-server.htmlSpeeding up MirageVPN and use it in the wildhttps://blog.robur.coop/articles/miragevpn-performance.htmlTue, 16 Apr 2024 10:00:00 GMTPerformance engineering of MirageVPN, speeding it up by a factor of 25.https://blog.robur.coop/articles/miragevpn-performance.htmlGPTarhttps://blog.robur.coop/articles/gptar.htmlWed, 21 Feb 2024 10:00:00 GMTHybrid GUID partition table and tar archivehttps://blog.robur.coop/articles/gptar.htmlSpeeding elliptic curve cryptographyhttps://blog.robur.coop/articles/speeding-ec-string.htmlTue, 13 Feb 2024 10:00:00 GMTHow we improved the performance of elliptic curves by only modifying the underlying byte arrayhttps://blog.robur.coop/articles/speeding-ec-string.htmlCooperation and Lwt.pausehttps://blog.robur.coop/articles/lwt_pause.htmlSun, 11 Feb 2024 10:00:00 GMTA disgression about Lwt and Miouhttps://blog.robur.coop/articles/lwt_pause.htmlPython's `str.__repr__()`https://blog.robur.coop/articles/2024-02-03-python-str-repr.htmlSat, 03 Feb 2024 10:00:00 GMTReimplementing Python string escaping in OCamlhttps://blog.robur.coop/articles/2024-02-03-python-str-repr.htmlMirageVPN updated (AEAD, 
NCP)https://blog.robur.coop/articles/miragevpn-ncp.htmlMon, 20 Nov 2023 10:00:00 GMTHow we resurrected MirageVPN from its bitrot statehttps://blog.robur.coop/articles/miragevpn-ncp.htmlMirageVPN & tls-crypt-v2https://blog.robur.coop/articles/miragevpn.htmlTue, 14 Nov 2023 10:00:00 GMTHow we implementated tls-crypt-v2 for miragevpnhttps://blog.robur.coop/articles/miragevpn.html \ No newline at end of file +Robur's bloghttps://blog.robur.coopThe Robur cooperative blogyocamlteam@robur.coopThe new Tar release, a retrospectivehttps://blog.robur.coop/articles/tar-release.htmlThu, 15 Aug 2024 10:00:00 GMTA little retrospective to the new Tar release and changeshttps://blog.robur.coop/articles/tar-release.htmlqubes-miragevpn, a MirageVPN client for QubesOShttps://blog.robur.coop/articles/qubes-miragevpn.htmlMon, 24 Jun 2024 10:00:00 GMTA new OpenVPN client for QubesOShttps://blog.robur.coop/articles/qubes-miragevpn.htmlMirageVPN serverhttps://blog.robur.coop/articles/miragevpn-server.htmlMon, 17 Jun 2024 10:00:00 GMTAnnouncement of our MirageVPN server.https://blog.robur.coop/articles/miragevpn-server.htmlSpeeding up MirageVPN and use it in the wildhttps://blog.robur.coop/articles/miragevpn-performance.htmlTue, 16 Apr 2024 10:00:00 GMTPerformance engineering of MirageVPN, speeding it up by a factor of 25.https://blog.robur.coop/articles/miragevpn-performance.htmlGPTarhttps://blog.robur.coop/articles/gptar.htmlWed, 21 Feb 2024 10:00:00 GMTHybrid GUID partition table and tar archivehttps://blog.robur.coop/articles/gptar.htmlSpeeding elliptic curve cryptographyhttps://blog.robur.coop/articles/speeding-ec-string.htmlTue, 13 Feb 2024 10:00:00 GMTHow we improved the performance of elliptic curves by only modifying the underlying byte arrayhttps://blog.robur.coop/articles/speeding-ec-string.htmlCooperation and Lwt.pausehttps://blog.robur.coop/articles/lwt_pause.htmlSun, 11 Feb 2024 10:00:00 GMTA disgression about Lwt and Miouhttps://blog.robur.coop/articles/lwt_pause.htmlPython's 
`str.__repr__()`https://blog.robur.coop/articles/2024-02-03-python-str-repr.htmlSat, 03 Feb 2024 10:00:00 GMTReimplementing Python string escaping in OCamlhttps://blog.robur.coop/articles/2024-02-03-python-str-repr.htmlMirageVPN updated (AEAD, NCP)https://blog.robur.coop/articles/miragevpn-ncp.htmlMon, 20 Nov 2023 10:00:00 GMTHow we resurrected MirageVPN from its bitrot statehttps://blog.robur.coop/articles/miragevpn-ncp.htmlMirageVPN & tls-crypt-v2https://blog.robur.coop/articles/miragevpn.htmlTue, 14 Nov 2023 10:00:00 GMTHow we implementated tls-crypt-v2 for miragevpnhttps://blog.robur.coop/articles/miragevpn.html \ No newline at end of file diff --git a/index.html b/index.html index 84ebb6f..1b187c1 100644 --- a/index.html +++ b/index.html @@ -27,15 +27,15 @@
  1. - 2024-06-26 - Testing MirageVPN against OpenVPN™
    -

    Some notes about how we test MirageVPN against OpenVPN™

    + 2024-08-15 + The new Tar release, a retrospective
    +

    A little retrospective to the new Tar release and changes

  2. diff --git a/tags/community.html b/tags/community.html index e0a0d9e..6601dcd 100644 --- a/tags/community.html +++ b/tags/community.html @@ -23,7 +23,7 @@
    Back to index - +
    diff --git a/tags/cstruct.html b/tags/cstruct.html new file mode 100644 index 0000000..4341806 --- /dev/null +++ b/tags/cstruct.html @@ -0,0 +1,41 @@ + + + + + + + + + Robur's blog + + + + + + + + +
    +

    blog.robur.coop

    +
    + The Robur cooperative blog. +
    +
    +
    Back to index + + + +

    + cstruct + 1 entry

    + +
    + +
    + + + + diff --git a/tags/functors.html b/tags/functors.html new file mode 100644 index 0000000..4c3b280 --- /dev/null +++ b/tags/functors.html @@ -0,0 +1,41 @@ + + + + + + + + + Robur's blog + + + + + + + + +
    +

    blog.robur.coop

    +
    + The Robur cooperative blog. +
    +
    +
    Back to index + + + +

    + functors + 1 entry

    + +
    + +
    + + + + diff --git a/tags/git.html b/tags/git.html index d8c9a84..fcbdb5f 100644 --- a/tags/git.html +++ b/tags/git.html @@ -23,7 +23,7 @@
    Back to index - +

    git diff --git a/tags/gpt.html b/tags/gpt.html index d313322..94dc49d 100644 --- a/tags/gpt.html +++ b/tags/gpt.html @@ -23,7 +23,7 @@
    Back to index - +

    gpt diff --git a/tags/mbr.html b/tags/mbr.html index 4fa8d4f..8d075bf 100644 --- a/tags/mbr.html +++ b/tags/mbr.html @@ -23,7 +23,7 @@
    Back to index - +
    diff --git a/tags/ocaml.html b/tags/ocaml.html index 0e0d390..fd12954 100644 --- a/tags/ocaml.html +++ b/tags/ocaml.html @@ -23,12 +23,12 @@
    Back to index - +
    diff --git a/tags/performance.html b/tags/performance.html index 5901981..0983e92 100644 --- a/tags/performance.html +++ b/tags/performance.html @@ -23,7 +23,7 @@
    Back to index - +

    performance diff --git a/tags/persistent storage.html b/tags/persistent storage.html index b134e69..0a5f3b7 100644 --- a/tags/persistent storage.html +++ b/tags/persistent storage.html @@ -23,7 +23,7 @@
    Back to index - +

    persistent storage diff --git a/tags/python.html b/tags/python.html index 356e3d6..0cd1b0d 100644 --- a/tags/python.html +++ b/tags/python.html @@ -23,7 +23,7 @@
    Back to index - +

    python diff --git a/tags/qubesos.html b/tags/qubesos.html index 1975235..b7bcbeb 100644 --- a/tags/qubesos.html +++ b/tags/qubesos.html @@ -23,7 +23,7 @@
    Back to index - +

    qubesos diff --git a/tags/scheduler.html b/tags/scheduler.html index 93eb061..43f6e93 100644 --- a/tags/scheduler.html +++ b/tags/scheduler.html @@ -23,7 +23,7 @@
    Back to index - +
    diff --git a/tags/tar.html b/tags/tar.html index d2c68fb..ab5b5fc 100644 --- a/tags/tar.html +++ b/tags/tar.html @@ -23,7 +23,7 @@
    Back to index - +

    tar diff --git a/tags/unicode.html b/tags/unicode.html index 3ef777f..496abb7 100644 --- a/tags/unicode.html +++ b/tags/unicode.html @@ -23,7 +23,7 @@
    Back to index - +

    unicode diff --git a/tags/unikernel.html b/tags/unikernel.html index 8e9c1ba..f9cd1d5 100644 --- a/tags/unikernel.html +++ b/tags/unikernel.html @@ -23,7 +23,7 @@
    Back to index - +