<!doctype html>
<html lang="en">
<head>
<meta charset="utf-8">
<meta http-equiv="x-ua-compatible" content="ie=edge">
<meta name="viewport" content="width=device-width, initial-scale=1">
<title>
Robur's blog - Git, Carton and emails
</title>
<meta name="description" content="A way to store and archive your emails">
<link type="text/css" rel="stylesheet" href="/css/hl.css">
<link type="text/css" rel="stylesheet" href="/css/style.css">
<script src="/js/hl.js"></script>
<link rel="alternate" type="application/rss+xml" href="/feed.xml" title="blog.robur.coop">
</head>
<body>
<header>
<h1>blog.robur.coop</h1>
<blockquote>
The <strong>Robur</strong> cooperative blog.
</blockquote>
</header>
<main><a href="/index.html">Back to index</a>

<article>
<h1>Git, Carton and emails</h1>
<ul class="tags-list"><li><a href="/tags.html#tag-emails">emails</a></li><li><a href="/tags.html#tag-storage">storage</a></li><li><a href="/tags.html#tag-Git">Git</a></li></ul><p>We are pleased to announce the release of Carton 1.0.0 and Cachet. You can have
|
||||||
|
an overview of these libraries in our announcement on the OCaml forum. This
|
||||||
|
article goes into more detail about the PACK format and its use for archiving
|
||||||
|
your emails.</p>
|
||||||
|
<h2 id="back-to-git-and-patches"><a class="anchor" aria-hidden="true" href="#back-to-git-and-patches"></a>Back to Git and patches</h2>
|
||||||
|
<p>In our Carton announcement, we talked about two levels of compression for Git
objects: zlib compression, and compression between objects using patches.</p>
<p>For example, if we have two blobs (two versions of a file), one containing
'A' and the other containing 'A+B', the second blob will probably be saved as a
patch that requires the contents of the first blob and adds '+B'. At a higher
level, and given the way we use Git, we can see why this second level of
compression is so interesting: in a project, we generally only add a few lines
to our files (introducing a new function) or delete a few (removing code).</p>
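<p>To make this idea concrete, here is a small OCaml sketch (purely illustrative, not
the actual representation used by Git or Carton) of a patch expressed as "copy" and
"insert" instructions, which is the spirit of the deltas stored in a PACK file. The
<code>instruction</code> type and the <code>apply</code> function are assumptions made
for this example.</p>
<pre><code class="language-ocaml">(* Illustrative only: a patch replays ranges of a source object and
   inserts the bytes that the source does not contain. *)
type instruction =
  | Copy of { off : int; len : int }  (* take [len] bytes of the source at [off] *)
  | Insert of string                  (* new bytes, absent from the source *)

let apply source patch =
  List.map
    (function
      | Copy { off; len } -> String.sub source off len
      | Insert s -> s)
    patch
  |> String.concat ""

(* Rebuilding "A+B" from the source "A": copy the source, then insert "+B". *)
let () = assert (apply "A" [ Copy { off = 0; len = 1 }; Insert "+B" ] = "A+B")
</code></pre>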
<p>However, there is a gap between what Git actually does and what we perceive.
When we think of patches in the context of Git, we usually think of the
<a href="https://opensource.janestreet.com/patdiff/">patience diff</a> or the <a href="https://www.nathaniel.ai/myers-diff/">Eugene Myers diff</a>.
While these offer the advantage of readability, in the sense of knowing what has
been added or deleted between two files, they are not necessarily optimal for
producing a <em>small</em> patch.</p>
<p>In reality, when it comes to storing these patches and transmitting them over
the network, what matters is not their readability but how precisely they capture
what two files have in common and what they do not. This is where
<a href="https://github.com/mirage/duff">duff</a> comes in.</p>
<p>This is a small library which can generate a patch between two files based on
the series of bytes common to both. We talk about 'series of bytes' here because
the elements common to our two files are not necessarily human-readable. To find
these common series of bytes, we use the <a href="https://en.wikipedia.org/wiki/Rabin_fingerprint">Rabin
fingerprint</a> algorithm: <a href="https://en.wikipedia.org/wiki/Rolling_hash">a rolling hash</a> used since time
immemorial.</p>
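<p>To give an idea of how such a rolling hash works, here is a minimal OCaml sketch
(a simplified polynomial rolling hash, not duff's actual implementation): once a
window of bytes has been hashed, sliding the window one byte to the right only
requires the byte that leaves and the byte that enters, which is what makes it
cheap to scan a whole file for windows already seen in the source.</p>
<pre><code class="language-ocaml">(* Simplified polynomial rolling hash over a window of [len] bytes (not
   duff's implementation). *)
let base = 257
let modulus = 1_000_000_007

(* Hash of the window starting at [off]. *)
let hash s ~off ~len =
  let h = ref 0 in
  for i = off to off + len - 1 do
    h := (!h * base + Char.code s.[i]) mod modulus
  done;
  !h

(* Slide the window one byte to the right: [out] leaves, [in_] enters.
   [pow] is base^(len-1) mod modulus, precomputed once for the window size. *)
let roll ~pow h ~out ~in_ =
  let h = (h - (Char.code out * pow) mod modulus + modulus) mod modulus in
  (h * base + Char.code in_) mod modulus
</code></pre>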
<h3 id="patches-and-emails"><a class="anchor" aria-hidden="true" href="#patches-and-emails"></a>Patches and emails</h3>
|
||||||
|
<p>As far as emails are concerned, it's fairly obvious that many "words" are
common to all of them. The simple word <code>From:</code> should appear in every
single email.</p>
<p>From this simple idea, we can see the impact: the headers of your emails are
broadly similar and carry more or less the same content. The idea of
<code>duff</code>, applied to your emails, is to treat every other email as a
"slightly" different version of your first one:</p>
<ol>
<li>we store a single raw email</li>
<li>and we build all your other emails as patches derived from this first one</li>
</ol>
<p>A fairly concrete example of this patch-based compression is two GitHub
notification emails: they are quite similar, particularly in their headers. Even
the content is just as similar: the HTML remains the same, only the comment
differs.</p>
<pre><code class="language-shell">$ carton diff github.01.eml github.02.eml -o patch.diff
|
||||||
|
$ du -sb github.01.eml github.02.eml patch.diff
|
||||||
|
9239 github.01.eml
|
||||||
|
9288 github.02.eml
|
||||||
|
5136 patch.diff
|
||||||
|
</code></pre>
|
||||||
|
<p>This example shows that the patch for rebuilding <code>github.02.eml</code> from
<code>github.01.eml</code> is almost half the size of the email itself. With the PACK
format, this patch will also be compressed with zlib (and we can reach ~2900
bytes, so roughly three times smaller).</p>
<h4 id="compress-and-compress"><a class="anchor" aria-hidden="true" href="#compress-and-compress"></a>Compress and compress!</h4>
|
||||||
|
<p>To put this into perspective, a compression algorithm like zlib can also reach
such a ratio (three times smaller) on its own. But zlib also needs to serialise
the Huffman tree required for compression (in the general case). What we observe
is that concatenating separately compressed emails makes it difficult to maintain
such a ratio; concatenating all the emails first and compressing the whole gives a
better ratio!</p>
<p>That's what the PACK file is all about: the aim is to be able to concatenate
these compressed emails and keep an interesting overall compression ratio. This
is the reason for the patches: they reduce the objects even further so that the
impact of the zlib <em>header</em> on all our objects is minimal and, above
all, so that we can access objects <strong>without</strong> having to decompress the
previous ones (as we would have to do for a <code>*.tar.gz</code> archive, for example).</p>
<p>The initial intuition about emails was right: they do indeed share quite a few
elements, and in the end we were able to save ~4000 bytes in our GitHub
notification example.</p>
<h2 id="isomorphism-dkim-and-arc"><a class="anchor" aria-hidden="true" href="#isomorphism-dkim-and-arc"></a>Isomorphism, DKIM and ARC</h2>
|
||||||
|
<p>One attribute that we wanted to pay close attention to throughout our
experiments was "isomorphism". This property is very simple: imagine a function
that takes an email as input and transforms it into another value using some
method (such as compression). Isomorphism ensures that we can 'undo' this method
and obtain exactly the same email again:</p>
<pre><code> decode(encode(x)) == x
</code></pre>
<p>This property is very important for emails because emails contain signatures,
and these signatures are derived from the email's content. If the email changes,
the signatures change too.</p>
<p>For instance, the DKIM signature allows you to sign an email and check its
integrity on receipt. ARC (which will be our next objective) also signs your
emails, but goes one step further: every relay that receives your email and
forwards it to its real destination must add a new ARC signature, just like
adding a new block to the Bitcoin blockchain.</p>
<p>So you need to make sure that the way you serialise your email (in a PACK file)
doesn't alter its content, in order to keep these signatures valid! It just so
happens that here too we have a lot of experience with Git. Git has the same
constraint with <a href="https://en.wikipedia.org/wiki/Merkle_tree">Merkle trees</a> and, for our part, we've
developed a library that generates an encoder and a decoder from a single
description and that respects the isomorphism property <em>by construction</em>: the
<a href="https://github.com/mirage/encore">encore</a> library.</p>
<p>We could then store our emails as they are in the PACK file. However, the
advantage of <code>duff</code> really comes into play when several objects are similar. In
the case of Git, tree objects resemble other trees but not commits, for example.
Emails have a similar distinction: email headers resemble other headers, but not
email contents.</p>
<p>You can therefore try to "split" emails into two parts, the header on one side
and the content on the other. We would then have a third value telling us how to
reconstruct the complete email (i.e. identifying where the header is and where
the content is).</p>
<p>However, after years of reading email RFCs, I can tell you that things are much
more complex. Above all, this experience has enabled me to synthesise a skeleton
that all emails share:</p>
<pre><code class="language-ocaml">(* multipart-body :=
|
||||||
|
[preamble CRLF]
|
||||||
|
--boundary transport-padding CRLF
|
||||||
|
part
|
||||||
|
( CRLF --boundary transport-padding CRLF part )*
|
||||||
|
CRLF
|
||||||
|
--boundary-- transport-padding
|
||||||
|
[CRLF epilogue]
|
||||||
|
|
||||||
|
part := headers ( CRLF body )?
|
||||||
|
*)
|
||||||
|
|
||||||
|
type 'octet body =
|
||||||
|
| Multipart of 'octet multipart
|
||||||
|
| Single of 'octet option
|
||||||
|
| Message of 'octet t
|
||||||
|
|
||||||
|
and 'octet part = { headers : 'octet; body : 'octet body }
|
||||||
|
|
||||||
|
and 'octet multipart =
|
||||||
|
{ preamble : string
|
||||||
|
; epilogue : string * transport_padding;
|
||||||
|
; boundary : string
|
||||||
|
; parts : (transport_padding * 'octet part) list }
|
||||||
|
|
||||||
|
and 'octet t = 'octet part
|
||||||
|
</code></pre>
|
||||||
|
<p>As you can see, the distinction is not only between the header and the content
but also between the parts of an email as soon as it has an attachment. You can
also have an email inside an email (and I'm always surprised to see how
<em>frequent</em> this particular case is). Finally, there are the annoying
<em>preamble</em> and <em>epilogue</em> of a multipart email, which are often empty but
still necessary: you always have to ensure isomorphism, because even "useless"
bytes count for the signatures.</p>
<p>We'll therefore need to serialise this structure, and all we have to do is
transform a <code>string t</code> into a <code>SHA1.t t</code> so that our structure no longer
contains the actual contents of our emails but unique identifiers referring to
those contents, which will be available in our PACK file.</p>
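<p>As an illustration (assuming the skeleton above and <code>module SHA1 = Digestif.SHA1</code>),
the same email can be represented once with its raw contents and once with hashes
pointing into the PACK file; the hypothetical values below only differ in the type of
their leaves:</p>
<pre><code class="language-ocaml">(* Hypothetical values, for illustration: the structure is identical,
   only the type of the leaves changes. *)
let raw : string part =
  { headers = "From: robur@robur.coop\r\nSubject: Hello\r\n"
  ; body = Single (Some "Hello World!\r\n") }

let packed : SHA1.t part =
  { headers = SHA1.digest_string "From: robur@robur.coop\r\nSubject: Hello\r\n"
  ; body = Single (Some (SHA1.digest_string "Hello World!\r\n")) }
</code></pre>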
<pre><code class="language-ocaml">module Format : sig
|
||||||
|
val t : SHA1.t Encore.t
|
||||||
|
end
|
||||||
|
|
||||||
|
let decode =
|
||||||
|
let parser = Encore.to_angstrom Format.t in
|
||||||
|
Angstrom.parse_string ~consume:All parser str
|
||||||
|
|
||||||
|
let encode =
|
||||||
|
let emitter = Encore.to_lavoisier Format.t in
|
||||||
|
Encore.Lavoisier.emit_string ~chunk:0x7ff t emitter
|
||||||
|
</code></pre>
|
||||||
|
<p>However, we need to check that the isomorphism is respected. You should be
aware that work on <a href="https://github.com/mirage/mrmime">Mr. MIME</a> has already been done on this subject with
the <a href="https://afl-1.readthedocs.io/en/latest/fuzzing.html">afl</a> fuzzer: checking our assertion <code>x == decode(encode(x))</code>. This
ability to check isomorphism using afl has also enabled us to use it to generate
valid random emails. This allows me to reintroduce the
<a href="https://github.com/mirage/hamlet">hamlet</a> project, perhaps the biggest database of valid (but
incomprehensible) emails. So we've checked that our encoder/decoder for
"splitting" our emails respects isomorphism on this corpus of a million emails.</p>
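<p>As a minimal sketch of what this check looks like (reusing the <code>decode</code> and
<code>encode</code> functions above; the helpers below are ours, not part of Mr. MIME or
encore), the round-trip property can be asserted over a corpus like this:</p>
<pre><code class="language-ocaml">(* Round-trip check: re-encoding a decoded email must give back the
   exact original bytes, otherwise the signatures would break. *)
let isomorphic str =
  match decode str with
  | Ok t -> String.equal (encode t) str
  | Error _ -> false

let check_corpus files =
  List.for_all
    (fun file -> isomorphic (In_channel.with_open_bin file In_channel.input_all))
    files
</code></pre>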
<h2 id="carton-pop3--mbox-some-metrics"><a class="anchor" aria-hidden="true" href="#carton-pop3--mbox-some-metrics"></a>Carton, POP3 & mbox, some metrics</h2>
|
||||||
|
<p>We can therefore split an email into several parts and calculate an optimal
patch between two similar pieces of content. So now we can start building PACK
files! This is where I reintroduce a tool that hasn't been released yet, but
which lets me go even further with emails: <a href="https://github.com/dinosaure/blaze">blaze</a>.</p>
<p>This little tool is my <em>Swiss army knife</em> for emails! And it's in this tool
that we're going to have fun adapting Carton so that it manipulates emails
rather than Git objects. So we've implemented the very basic <a href="https://en.wikipedia.org/wiki/Post_Office_Protocol">POP3</a>
protocol (thanks to <a href="https://github.com/mirleft/ocaml-tls">ocaml-tls</a> for offering encrypted
connections for free) as well as the <a href="https://en.wikipedia.org/wiki/Mbox">mbox</a> format.</p>
<p>Both are <strong>not</strong> recommended. The first is an old protocol, and interacting with
Gmail, for example, is very slow. The second is an old, non-standardised format
for storing your emails, and unfortunately it may well be the format used by your
email client. After working around a few quirks, such as the unspecified
behaviour of pop.gmail.com and the mix of CRLF and LF in the mbox format, you end
up with lots of emails that you can have fun packing!</p>
<pre><code class="language-shell">$ mkdir mailbox
|
||||||
|
$ blaze.fetch pop3://pop.gmail.com -p $(cat password.txt) \
|
||||||
|
-u recent:romain.calascibetta@gmail.com -f 'mailbox/%s.eml' > mails.lst
|
||||||
|
$ blaze.pack make -o mailbox.pack mails.lst
|
||||||
|
$ tar czf mailbox.tar.gz mailbox
|
||||||
|
$ du -sh mailbox mailbox.pack mailbox.tar.gz
|
||||||
|
97M mailbox
|
||||||
|
28M mailbox.pack
|
||||||
|
23M mailbox.tar.gz
|
||||||
|
</code></pre>
|
||||||
|
<p>In this example, we download the emails received over the last 30 days via POP3
and store them in the <code>mailbox/</code> folder. This folder weighs 97M and, if we
compress it with gzip, we end up with 23M. The problem is that we need to
decompress the whole <code>mailbox.tar.gz</code> archive to extract the emails.</p>
<p>This is where the PACK file comes in handy: it only weighs 28M (so we're very
close to what <code>tar</code> and <code>gzip</code> can do) but we can rebuild our emails without
unpacking everything:</p>
<pre><code class="language-shell">$ blaze.pack index mailbox.pack
|
||||||
|
$ blaze.pack list mailbox.pack | head -n1
|
||||||
|
0000000c 4e9795e268313245f493d9cef1b5ccf30cc92c33
|
||||||
|
$ blaze.pack get mailbox.idx 4e9795e268313245f493d9cef1b5ccf30cc92c33
|
||||||
|
Delivered-To: romain.calascibetta@gmail.com
|
||||||
|
...
|
||||||
|
</code></pre>
|
||||||
|
<p>Like Git, we now associate a hash with each of our emails and can retrieve them
using this hash. Like Git, we also compute a <code>*.idx</code> file to associate each hash
with the position of the email in our PACK file. Just like Git (with <code>git show</code>
or <code>git cat-file</code>), we can now access our emails very quickly. So we have a
(read-only) database system for our emails: we can now archive them!</p>
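<p>Conceptually (and ignoring the fanout table that speeds up the real lookup), an
<code>*.idx</code> file boils down to a sorted array of hashes with their offsets in the PACK
file, and finding an email is a binary search. Here is a rough, illustrative sketch:</p>
<pre><code class="language-ocaml">(* Illustrative sketch of an index lookup: [idx] is sorted by hash and
   maps every hash to the offset of the object inside the PACK file. *)
let find (idx : (string * int) array) hash =
  let rec go lo hi =
    if lo >= hi then None
    else
      let mid = (lo + hi) / 2 in
      let h, offset = idx.(mid) in
      let c = String.compare hash h in
      if c = 0 then Some offset
      else if c > 0 then go (mid + 1) hi
      else go lo mid
  in
  go 0 (Array.length idx)
</code></pre>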
<p>Let's have a closer look at this PACK file. We've developed a tool more or less
similar to <code>git verify-pack</code> which lists all the objects in our PACK file and,
above all, gives us information such as the number of bytes needed to store
these objects:</p>
<pre><code class="language-shell">$ blaze.pack verify mailbox.pack
|
||||||
|
4e9795e268313245f493d9cef1b5ccf30cc92c33 a 12 186 6257b7d4
|
||||||
|
...
|
||||||
|
517ccbc063d27dbd87122380c9cdaaadc9c4a1d8 b 666027 223 10 e8e534a6 cedfaf6dc22f3875ae9d4046ea2a51b9d5c6597a
|
||||||
|
</code></pre>
|
||||||
|
<p>It shows the hash of our object, its type (<code>a</code> for the structure of our email,
<code>b</code> for a piece of content), its position in the PACK file, the number of bytes
needed to store the object (!) and finally the depth of the patch, the checksum,
and the source needed to rebuild the patched object.</p>
<p>Here, our first object is not patched, but the next object is. Note that it
only needs 223 bytes in the PACK file. But what is the real size of this
object?</p>
<pre><code class="language-shell">$ carton get mailbox.idx 517ccbc063d27dbd87122380c9cdaaadc9c4a1d8 \
|
||||||
|
--raw --without-metadata | wc -c
|
||||||
|
2014
|
||||||
|
</code></pre>
|
||||||
|
<p>So we've gone from 2014 bytes to 223 bytes: that's almost a compression ratio
of 10! In this case, the object is the content of an email. Guess which one? A
GitHub notification! If we go back to our very first example, we saw that we
could compress with a ratio close to 2, and that we could go further with zlib
by compressing the patch too. This example lets us introduce one last feature of
PACK files: the depth.</p>
<pre><code class="language-shell">$ carton get mailbox.idx 517ccbc063d27dbd87122380c9cdaaadc9c4a1d8
|
||||||
|
kind: b
|
||||||
|
length: 2014 byte(s)
|
||||||
|
depth: 10
|
||||||
|
cache misses: 586
|
||||||
|
cache hits: 0
|
||||||
|
tree: 000026ab
|
||||||
|
Δ 00007f78
|
||||||
|
...
|
||||||
|
Δ 0009ef74
|
||||||
|
Δ 000a29ab
|
||||||
|
...
|
||||||
|
</code></pre>
|
||||||
|
<p>In our example, our object requires a source which, in turn, is a patch
requiring another source, and so on (you can see this chain in the <code>tree</code>
output). The length of this patch chain corresponds to the depth of our object.
There is therefore a succession of patches between objects. What Carton tries to
do is find the best patch from a window of candidates and keep the best
candidates for reuse. If we unroll this chain of patches, we find a "base"
object (at <code>0x000026ab</code>) that is simply compressed with zlib, and this base is
itself the content of another GitHub notification email. This shows that Carton
does a good job of finding the best candidate for a patch: similar content, in
this case another GitHub notification.</p>
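<p>One way to picture the depth (a sketch with made-up types, not Carton's API):
rebuilding an object means walking down its patch chain until a base object is
found, then replaying each patch on the way back up. The deeper the chain, the more
work each access requires.</p>
<pre><code class="language-ocaml">(* Illustrative only: an object in the PACK file is either a base
   (plain zlib-compressed data) or a patch against another offset. *)
type entry =
  | Base of string
  | Patch of { source : int; patch : string }

(* [load] reads the entry at an offset, [apply] replays a patch on its
   source; the recursion depth is exactly the "depth" reported above. *)
let rec rebuild ~load ~apply offset =
  match load offset with
  | Base raw -> raw
  | Patch { source; patch } -> apply (rebuild ~load ~apply source) patch
</code></pre>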
<p>The idea is to sacrifice a little computing time (in the reconstruction of
objects via their patches) to gain in compression ratio. It's fair to say that
a very long patch chain can degrade performance; however, there is a limit in
both Git and Carton: a chain can't be longer than 50. Another point is the search
for the candidate source of a patch, which is often physically close to the patch
(within a few bytes): reading the PACK file page by page (thanks to Cachet)
sometimes gives access to 3 or 4 objects at once, which have a good chance of
being patched together.</p>
<p>Let's take the example of Carton and a Git object:</p>
<pre><code class="language-shell">$ carton get pack-*.idx eaafd737886011ebc28e6208e03767860c22e77d
|
||||||
|
...
|
||||||
|
cache misses: 62
|
||||||
|
cache hits: 758
|
||||||
|
tree: 160720bb
|
||||||
|
Δ 160ae4bc
|
||||||
|
Δ 160ae506
|
||||||
|
Δ 160ae575
|
||||||
|
Δ 160ae5be
|
||||||
|
Δ 160ae5fc
|
||||||
|
Δ 160ae62f
|
||||||
|
Δ 160ae667
|
||||||
|
Δ 160ae6a5
|
||||||
|
Δ 160ae6db
|
||||||
|
Δ 160ae72a
|
||||||
|
Δ 160ae766
|
||||||
|
Δ 160ae799
|
||||||
|
Δ 160ae81e
|
||||||
|
Δ 160ae858
|
||||||
|
Δ 16289943
|
||||||
|
</code></pre>
|
||||||
|
<p>We can see here that we had to load 62 pages, but also that we reused pages we
had already read 758 times. We can also see that the offsets of the patches
(visible in the <code>tree</code> output) are always close together: the objects often
follow one another.</p>
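<p>To explain these counters, here is an illustrative page cache in OCaml (a sketch
under our own assumptions, not Cachet's actual API): the PACK file is read by
fixed-size pages, recently used pages are kept around, and objects that sit close
together end up being served from pages that are already loaded.</p>
<pre><code class="language-ocaml">(* Illustrative page cache: a hit means the page was already loaded,
   a miss means we had to read it from the PACK file. *)
let page_size = 4096

module Cache = Map.Make (Int)

type 'fd t =
  { fd : 'fd
  ; read_page : 'fd -> int -> string  (* reads [page_size] bytes at a page index *)
  ; mutable pages : string Cache.t
  ; mutable hits : int
  ; mutable misses : int }

let load t ~offset =
  let page = offset / page_size in
  match Cache.find_opt page t.pages with
  | Some bytes -> t.hits <- t.hits + 1; bytes
  | None ->
      t.misses <- t.misses + 1;
      let bytes = t.read_page t.fd page in
      t.pages <- Cache.add page bytes t.pages;
      bytes
</code></pre>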
<h3 id="mbox-and-real-emails"><a class="anchor" aria-hidden="true" href="#mbox-and-real-emails"></a>Mbox and real emails</h3>
|
||||||
|
<p>In a way, the concrete cases we've used so far are my own emails. There may be a
fairly simple bias here, which is that all these emails have the same destination:
romain.calascibetta@gmail.com. This common point can also have a significant
impact on compression with <code>duff</code>. We will therefore try another corpus, the
archives of some mailing lists relating to OCaml:
<a href="https://github.com/ocaml/lists.ocaml.org-archive">lists.ocaml.org-archive</a></p>
<pre><code class="language-shell">$ blaze.mbox lists.ocaml.org-archive/pipermail/opam-devel.mbox/opam-devel.mbox \
|
||||||
|
-o opam-devel.pack
|
||||||
|
$ gzip -c lists.ocaml.org-archive/pipermail/opam-devel.mbox/opam-devel.mbox \
|
||||||
|
> opam-devel.mbox.gzip
|
||||||
|
$ du -sh opam-devel.pack opam-devel.mbox.gzip \
|
||||||
|
lists.ocaml.org-archive/pipermail/opam-devel.mbox/opam-devel.mbox
|
||||||
|
3.9M opam-devel.pack
|
||||||
|
2.0M opam-devel.mbox.gzip
|
||||||
|
10M lists.ocaml.org-archive/pipermail/opam-devel.mbox/opam-devel.mbox
|
||||||
|
</code></pre>
|
||||||
|
<p>The compression ratio is a bit worse than before, but we're still onto
something interesting. Here again we can take an object from our PACK file and
see how the compression between objects behaves:</p>
<pre><code class="language-shell">$ blaze.pack index opam-devel.pack
|
||||||
|
...
|
||||||
|
09bbd28303c8aafafd996b56f9c071a3add7bd92 b 362504 271 10 60793428 412b1fbeb6ee4a05fe8587033c1a1d8ca2ef5b35
|
||||||
|
$ carton get opam-devel.idx 09bbd28303c8aafafd996b56f9c071a3add7bd92 \
|
||||||
|
--without-metadata --raw | wc -c
|
||||||
|
2098
|
||||||
|
</code></pre>
|
||||||
|
<p>Once again, we get a strong ratio: 2098 bytes stored in just 271 bytes. Here
the object corresponds to the header of an email, which is patched against other
email headers. This is the situation where fields are common to all your emails
(<code>From</code>, <code>Subject</code>, etc.). Carton successfully patches headers with headers and
email contents with contents.</p>
<h2 id="next-things"><a class="anchor" aria-hidden="true" href="#next-things"></a>Next things</h2>
|
||||||
|
<p>All the work done on email archiving is aimed at producing a unikernel (<code>void</code>)
that can archive all incoming emails. This unikernel could then send the archive
back (via an email!) to whoever wants it. This is one of our goals for
implementing a mailing list in OCaml with unikernels.</p>
<p>Another objective is to create a database system for emails in order to offer
two features to the user that we consider important:</p>
<ul>
<li>quick and easy access to emails</li>
<li>save disk space through compression</li>
</ul>
<p>With this system, we can extend the way emails are indexed with other
information, such as the keywords found in them. This will enable us, among
other things, to build an email search engine!</p>
<h2 id="conclusion"><a class="anchor" aria-hidden="true" href="#conclusion"></a>Conclusion</h2>
|
||||||
|
<p>This milestone in our PTT project took quite a long time, as we paid close
attention to metrics such as compression ratio and execution speed.</p>
<p>The experience we'd gained with emails (in particular with Mr. MIME) enabled us
to move a little faster, especially when it came to serialising our emails. Our
experience with ocaml-git also enabled us to identify the benefits of the PACK
file for emails.</p>
<p>The development of <a href="https://github.com/robur-coop/miou">Miou</a> was also particularly helpful with regard to
execution time, thanks to the ability to parallelise certain computations quite
easily.</p>
<p>The format is still a little rough and not quite ready for the development of a
keyword-based email indexing system, but it provides a good basis for the rest
of our project.</p>
<p>So, if you like what we're doing and want to help, you can make a donation via
<a href="https://github.com/sponsors/robur-coop">GitHub</a> or using our <a href="https://robur.coop/Donate">IBAN</a>.</p>
</article>

</main>
<footer>
<a href="https://github.com/xhtmlboi/yocaml">Powered by <strong>YOCaml</strong></a>
<br />
</footer>
<script>hljs.highlightAll();</script>
</body>
</html>