From 8d9c397d1a06427ceb833a8fd0ae87675ce1bc52 Mon Sep 17 00:00:00 2001
From: The Robur team <team@robur.coop>
Date: Fri, 17 Jan 2025 16:38:10 +0000
Subject: [PATCH] Pushed by YOCaml 2 from
 d74e429ba7767645a0da0af3cd39e28920eb6d82

---
 articles/2025-01-07-carton-and-cachet.html | 114 +++++++++++++++++++++
 1 file changed, 114 insertions(+)
diff --git a/articles/2025-01-07-carton-and-cachet.html b/articles/2025-01-07-carton-and-cachet.html
index 9174e2b..75eae89 100644
--- a/articles/2025-01-07-carton-and-cachet.html
+++ b/articles/2025-01-07-carton-and-cachet.html
@@ -378,6 +378,120 @@ keyword-based e-mail indexing system, but it provides a good basis for the rest
 of our project.</p>
 <p>So, if you like what we're doing and want to help, you can make a donation via
 <a href="https://github.com/sponsors/robur-coop">GitHub</a> or using our <a href="https://robur.coop/Donate">IBAN</a>.</p>
+<hr />
+<h2 id="post"><a class="anchor" aria-hidden="true" href="#post"></a>Post</h2>
+<p>This short note follows up on the feedback we received after the Carton release.
+<a href="https://discuss.ocaml.org/t/ann-release-of-carton-1-0-0-and-cachet/15953/2?u=dinosaure">nojb</a> pointed us to the <a href="https://public-inbox.org/README.html">public-inbox</a>
+software, which archives the various Linux kernel mailing lists. It
+is based on the same intuition we had, namely to use the PACK format to archive
+emails.</p>
+<p>The question then arises: are we reinventing the wheel?</p>
+<p>In truth, the devil is in the detail. As it happens, you can download LKML
+mailing list archives with Git in this way:</p>
+<pre><code class="language-shell">$ git clone --mirror http://lore.kernel.org/lkml/15 lkml/git/15.git
+$ cd lkml/git/15.git
+$ du -sh objects/pack/pack-*.pack
+981M	objects/pack/pack-*.pack
+$ cd objects/pack/
+$ mkdir loose
+$ carton explode 'loose/%s/%s' pack-*.pack
+$ du -sh loose/c/
+2.7G	loose/c
+</code></pre>
+<p><code>public-inbox</code> is based not only on the PACK format for email archiving, but
+also on Git concepts. Such a Git repository actually contains only
+one file, <code>m</code>, which corresponds to the last email received on the mailing list. The
+other emails are &quot;old versions of this email&quot;. In other words, <code>public-inbox</code>
+treats emails as successive <em>versions</em> of a single file: each commit is a new
+email and &quot;replaces&quot; the previous one.</p>
+<h3 id="heuristics-to-patch"><a class="anchor" aria-hidden="true" href="#heuristics-to-patch"></a>Heuristics to patch</h3>
+<p><code>public-inbox</code> then relies on the heuristics implemented by Git to find the best
+candidate for patching emails. These heuristics are explained
+<a href="https://github.com/git/git/blob/master/Documentation/technical/pack-heuristics.txt">here</a>. The idea is to consider a base object (which will be the
+source of several patches) as the <strong>last</strong> version of your file (in the case of
+<code>public-inbox</code>, the last email received) and build patches of previous versions
+against this base object. The heuristic rests on the intuition that source files
+tend to grow over time. The latest version is
+therefore the one most likely to contain what it has in common with previous versions.</p>
+<p>Put differently, when it comes to code, we tend to add more than we remove. So we should be able to
+reuse the content already present in the latest version of a file to produce
+patches for earlier versions.</p>
+<h3 id="comparison"><a class="anchor" aria-hidden="true" href="#comparison"></a>Comparison</h3>
+<p>Let's have some fun comparing <code>public-inbox</code> and the <code>blaze</code> tool:</p>
+<pre><code class="language-markdown">            +-------+--------------+------+
+            | blaze | public-inbox |  raw |
++-----------+-------+--------------+------+
+| caml-list |  160M |         154M | 425M |
++-----------+-------+--------------+------+
+| lkml.15   |  1.1G |         981M | 2.7G |
++-----------+-------+--------------+------+
+| kvm.0     |  1.2G |         1.1G | 3.1G |
++-----------+-------+--------------+------+
+</code></pre>
+<p>The first thing you'll notice is that <code>blaze</code> produces PACK files that are a
+little larger than those produced by Git. One reason is that <code>blaze</code> doesn't
+store exactly the same thing! The emails it stores have lines ending
+in <code>\r\n</code>, whereas <code>public-inbox</code> stores emails with <code>\n</code>. It may be just one
+extra byte, but multiplied by the number of lines in an email and the
+number of emails in the archive, it adds up.</p>
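+<p>To get a feel for this overhead, here is a minimal sketch (the file names and
+the sample message are made up for illustration; this is not how <code>blaze</code> or
+<code>public-inbox</code> measure anything) that stores the same three-line message with
+both line endings and compares the byte counts:</p>
+<pre><code class="language-shell"># Hypothetical sample message, written with CRLF line endings.
+printf 'Subject: hello\r\n\r\nbody\r\n' &gt; crlf.eml
+# The same message with LF-only line endings.
+tr -d '\r' &lt; crlf.eml &gt; lf.eml
+# Print both sizes in bytes.
+wc -c crlf.eml lf.eml
+</code></pre>
+<p>The CRLF copy is exactly one byte per line larger than the LF copy.</p>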
+<p>It's also true that <a href="https://github.com/mirage/decompress">decompress</a>, the OCaml implementation of zlib,
+is not as efficient as its C counterpart in terms of compression ratio. This is a
+disadvantage for us, but it is not linked to the way we generate the PACK file
+(we could replace <code>decompress</code> with zlib!).</p>
+<p>However, there's another interesting metric between what we produce and what
+<code>public-inbox</code> does. It's important to understand that we maintain &quot;some
+compatibility&quot; with the Git PACK file. The objects aren't the same and don't
+have the same meaning, but it's still a PACK file. As such, we can use <code>git verify-pack</code> on our archive as on the <code>public-inbox</code> archive:</p>
+<pre><code class="language-markdown">            +-----------------+------------------------+
+            | PACK from blaze | PACK from public-inbox |
++-----------+-----------------+------------------------+
+| caml-list |           ~2.5s |                  ~4.1s |
++-----------+-----------------+------------------------+
+| lkml.15   |          ~14.7s |                 ~16.3s |
++-----------+-----------------+------------------------+
+| kvm.0     |            ~18s |                   ~21s |
++-----------+-----------------+------------------------+
+</code></pre>
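+<p>As a rough sketch of how such a measurement can be taken (the <code>demo</code> repository,
+its single commit and the use of the shell's <code>time</code> keyword are assumptions for
+illustration, not the exact setup behind the numbers above), <code>git verify-pack</code>
+takes the <code>.idx</code> file that sits next to the PACK file:</p>
+<pre><code class="language-shell"># Build a throwaway repository holding one email in a file named m.
+git init -q demo
+cd demo
+printf 'Subject: hello\n\nbody\n' &gt; m
+git add m
+git -c user.name=demo -c user.email=demo@example.org commit -q -m 'first email'
+# Pack all objects into a single PACK file.
+git repack -a -d -q
+# Check the PACK file through its index and time the operation.
+time git verify-pack .git/objects/pack/pack-*.idx
+</code></pre>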
+<p>Verifying our PACK file is faster than verifying the one from <code>public-inbox</code>. This is
+where we need to understand what we're trying to store and how we're doing it.
+When it comes to finding a candidate for a patch, <code>blaze</code> relies solely on the
+similarities between two objects/emails, whereas <code>public-inbox</code>,
+via Git heuristics, will still prioritize a patch between emails that follow
+each other in time via &quot;versioning&quot;.</p>
+<p>The implication is that the last 2 emails may have no similarity at all, but
+Git/<code>public-inbox</code> will still try to patch them together, as one is the
+<em>previous version</em> (in terms of time) of the other.</p>
+<p>Another aspect is that Git sometimes breaks <em>the patch chain</em> so that, when
+extracting an object that is a patch, its source isn't too far
+away in the PACK file. Git prefers to patch an object with a source that may be
+a weaker match but close to it, rather than keeping the best candidate as the source
+for all patches. Here too, <code>blaze</code> behaves differently: we try to keep and reuse
+the best candidate as much as possible.</p>
+<p>A final difference, which may also be important, is the way in which emails are
+stored. <code>blaze</code> stores emails &quot;split&quot; into parts, whereas <code>public-inbox</code> stores
+them as they are. One practical consequence concerns the extraction of an
+attachment. With <code>blaze</code>, we would just have to extract the
+<em>skeleton</em> of the email, search in the various headers for the desired
+attachment and extract the attachment as is, since it is a full-fledged object in
+our PACK file.</p>
+<p>As for <code>public-inbox</code>, we'd have to extract the email, <strong>parse</strong> the email, then
+search for the part containing the attachment according to the header and
+finally extract the attachment.</p>
+<h3 id="conclusion-1"><a class="anchor" aria-hidden="true" href="#conclusion-1"></a>Conclusion</h3>
+<p>If we had to draw a &quot;meta&quot; conclusion from the differences between <code>blaze</code> and
+<code>public-inbox</code>, it's that our tool focuses on the content of your emails,
+whereas <code>public-inbox</code> focuses on the history of your emails. As such, and
+in the hope of making an OCaml-based email client, we believe our approach
+remains interesting.</p>
+<p>But these experiments have shown us 2 important things:</p>
+<ul>
+<li>we're capable of handling millions of emails, parsing and storing them. It's
+pretty impressive to see our tool handle almost a million emails (<code>kvm.0</code>)
+without any bugs!</li>
+<li>the second thing is that our initial intuition remains valid. Even if the path
+seems subtly different from what <code>public-inbox</code> can do, our approach is
+clearly the right one and keeps us going.</li>
+</ul>
 
     </article>