Pushed by YOCaml 2 from d74e429ba7

2025-01-17 16:38:10 +00:00 · 2025-01-17 16:38:10 +00:00 · 8d9c397d1a
commit 8d9c397d1a
parent 3bd2699da3
1 changed files with 114 additions and 0 deletions
--- a/articles/2025-01-07-carton-and-cachet.html
+++ b/articles/2025-01-07-carton-and-cachet.html
 @@ -378,6 +378,120 @@ keyword-based e-mail indexing system, but it provides a good basis for the rest
 of our project.</p>
 <p>So, if you like what we're doing and want to help, you can make a donation via
 <a href="https://github.com/sponsors/robur-coop">GitHub</a> or using our <a href="https://robur.coop/Donate">IBAN</a>.</p>
 <hr />
 <h2 id="post"><a class="anchor" aria-hidden="true" href="#post"></a>Post</h2>
 <p>This short note follows up on the feedback we got on the Carton release.
 <a href="https://discuss.ocaml.org/t/ann-release-of-carton-1-0-0-and-cachet/15953/2?u=dinosaure">nojb</a>, in this case, pointed to the <a href="https://public-inbox.org/README.html">public-inbox</a>
 software as the archiver of the various Linux kernel mailing lists. The latter
 is based on the same intuition we had, namely to use the PACK format to archive
 emails.</p>
 <p>The question then arises: are we reinventing the wheel?</p>
 <p>In truth, the devil is in the detail. As it happens, you can download LKML
 mailing list archives with Git in this way:</p>
 <pre><code class="language-shell">$ git clone --mirror http://lore.kernel.org/lkml/15 lkml/git/15.git
 $ cd lkml/git/15.git
 $ du -sh objects/pack/pack-*.pack
 981M	objects/pack/pack-*.pack
 $ cd objects/pack/
 $ mkdir loose
 $ carton explode 'loose/%s/%s' pack-*.pack
 $ du -sh loose/c/
 2.7G	loose/c
 </code></pre>
 <p><code>public-inbox</code> builds not only on the PACK format for email archiving, but
 also on Git concepts. Such a Git repository actually contains a single
 <code>m</code> file corresponding to the last email received on the mailing list. The
 other e-mails are &quot;old versions of this e-mail&quot;. In other words, <code>public-inbox</code>
 treats successive emails as <em>versions</em> of the same file: each commit is a new
 email and &quot;replaces&quot; the previous one.</p>
 <h3 id="heuristics-to-patch"><a class="anchor" aria-hidden="true" href="#heuristics-to-patch"></a>Heuristics to patch</h3>
 <p><code>public-inbox</code> then relies on the heuristics implemented by Git to find the best
 candidate for patching emails. These heuristics are explained
 <a href="https://github.com/git/git/blob/master/Documentation/technical/pack-heuristics.txt">here</a>. The idea is to take the <strong>last</strong> version of your file (in the case of
 <code>public-inbox</code>, the last email received) as the base object, which will be the
 source of several patches, and to build the patches for previous versions
 against it. The heuristic rests on the observation that source files tend to
 grow over time. The latest version is therefore the most likely to contain
 what it has in common with previous versions.</p>
 <p>Put differently, when it comes to code, we tend to add code. So the content
 available in the latest version of a file should be enough to produce
 patches for earlier versions.</p>
 <h3 id="comparison"><a class="anchor" aria-hidden="true" href="#comparison"></a>Comparison</h3>
 <p>Let's have some fun comparing <code>public-inbox</code> and the <code>blaze</code> tool:</p>
 <pre><code class="language-markdown">            +-------+--------------+------+
            | blaze | public-inbox |  raw |
 +-----------+-------+--------------+------+
 | caml-list |  160M |         154M | 425M |
 +-----------+-------+--------------+------+
 | lkml.15   |  1.1G |         981M | 2.7G |
 +-----------+-------+--------------+------+
 | kvm.0     |  1.2G |         1.1G | 3.1G |
 +-----------+-------+--------------+------+
 </code></pre>
 <p>The first thing you'll notice is that <code>blaze</code> produces PACK files that are a
 little larger than those produced by Git. One reason is that <code>blaze</code> doesn't
 store exactly the same thing! The emails it stores have lines ending
 in <code>\r\n</code>, whereas <code>public-inbox</code> stores emails with <code>\n</code>. It may be just one
 extra byte, but multiplied by the number of lines in an email and the
 number of emails in the archive, it adds up.</p>
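 <p>To get a feel for that overhead, here is a minimal Python sketch. It is not part
 of <code>blaze</code> or <code>public-inbox</code>: the helper name <code>crlf_overhead</code> and the sample
 message are made up for illustration. It only measures the uncompressed
 difference, so the effect on a compressed PACK file is likely to differ:</p>
 <pre><code class="language-python">def crlf_overhead(lf_message):
     """Return how many bytes the CRLF form adds over the LF form."""
     crlf_message = lf_message.replace(b"\n", b"\r\n")
     return len(crlf_message) - len(lf_message)
 
 
 # Hypothetical five-line message using LF line endings.
 sample = b"From: a@example.org\nSubject: hello\n\nbody line 1\nbody line 2\n"
 
 # One extra byte per line: the overhead equals the number of line endings.
 print(crlf_overhead(sample))  # 5
 </code></pre>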
 <p>It's also true that <a href="https://github.com/mirage/decompress">decompress</a>, the OCaml implementation of zlib,
 is not as efficient as its C counterpart in terms of compression ratio. So this is a
 disadvantage we have, but it is not linked to the way we generate the PACK file
 (we could replace <code>decompress</code> with zlib!).</p>
 <p>However, there's another interesting metric for comparing what we produce with what
 <code>public-inbox</code> produces. It's important to understand that we maintain &quot;some
 compatibility&quot; with the Git PACK file. The objects aren't the same and don't
 have the same meaning, but it's still a PACK file. As such, we can run <code>git verify-pack</code> on our archive as well as on the <code>public-inbox</code> archive:</p>
 <pre><code class="language-markdown">            +-----------------+------------------------+
            | PACK from blaze | PACK from public-inbox |
 +-----------+-----------------+------------------------+
 | caml-list |           ~2.5s |                  ~4.1s |
 +-----------+-----------------+------------------------+
 | lkml.15   |          ~14.7s |                 ~16.3s |
 +-----------+-----------------+------------------------+
 | kvm.0     |            ~18s |                   ~21s |
 +-----------+-----------------+------------------------+
 </code></pre>
 <p>Verifying our PACK file is faster than verifying the one from <code>public-inbox</code>. This is
 where we need to understand what we're trying to store and how we're doing it.
 When it comes to finding a candidate for a patch, <code>blaze</code> relies solely on the
 similarity between two objects/emails, whereas <code>public-inbox</code>,
 via Git heuristics, will still prioritize a patch between emails that follow
 each other in time via &quot;versioning&quot;.</p>
 <p>The implication is that the last 2 emails may have no similarity at all, but
 Git/<code>public-inbox</code> will still try to patch them together, as one is the
 <em>previous version</em> (in terms of time) of the other.</p>
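 <p>The two ways of choosing a patch source can be contrasted with a toy Python
 sketch. It implements neither Git's heuristics nor what <code>blaze</code> actually does:
 the function names, the sample emails and the use of <code>difflib</code> as a
 similarity measure are all assumptions made for illustration:</p>
 <pre><code class="language-python">from difflib import SequenceMatcher
 
 
 def similarity(a, b):
     """Ratio in [0, 1]; 1.0 means the two strings are identical."""
     return SequenceMatcher(None, a, b).ratio()
 
 
 def source_by_history(emails, i):
     """Hypothetical "versioning" policy: the next email in time is the source."""
     return i + 1 if i + 1 != len(emails) else None
 
 
 def source_by_content(emails, i):
     """Hypothetical "content" policy: the most similar other email is the source."""
     others = [j for j in range(len(emails)) if j != i]
     return max(others, key=lambda j: similarity(emails[i], emails[j]))
 
 
 # Three made-up emails in order of arrival; the second is unrelated to the others.
 emails = [
     "Subject: [PATCH] fix build\n\nThis patch fixes the build on arm64.\n",
     "Subject: lunch?\n\nAnyone up for lunch today?\n",
     "Subject: Re: [PATCH] fix build\n\nThis patch fixes the build on arm64. Thanks!\n",
 ]
 
 print(source_by_history(emails, 0))  # 1: the unrelated email that came next
 print(source_by_content(emails, 0))  # 2: the reply quoting the same text
 </code></pre>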
 <p>Another aspect is that Git sometimes breaks <em>the patch chain</em> so that, when it
 comes to extracting an object, if it's a patch, its source shouldn't be very far
 away in the PACK file. Git prefers to patch an object with a source that may be
 less good but close to it, rather than keeping the best candidate as the source
 for all patches. Here too, <code>blaze</code> behaves differently: we try to keep and reuse
 the best candidate as much as possible.</p>
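 <p>One way to picture the difference between chaining patches and sharing a single
 source is to count how many patches must be applied before an object can be
 rebuilt. The Python sketch below is a toy model: <code>chain_depth</code> and both
 layouts are hypothetical, and neither layout is claimed to match what Git or
 <code>blaze</code> really write to disk:</p>
 <pre><code class="language-python">def chain_depth(source_of, obj):
     """Count how many patches must be applied to rebuild obj."""
     depth = 0
     while source_of[obj] is not None:
         obj = source_of[obj]
         depth += 1
     return depth
 
 
 # Hypothetical layouts for four objects; None marks a base stored in full.
 chained = {"a": None, "b": "a", "c": "b", "d": "c"}  # each patch builds on the previous one
 shared = {"a": None, "b": "a", "c": "a", "d": "a"}   # every patch reuses the same source
 
 print(chain_depth(chained, "d"))  # 3
 print(chain_depth(shared, "d"))   # 1
 </code></pre>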
 <p>A final difference, which may also be important, is the way in which emails are
 stored. We store emails in &quot;split&quot; form, whereas <code>public-inbox</code> stores
 them as they are. One implication concerns the extraction of an
 attachment. With <code>blaze</code>, we would just have to extract the
 <em>skeleton</em> of the email, search its various headers for the desired
 attachment and extract that attachment as is, since it is a full-fledged object in
 our PACK file.</p>
 <p>As for <code>public-inbox</code>, we'd have to extract the email, <strong>parse</strong> the email, then
 search for the part containing the attachment according to the header and
 finally extract the attachment.</p>
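 <p>The parse-then-search route can be sketched with Python's standard <code>email</code>
 package. This is not <code>public-inbox</code> code: the helper <code>extract_attachment</code>, the
 addresses and the file name are invented for the example, and the email is
 built in place only to keep the sketch self-contained:</p>
 <pre><code class="language-python">from email import message_from_bytes, policy
 from email.message import EmailMessage
 
 
 def extract_attachment(raw, filename):
     """Parse the whole email, then return the payload of the named attachment."""
     message = message_from_bytes(raw, policy=policy.default)
     for part in message.iter_attachments():
         if part.get_filename() == filename:
             return part.get_payload(decode=True)
     return None
 
 
 # Build a small example email with one attachment.
 msg = EmailMessage()
 msg["From"] = "alice@example.org"
 msg["To"] = "bob@example.org"
 msg["Subject"] = "patch attached"
 msg.set_content("See the attached file.")
 msg.add_attachment(
     b"hello\n",
     maintype="application",
     subtype="octet-stream",
     filename="fix.patch",
 )
 
 # The full message has to be parsed before the attachment can be reached.
 print(extract_attachment(msg.as_bytes(), "fix.patch"))  # b'hello\n'
 </code></pre>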
 <h3 id="conclusion-1"><a class="anchor" aria-hidden="true" href="#conclusion-1"></a>Conclusion</h3>
 <p>If we had to draw a &quot;meta&quot; conclusion from the differences between <code>blaze</code> and
 <code>public-inbox</code>, it's that our tool focuses on the content of your emails,
 whereas <code>public-inbox</code> focuses on the history of your emails. As such, and
 in the hope of making an OCaml-based email client, we believe our approach
 remains interesting.</p>
 <p>But these experiments have shown us 2 important things:</p>
 <ul>
 <li>we're capable of parsing and storing emails at scale. It's
 pretty impressive to see our tool handle almost a million emails (<code>kvm.0</code>)
 without any bugs!</li>
 <li>our initial intuition remains valid. Even if the path
 seems subtly different from what <code>public-inbox</code> does, our approach is
 clearly the right one and keeps us going.</li>
 </ul>
    </article>