From 8d9c397d1a06427ceb833a8fd0ae87675ce1bc52 Mon Sep 17 00:00:00 2001 From: The Robur team Date: Fri, 17 Jan 2025 16:38:10 +0000 Subject: [PATCH] Pushed by YOCaml 2 from d74e429ba7767645a0da0af3cd39e28920eb6d82 --- articles/2025-01-07-carton-and-cachet.html | 114 +++++++++++++++++++++ 1 file changed, 114 insertions(+) diff --git a/articles/2025-01-07-carton-and-cachet.html b/articles/2025-01-07-carton-and-cachet.html index 9174e2b..75eae89 100644 --- a/articles/2025-01-07-carton-and-cachet.html +++ b/articles/2025-01-07-carton-and-cachet.html @@ -378,6 +378,120 @@ keyword-based e-mail indexing system, but it provides a good basis for the rest of our project.

So, if you like what we're doing and want to help, you can make a donation via GitHub or using our IBAN.


Post

This short note follows up on the feedback we received about the Carton release. nojb, in this case, pointed us to public-inbox, the software that archives the various Linux kernel mailing lists. It is based on the same intuition we had, namely to use the PACK format to archive emails.

The question then arises: are we reinventing the wheel?

In truth, the devil is in the detail. As it happens, you can download LKML mailing list archives with Git in this way:

$ git clone --mirror http://lore.kernel.org/lkml/15 lkml/git/15.git
$ cd lkml/git/15.git
$ du -sh objects/pack/pack-*.pack
981M	objects/pack/pack-*.pack
$ cd objects/pack/
$ mkdir loose
$ carton explode 'loose/%s/%s' pack-*.pack
$ du -sh loose/c/
2.7G	loose/c

public-inbox is based not only on the PACK format for email archiving, but also on Git concepts. Such a Git repository actually contains only a single m file, which corresponds to the last email received on the mailing list. The other emails are "old versions" of that file. In other words, public-inbox treats emails as successive versions of one another: each commit is a new email that "replaces" the previous one.


Heuristics to patch

public-inbox then relies on the heuristics implemented by Git to find the best candidate for patching emails. These heuristics are explained here. The idea is to take as the base object (the source of several patches) the latest version of your file (in the case of public-inbox, the last email received) and to build the patches for previous versions against this base object. The heuristic comes from the intuition that software files tend to grow over time. The latest version is therefore the most likely to contain everything it has in common with previous versions.

Once again, when it comes to code, we tend to add code. So we should be able to use all the occurrences available in the latest version of a file to produce patches for earlier versions.
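
To make this intuition concrete, here is a minimal sketch in Python. It is not Git's delta encoder nor the one used by Carton: it uses the standard library's difflib to find common runs, and the names make_delta, apply_delta and literal_bytes are made up for the illustration. A patch is a list of "copy from the base" and "insert literal bytes" instructions, and the fewer literal bytes it carries, the better the base was.

```python
from difflib import SequenceMatcher


def make_delta(base: bytes, target: bytes) -> list:
    """Describe `target` as copy/insert instructions against `base`."""
    ops = []
    matcher = SequenceMatcher(None, base, target, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            # The run already exists in the base: reference it.
            ops.append(("copy", i1, i2 - i1))
        elif tag in ("replace", "insert"):
            # The run is absent from the base: ship it literally.
            ops.append(("insert", target[j1:j2]))
        # "delete": bytes of the base that the target does not need.
    return ops


def apply_delta(base: bytes, ops: list) -> bytes:
    """Rebuild the target from the base and the instructions."""
    out = bytearray()
    for op in ops:
        if op[0] == "copy":
            _, offset, length = op
            out += base[offset:offset + length]
        else:
            out += op[1]
    return bytes(out)


def literal_bytes(ops: list) -> int:
    """Number of bytes the patch has to carry itself."""
    return sum(len(op[1]) for op in ops if op[0] == "insert")


old = b"let x = 1\n"
new = b"let x = 1\nlet y = 2\n"

# Latest version as the base: the old version is a pure copy.
older_from_newer = make_delta(new, old)
# Old version as the base: the added line must be stored literally.
newer_from_older = make_delta(old, new)
```

In this toy case, patching the old version from the new one needs no literal bytes at all, while the opposite direction has to carry the added line. That is the property the heuristic bets on when files mostly grow.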


Comparison


Let's have some fun comparing public-inbox and the blaze tool:

            +-------+--------------+------+
            | blaze | public-inbox |  raw |
+-----------+-------+--------------+------+
| caml-list |  160M |         154M | 425M |
+-----------+-------+--------------+------+
| lkml.15   |  1.1G |         981M | 2.7G |
+-----------+-------+--------------+------+
| kvm.0     |  1.2G |         1.1G | 3.1G |
+-----------+-------+--------------+------+

The first thing you'll notice is that blaze produces PACK files that are a little larger than those produced by Git. The catch is that blaze doesn't store exactly the same thing! The emails it stores have lines ending in \r\n, whereas public-inbox stores emails with lines ending in \n. It may be just one extra character, but multiplied by the number of lines in an email and the number of emails in the archive, it adds up.
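
As a rough illustration of that overhead, the following Python sketch counts the extra bytes a CRLF version of a message needs before any compression. The helper names to_crlf and crlf_overhead and the sample message are invented for the example. The effect on the final PACK size is smaller and harder to predict, since the objects are compressed afterwards.

```python
def to_crlf(message: bytes) -> bytes:
    """Normalise line endings to CRLF, whatever the input used."""
    return message.replace(b"\r\n", b"\n").replace(b"\n", b"\r\n")


def crlf_overhead(message_lf: bytes) -> int:
    """Extra bytes needed to store an LF-only message with CRLF endings."""
    return message_lf.count(b"\n")


# A hypothetical four-line message with LF line endings.
email_lf = b"From: alice@example.org\nSubject: hello\n\nSee you soon.\n"
email_crlf = to_crlf(email_lf)
```

Here the CRLF version is exactly four bytes longer, one per line, which is what crlf_overhead reports.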

It's also true that decompress, the OCaml implementation of zlib, is not as efficient as its C competitor in terms of compression ratio. So this is a disadvantage we have, but it is not linked to the way we generate the PACK file (we could replace decompress with zlib!).
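
The ratio is a property of the encoder, not of the format: two encoders can emit different, equally valid zlib streams for the same input. The sketch below only illustrates that point with Python's zlib binding, using two compression levels as a stand-in for two encoders. It says nothing about the actual gap between decompress and the C zlib, and the sample data is invented.

```python
import zlib


def deflate_ratio(data: bytes, level: int) -> float:
    """Compressed size divided by original size (lower is better)."""
    return len(zlib.compress(data, level)) / len(data)


# A hypothetical, highly repetitive header line.
sample = b"Received: from mail.example.org by list.example.org\r\n" * 200

fast = zlib.compress(sample, 1)   # less effort spent searching for matches
best = zlib.compress(sample, 9)   # more effort spent searching for matches

# Both streams decode to the same bytes with the same decoder.
same_content = zlib.decompress(fast) == zlib.decompress(best) == sample
```

Any reader of the format accepts both streams, which is why swapping the encoder would change the size of the PACK file without changing how it is generated or read.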

However, there's another interesting metric to compare between what we produce and what public-inbox does. It's important to understand that we maintain "some compatibility" with the Git PACK file. The objects aren't the same and don't have the same meaning, but it's still a PACK file. As such, we can use git verify-pack on our archive just as on the public-inbox archive:

            +-----------------+------------------------+
            | PACK from blaze | PACK from public-inbox |
+-----------+-----------------+------------------------+
| caml-list |           ~2.5s |                  ~4.1s |
+-----------+-----------------+------------------------+
| lkml.15   |          ~14.7s |                 ~16.3s |
+-----------+-----------------+------------------------+
| kvm.0     |            ~18s |                   ~21s |
+-----------+-----------------+------------------------+

The analysis of our PACK file is faster than that of the public-inbox one. This is where we need to understand what we're trying to store and how we're doing it. When it comes to finding a candidate for a patch, blaze relies solely on the similarities between the two objects/emails, whereas public-inbox, via Git heuristics, will still prioritize a patch between emails that follow each other in time via "versioning".

The implication is that the last 2 emails may have no similarity at all, but Git/public-inbox will still try to patch them together, as one is the previous version (in terms of time) of the other.

Another aspect is that Git sometimes breaks the patch chain so that, when it comes to extracting an object that is a patch, its source isn't very far away in the PACK file. Git prefers to patch an object with a source that may be less good but close to it, rather than keeping the best candidate as the source for all patches. Here too, blaze behaves differently: we try to keep and reuse the best candidate as much as possible.
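
The contrast between the two strategies can be sketched with a toy model in Python. This is neither Git's real heuristic nor blaze's implementation: similarity is measured with difflib, emails are simply ordered by arrival, and pick_source with its window parameter is a hypothetical helper. A small window mimics "prefer a nearby source", while no window mimics "keep the best candidate wherever it is".

```python
from difflib import SequenceMatcher


def similarity(a: bytes, b: bytes) -> float:
    """Score in [0, 1]; 1.0 means the two objects are identical."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def pick_source(objects, index, window=None):
    """Pick the patch source for objects[index] among earlier objects.

    window=None considers every earlier object; window=k only the k
    objects immediately before it. Returns None if there is no candidate.
    """
    start = 0 if window is None else max(0, index - window)
    candidates = range(start, index)
    if not candidates:
        return None
    return max(candidates, key=lambda i: similarity(objects[i], objects[index]))


# Three hypothetical emails, in order of arrival.
emails = [
    b"Subject: [PATCH] fix parser\n\ndiff --git a/p.ml b/p.ml\n",
    b"Subject: lunch on friday?\n\nAnyone up for pizza?\n",
    b"Subject: Re: [PATCH] fix parser\n\ndiff --git a/p.ml b/p.ml\nLGTM\n",
]

by_content = pick_source(emails, 2)             # best match anywhere
by_proximity = pick_source(emails, 2, window=1) # only the previous email
```

With no window, the reply is patched against the original patch email (index 0). With a window of one, it is patched against the unrelated email that happened to arrive just before it (index 1).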

A final difference, which may also be important, is the way in which emails are stored. We often describe our emails as "split" (each part is stored as its own object), whereas public-inbox stores them as they are. One implication concerns the extraction of an attachment. With blaze, we would just have to extract the skeleton of the email, search the various headers for the desired attachment and extract the attachment as is, since it is a full-fledged object in our PACK file.

With public-inbox, we'd have to extract the email, parse it, then search for the part containing the attachment according to its headers and finally extract the attachment.
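
The "parse everything, then look for the part" path can be sketched with Python's standard email package. This is only an illustration of the steps on a whole stored email: it does not use public-inbox's or blaze's code, and the message, the file name fix.ml and the helper extract_attachment are invented for the example.

```python
from email import message_from_bytes, policy
from email.message import EmailMessage


def extract_attachment(raw: bytes, filename: str):
    """Parse a whole email and return the named attachment, or None."""
    message = message_from_bytes(raw, policy=policy.default)
    for part in message.iter_attachments():
        if part.get_filename() == filename:
            # Undo the transfer encoding (base64 here) to get the bytes.
            return part.get_payload(decode=True)
    return None


# Build a hypothetical email as it would be stored "as is".
msg = EmailMessage()
msg["From"] = "alice@example.org"
msg["To"] = "list@example.org"
msg["Subject"] = "patch attached"
msg.set_content("See attachment.\n")
msg.add_attachment(
    b'let () = print_endline "hi"\n',
    maintype="application",
    subtype="octet-stream",
    filename="fix.ml",
)
raw = msg.as_bytes()

attachment = extract_attachment(raw, "fix.ml")
```

Every step here works on the full message: the whole email is parsed and the attachment is decoded from it. When the attachment is already a separate object in the PACK file, only the skeleton has to be read to locate it.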


Conclusion

If we had to draw a "meta" conclusion from the differences between blaze and public-inbox, it's that our tool focuses on the content of your emails, whereas public-inbox focuses on their history. As such, and in the hope of making an OCaml-based email client, we believe our approach remains interesting.


But these experiments have shown us 2 important things:
