diff --git a/articles/2025-01-07-carton-and-cachet.md b/articles/2025-01-07-carton-and-cachet.md
index acc0fad..a28d61c 100644
--- a/articles/2025-01-07-carton-and-cachet.md
+++ b/articles/2025-01-07-carton-and-cachet.md
@@ -443,6 +443,144 @@ of our project.
 So, if you like what we're doing and want to help, you can make a donation via
 [GitHub][donate-github] or using our [IBAN][donate-iban].
 
+
+
+## Post
+
+This little note is an extension of the feedback we got on the Carton release.
+[nojb][nojb], in this case, pointed to the [public-inbox][public-inbox]
+software as the archiver of the various Linux kernel mailing lists. The latter
+is based on the same intuition we had, namely to use the PACK format to archive
+emails.
+
+The question then arises: are we starting to reinvent the wheel?
+
+In truth, the devil is in the detail. As it happens, you can download LKML
+mailing list archives with Git in this way:
+```shell
+$ git clone --mirror http://lore.kernel.org/lkml/15 lkml/git/15.git
+$ cd lkml/git/15.git
+$ du -sh objects/pack/pack-*.pack
+981M objects/pack/pack-*.pack
+$ cd objects/pack/
+$ mkdir loose
+$ carton explode 'loose/%s/%s' pack-*.pack
+$ du -sh loose/c/
+2.7G loose/c
+```
+`public-inbox` is based not only on the PACK format for email archiving, but
+also on Git concepts. In this case, such a Git repository actually only contains
+an `m` file corresponding to the last email received on the mailing list. The
+other emails are "old versions of this email". In this case, `public-inbox`
+considers a certain form of _versioning_ between emails. Each commit is a new
+email and will "replace" the previous one.
+
+### Heuristics to patch
+
+`public-inbox` then relies on the heuristics implemented by Git to find the best
+candidate for patching emails. These heuristics are explained
+[here][git-heuristics]. The idea is to consider a base object (which will be the
+source of several patches) as the **last** version of your file (in the case of
+`public-inbox`, the last email received) and build patches of previous versions
+with this base object. The heuristic comes from the spontaneous idea that, when
+it comes to software files, these grow entropically. The latest version is
+therefore most likely to contain all the similarities with previous versions.
+
+Once again, when it comes to code, we tend to add code. So we should be able to
+use all the occurrences available in the latest version of a file to produce
+patches for earlier versions.
+
+### Comparison
+
+Let's have some fun comparing `public-inbox` and the `blaze` tool:
+```markdown
+            +-------+--------------+------+
+            | blaze | public-inbox | raw  |
++-----------+-------+--------------+------+
+| caml-list | 160M  | 154M         | 425M |
++-----------+-------+--------------+------+
+| lkml.15   | 1.1G  | 981M         | 2.7G |
++-----------+-------+--------------+------+
+| kvm.0     | 1.2G  | 1.1G         | 3.1G |
++-----------+-------+--------------+------+
+```
+
+The first thing you'll notice is that `blaze` produces PACK files that are a
+little larger than those produced by Git. The problem is that `blaze` doesn't
+store exactly the same thing! The emails it stores are emails with lines ending
+in `\r\n`, whereas `public-inbox` stores emails with `\n`. It may just be a
+small character, but multiplied by the number of lines in an email and the
+number of emails in the archive, it adds up.
+
+It's also true that [decompress][decompress], the OCaml implementation of zlib,
+is not as efficient as its C competitor in terms of ratio. So this is a
+disadvantage we have, which is not linked to the way we generate the PACK file
+(we could replace `decompress` with zlib!).
+
+However, there's another interesting metric between what we produce and what
+`public-inbox` does. It's important to understand that we maintain "some
+compatibility" with the Git PACK file. The objects aren't the same and don't
+have the same meaning, but it's still a PACK file. As such, we can use `git
+verify-pack` on our archive as on the `public-inbox` archive:
+
+```markdown
+            +-----------------+------------------------+
+            | PACK from blaze | PACK from public-inbox |
++-----------+-----------------+------------------------+
+| caml-list | ~2.5s           | ~4.1s                  |
++-----------+-----------------+------------------------+
+| lkml.15   | ~14.7s          | ~16.3s                 |
++-----------+-----------------+------------------------+
+| kvm.0     | ~18s            | ~21s                   |
++-----------+-----------------+------------------------+
+```
+
+The analysis of our PACK file is faster than that of the `public-inbox` one. This is
+where we need to understand what we're trying to store and how we're doing it.
+When it comes to finding a candidate for a patch, `blaze` relies solely on the
+similarities between the two objects/emails they have, whereas `public-inbox`,
+via Git heuristics, will still prioritize a patch between emails that follow
+each other in time via "versioning".
+
+The implication is that the last 2 emails may have no similarity at all, but
+Git/`public-inbox` will still try to patch them together, as one is the
+_previous version_ (in terms of time) of the other.
+
+Another aspect is that Git sometimes breaks _the patch chain_ so that, when it
+comes to extracting an object, if it's a patch, its source shouldn't be very far
+away in the PACK file. Git prefers to patch an object with a source that may be
+less good but close to it, rather than keeping the best candidate as the source
+for all patches. Here too, `blaze` reacts differently: we try to keep and reuse
+the best candidate as much as possible.
+
+A final difference, which may also be important, is the way in which emails are
+stored. We often refer to emails as "split", whereas `public-inbox` only stores
+them as they are. One implication of this can be the extraction of an
+attachment. As far as `blaze` is concerned, we would just have to extract the
+_skeleton_ of the email, search in the various headers for the desired
+attachment and extract the attachment as is, which is a full-fledged object in
+our PACK file.
+
+As for `public-inbox`, we'd have to extract the email, **parse** the email, then
+search for the part containing the attachment according to the header and
+finally extract the attachment.
+
+### Conclusion
+
+If we had to draw a "meta" conclusion from the differences between `blaze` and
+`public-inbox`, it's that our tool focuses on the content of your emails,
+whereas `public-inbox` focuses on the historicity of your emails. As such, and
+in the hope of making an OCaml-based email client, we believe our approach
+remains interesting.
+
+But these experiments have shown us 2 important things:
+- we're capable of handling millions of emails, parsing and storing them. It's
+  pretty impressive to see our tool handle almost a million emails (`kvm.0`)
+  without any bugs!
+- the second thing is that our initial intuition remains valid. Even if the path
+  seems subtly different from what `public-inbox` can do, our approach is
+  clearly the right one and keeps us going.
+
 [patience-diff]: https://opensource.janestreet.com/patdiff/
 [eugene-myers-diff]: https://www.nathaniel.ai/myers-diff/
 [duff]: https://github.com/mirage/duff
@@ -460,3 +598,7 @@ So, if you like what we're doing and want to help, you can make a donation via
 [donate-github]: https://github.com/sponsors/robur-coop
 [donate-iban]: https://robur.coop/Donate
 [miou]: https://github.com/robur-coop/miou
+[nojb]: https://discuss.ocaml.org/t/ann-release-of-carton-1-0-0-and-cachet/15953/2?u=dinosaure
+[public-inbox]: https://public-inbox.org/README.html
+[decompress]: https://github.com/mirage/decompress
+[git-heuristics]: https://github.com/git/git/blob/master/Documentation/technical/pack-heuristics.txt