Merge pull request 'Extend the article about carton with some notes' (#27) from extend-carton into main

Reviewed-on: #27
2025-01-17 16:37:54 +00:00
commit d74e429ba7
parent dd0fc6e95e 8ccb063981
1 changed files with 142 additions and 0 deletions
--- a/articles/2025-01-07-carton-and-cachet.md
+++ b/articles/2025-01-07-carton-and-cachet.md
@@ -443,6 +443,144 @@ of our project.
 So, if you like what we're doing and want to help, you can make a donation via
 [GitHub][donate-github] or using our [IBAN][donate-iban].

+<hr />
+
+## Post
+
+This short note follows up on the feedback we received after the Carton
+release. [nojb][nojb] pointed us to [public-inbox][public-inbox], the software
+that archives the various Linux kernel mailing lists. It rests on the same
+intuition we had: use the PACK format to archive emails.
+
+The question then arises: are we reinventing the wheel?
+
+In truth, the devil is in the details. As it happens, you can download the LKML
+mailing list archives with Git like this:
+```shell
+$ git clone --mirror http://lore.kernel.org/lkml/15 lkml/git/15.git
+$ cd lkml/git/15.git
+$ du -sh objects/pack/pack-*.pack
+981M	objects/pack/pack-*.pack
+$ cd objects/pack/
+$ mkdir loose
+$ carton explode 'loose/%s/%s' pack-*.pack
+$ du -sh loose/c/
+2.7G	loose/c
+```
+`public-inbox` relies not only on the PACK format for email archiving, but also
+on Git concepts. Such a Git repository actually contains a single `m` file,
+which corresponds to the last email received on the mailing list. The other
+emails are "old versions" of that file. In other words, `public-inbox` treats
+the archive as a form of _versioning_ between emails: each commit is a new
+email that "replaces" the previous one.
+
+### Heuristics to patch
+
+`public-inbox` then relies on the heuristics implemented by Git to find the best
+candidate for patching emails. These heuristics are explained
+[here][git-heuristics]. The idea is to take the **last** version of a file (for
+`public-inbox`, the last email received) as the base object, which becomes the
+source of several patches, and to express previous versions as patches against
+it. The heuristic rests on the intuition that software files tend to grow over
+time. The latest version is therefore the most likely to contain everything it
+has in common with previous versions.
+
+Put differently, when it comes to code, we tend to add code. So we should be
+able to reuse the content available in the latest version of a file to produce
+patches for earlier versions.
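+
+A minimal sketch of that intuition, in Python rather than OCaml and using the
+standard `difflib` module instead of Git's actual delta encoder (the `reusable`
+helper and the three versions are hypothetical, made up for illustration): when
+a file only grows, the latest version can serve as the source for every earlier
+version, while the reverse is not true.
+
+```python
+from difflib import SequenceMatcher
+
+
+def reusable(base: str, target: str) -> int:
+    """Count the characters of `target` covered by blocks shared with `base`."""
+    matcher = SequenceMatcher(None, base, target, autojunk=False)
+    return sum(block.size for block in matcher.get_matching_blocks())
+
+
+# Three successive versions of a file that only grows.
+v1 = "let x = 1\n"
+v2 = v1 + "let y = 2\n"
+v3 = v2 + "let z = x + y\n"
+
+# With the latest version as base, earlier versions are fully covered.
+for older in (v1, v2):
+    print(reusable(v3, older) == len(older))
+
+# With the oldest version as base, the latest one is only partly covered.
+print(reusable(v1, v3) < len(v3))
+```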
+
+### Comparison
+
+Let's have some fun comparing `public-inbox` and the `blaze` tool:
+```markdown
+            +-------+--------------+------+
+            | blaze | public-inbox |  raw |
++-----------+-------+--------------+------+
+| caml-list |  160M |         154M | 425M |
++-----------+-------+--------------+------+
+| lkml.15   |  1.1G |         981M | 2.7G |
++-----------+-------+--------------+------+
+| kvm.0     |  1.2G |         1.1G | 3.1G |
++-----------+-------+--------------+------+
+```
+
+The first thing you'll notice is that `blaze` produces PACK files that are a
+little larger than those produced by Git. One reason is that `blaze` doesn't
+store exactly the same thing! It stores emails whose lines end in `\r\n`,
+whereas `public-inbox` stores emails whose lines end in `\n`. It may be just
+one extra character, but multiplied by the number of lines in an email and the
+number of emails in the archive, it adds up.
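+
+As a rough illustration of that overhead before any compression (a sketch in
+Python; the `to_crlf` helper and the sample message are hypothetical), each
+line costs exactly one extra byte once its ending becomes `\r\n`:
+
+```python
+def to_crlf(message: bytes) -> bytes:
+    """Rewrite every line ending of `message` as CRLF."""
+    # Normalize to LF first so that existing CRLF endings are not doubled.
+    return message.replace(b"\r\n", b"\n").replace(b"\n", b"\r\n")
+
+
+lf = b"From: alice@example.org\nSubject: hello\n\nHello\nWorld\n"
+crlf = to_crlf(lf)
+
+# One extra byte per line in the message.
+overhead = len(crlf) - len(lf)
+print(overhead, lf.count(b"\n"))
+```
+
+How much of this difference survives delta encoding and zlib compression is not
+measured here; the sketch only shows the raw difference.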
+
+It's also true that [decompress][decompress], the OCaml implementation of zlib,
+is not as efficient as its C competitor in terms of compression ratio. This is
+a disadvantage for us, but it is not tied to the way we generate the PACK file
+(we could replace `decompress` with zlib!).
+
+However, there's another interesting metric to compare between what we produce
+and what `public-inbox` produces. It's important to understand that we maintain
+"some compatibility" with the Git PACK file. The objects aren't the same and
+don't have the same meaning, but it's still a PACK file. As such, we can run
+`git verify-pack` on our archive as well as on the `public-inbox` archive and
+measure how long it takes:
+
+```markdown
+            +-----------------+------------------------+
+            | PACK from blaze | PACK from public-inbox |
++-----------+-----------------+------------------------+
+| caml-list |           ~2.5s |                  ~4.1s |
++-----------+-----------------+------------------------+
+| lkml.15   |          ~14.7s |                 ~16.3s |
++-----------+-----------------+------------------------+
+| kvm.0     |            ~18s |                   ~21s |
++-----------+-----------------+------------------------+
+```
+
+Analyzing our PACK file is faster than analyzing the one from `public-inbox`.
+This is where we need to understand what we're trying to store and how we're
+doing it. When it comes to finding a candidate for a patch, `blaze` relies
+solely on the similarities between two objects/emails, whereas `public-inbox`,
+via Git heuristics, will still prioritize a patch between emails that follow
+each other in time via "versioning".
+
+The implication is that the last 2 emails may have no similarity at all, but
+Git/`public-inbox` will still try to patch them together, as one is the
+_previous version_ (in terms of time) of the other.
+
+Another aspect is that Git sometimes breaks _the patch chain_ so that, when an
+object stored as a patch has to be extracted, its source isn't very far away in
+the PACK file. Git prefers to patch an object against a source that may be less
+good but close to it, rather than keeping the best candidate as the source for
+all patches. Here too, `blaze` behaves differently: we try to keep and reuse
+the best candidate as much as possible.
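+
+A toy model of that trade-off (a sketch in Python; the `chain_depth` helper and
+the object names are hypothetical and say nothing about how Git or `blaze`
+actually lay out their PACK files): the number of patches to apply before an
+object can be read depends on how patches are linked to their sources.
+
+```python
+from typing import Dict
+
+
+def chain_depth(source_of: Dict[str, str], obj: str) -> int:
+    """Count the patches to apply before `obj` is fully reconstructed."""
+    depth = 0
+    while obj in source_of:
+        obj = source_of[obj]
+        depth += 1
+    return depth
+
+
+# Each email is patched against the one received just before it.
+chained = {"e2": "e1", "e3": "e2", "e4": "e3"}
+# Every email is patched against the same best candidate.
+shared = {"e2": "e1", "e3": "e1", "e4": "e1"}
+
+print(chain_depth(chained, "e4"))
+print(chain_depth(shared, "e4"))
+```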
+
+A final difference, which may also be important, is the way in which emails are
+stored. We store emails "split" into parts, whereas `public-inbox` stores them
+as they are. One implication concerns the extraction of an attachment. With
+`blaze`, we would just have to extract the _skeleton_ of the email, search the
+various headers for the desired attachment and extract the attachment as is,
+since it is a full-fledged object in our PACK file.
+
+As for `public-inbox`, we'd have to extract the email, **parse** the email, then
+search for the part containing the attachment according to the header and
+finally extract the attachment.
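+
+The steps on the `public-inbox` side can be sketched with Python's standard
+`email` package (this is neither `public-inbox` nor `blaze` code; the
+`extract_attachment` helper, the addresses and the file name are hypothetical):
+the whole message has to be parsed before the attachment can be reached.
+
+```python
+from email import message_from_bytes, policy
+from email.message import EmailMessage
+from typing import Optional
+
+
+def extract_attachment(raw: bytes, filename: str) -> Optional[bytes]:
+    """Parse a whole email and return the decoded attachment `filename`."""
+    message = message_from_bytes(raw, policy=policy.default)
+    for part in message.iter_attachments():
+        if part.get_filename() == filename:
+            return part.get_payload(decode=True)
+    return None
+
+
+# Build a small email with one attachment, as it would sit in an archive.
+payload = b"hello from the attachment\n"
+original = EmailMessage()
+original["From"] = "alice@example.org"
+original["To"] = "list@example.org"
+original["Subject"] = "notes attached"
+original.set_content("See the attached notes.\n")
+original.add_attachment(
+    payload,
+    maintype="application",
+    subtype="octet-stream",
+    filename="notes.bin",
+)
+raw = original.as_bytes()
+
+print(extract_attachment(raw, "notes.bin") == payload)
+```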
+
+### Conclusion
+
+If we had to draw a "meta" conclusion from the differences between `blaze` and
+`public-inbox`, it's that our tool focuses on the content of your emails,
+whereas `public-inbox` focuses on the history of your emails. As such, and
+in the hope of making an OCaml-based email client, we believe our approach
+remains interesting.
+
+But these experiments have shown us two important things:
+- we're capable of handling millions of emails, parsing and storing them. It's
+  pretty impressive to see our tool handle almost a million emails (`kvm.0`)
+  without any bugs!
+- our initial intuition remains valid. Even if the path seems subtly different
+  from what `public-inbox` does, our approach is clearly the right one and
+  keeps us going.
+
 [patience-diff]: https://opensource.janestreet.com/patdiff/
 [eugene-myers-diff]: https://www.nathaniel.ai/myers-diff/
 [duff]: https://github.com/mirage/duff
@@ -460,3 +598,7 @@ So, if you like what we're doing and want to help, you can make a donation via
 [donate-github]: https://github.com/sponsors/robur-coop
 [donate-iban]: https://robur.coop/Donate
 [miou]: https://github.com/robur-coop/miou
+[nojb]: https://discuss.ocaml.org/t/ann-release-of-carton-1-0-0-and-cachet/15953/2?u=dinosaure
+[public-inbox]: https://public-inbox.org/README.html
+[decompress]: https://github.com/mirage/decompress
+[git-heuristics]: https://github.com/git/git/blob/master/Documentation/technical/pack-heuristics.txt