Merge pull request 'Extend the article about carton with some notes' (#27) from extend-carton into main

Reviewed-on: #27
dinosaure 2025-01-17 16:37:54 +00:00
commit d74e429ba7


@ -443,6 +443,144 @@ of our project.
So, if you like what we're doing and want to help, you can make a donation via
[GitHub][donate-github] or using our [IBAN][donate-iban].
<hr />
## Post
This short note follows up on the feedback we received after the Carton release.
[nojb][nojb] pointed us to [public-inbox][public-inbox], the software that
archives the various Linux kernel mailing lists. It is based on the same
intuition we had, namely using the PACK format to archive emails.
The question then arises: are we reinventing the wheel?
In truth, the devil is in the details. As it happens, you can download the LKML
mailing list archives with Git like this:
```shell
$ git clone --mirror http://lore.kernel.org/lkml/15 lkml/git/15.git
$ cd lkml/git/15.git
$ du -sh objects/pack/pack-*.pack
981M objects/pack/pack-*.pack
$ cd objects/pack/
$ mkdir loose
$ carton explode 'loose/%s/%s' pack-*.pack
$ du -sh loose/c/
2.7G loose/c
```
`public-inbox` relies not only on the PACK format for email archiving, but also
on Git concepts. Such a Git repository actually contains a single `m` file,
which corresponds to the last email received on the mailing list. The other
emails are "old versions" of this file. In other words, `public-inbox` treats
the archive as a form of _versioning_ between emails: each commit is a new
email that "replaces" the previous one.
### Heuristics to patch
`public-inbox` then relies on the heuristics implemented by Git to find the best
candidate for patching emails. These heuristics are explained
[here][git-heuristics]. The idea is to take the **last** version of a file (in
the case of `public-inbox`, the last email received) as the base object, which
will be the source of several patches, and to build the patches of previous
versions against it. The heuristic comes from the intuition that software files
tend to grow over time: when it comes to code, we mostly add code. The latest
version is therefore the most likely to contain everything it has in common with
previous versions, so we should be able to reuse the content available in the
latest version of a file to produce patches for earlier versions.
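To make this intuition concrete, here is a minimal sketch in Python. It is not
Git's delta algorithm: `difflib` is only a stand-in, and the names `make_delta`,
`apply_delta` and `literal_bytes` are ours. It only shows that, when a file
grows by addition, patching an old version against the latest one needs no new
bytes at all, whereas the opposite direction must carry everything that was
added.

```python
from difflib import SequenceMatcher


def make_delta(base: bytes, target: bytes) -> list:
    """Encode `target` as copy-from-base and insert-literal operations."""
    ops = []
    matcher = SequenceMatcher(None, base, target, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            # Bytes already present in the base: reference them.
            ops.append(("copy", i1, i2 - i1))
        elif tag in ("replace", "insert"):
            # Bytes absent from the base: carry them in the patch.
            ops.append(("insert", target[j1:j2]))
        # "delete": these base bytes are simply not used by the target.
    return ops


def apply_delta(base: bytes, ops: list) -> bytes:
    """Rebuild the target from the base and the list of operations."""
    out = bytearray()
    for op in ops:
        if op[0] == "copy":
            _, offset, length = op
            out += base[offset:offset + length]
        else:
            out += op[1]
    return bytes(out)


def literal_bytes(ops: list) -> int:
    """Number of bytes the patch has to store itself."""
    return sum(len(op[1]) for op in ops if op[0] == "insert")


# Three hypothetical versions of a file that only grows.
v1 = b"let f x = x + 1\n"
v2 = v1 + b"let g x = f x * 2\n"
v3 = v2 + b"let h x = g (f x)\n"

old_from_latest = make_delta(v3, v1)  # latest version as the base
latest_from_old = make_delta(v1, v3)  # oldest version as the base
print(literal_bytes(old_from_latest), literal_bytes(latest_from_old))
```

With the latest version as the base, the patch for `v1` is a single copy. With
`v1` as the base, the patch for `v3` has to embed all the code added since.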
### Comparison
Let's have some fun comparing `public-inbox` and the `blaze` tool:
```markdown
            +-------+--------------+------+
            | blaze | public-inbox | raw  |
+-----------+-------+--------------+------+
| caml-list | 160M  | 154M         | 425M |
+-----------+-------+--------------+------+
| lkml.15   | 1.1G  | 981M         | 2.7G |
+-----------+-------+--------------+------+
| kvm.0     | 1.2G  | 1.1G         | 3.1G |
+-----------+-------+--------------+------+
```
The first thing you'll notice is that `blaze` produces PACK files that are a
little larger than those produced by Git. However, `blaze` doesn't store exactly
the same thing! The emails it stores have lines ending in `\r\n`, whereas
`public-inbox` stores emails with `\n`. It may be just one extra byte, but
multiplied by the number of lines in an email and the number of emails in the
archive, it adds up.
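The cost of the line endings is easy to estimate: one extra byte per line. The
small Python sketch below shows it on a made-up email (the sample and the
function name `crlf_overhead` are ours, and the input is assumed to use `\n`
only):

```python
def crlf_overhead(email_lf: bytes) -> int:
    """Extra bytes needed to store an LF-only email with CRLF line endings."""
    email_crlf = email_lf.replace(b"\n", b"\r\n")
    return len(email_crlf) - len(email_lf)


# A made-up email with 5 lines, for illustration only.
sample = (
    b"From: alice@example.org\n"
    b"Subject: hello\n"
    b"\n"
    b"First line\n"
    b"Second line\n"
)

print(crlf_overhead(sample))  # one extra byte per line
```

Summing this value over every email of an archive gives the raw overhead due
to `\r\n`, before compression.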
It's also true that [decompress][decompress], the OCaml implementation of zlib,
is not as efficient as its C competitor in terms of compression ratio. This is a
disadvantage for us, but it is not linked to the way we generate the PACK file
(we could replace `decompress` with zlib!).
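By "ratio" we mean the compressed size divided by the raw size. The sketch
below only shows how such a ratio can be measured. It uses Python's `zlib`
module and two compression levels as a stand-in for two implementations, so it
says nothing about `decompress` itself, and the sample data is made up.

```python
import zlib


def compression_ratio(raw: bytes, level: int) -> float:
    """Compressed size divided by raw size (lower is better)."""
    compressed = zlib.compress(raw, level)
    # Sanity check: the data must survive a round trip.
    assert zlib.decompress(compressed) == raw
    return len(compressed) / len(raw)


# Made-up, repetitive header lines, similar in spirit to email headers.
sample = b"".join(
    b"Received: from host%d.example.org by mx.example.org\r\n" % i
    for i in range(200)
)

fast = compression_ratio(sample, 1)  # fastest level
best = compression_ratio(sample, 9)  # level aiming at the best ratio
print(f"level 1: {fast:.3f}, level 9: {best:.3f}")
```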
However, there's another interesting metric between what we produce and what
`public-inbox` does. It's important to understand that we maintain "some
compatibility" with the Git PACK file. The objects aren't the same and don't
have the same meaning, but it's still a PACK file. As such, we can use `git
verify-pack` on our archive as on the `public-inbox` archive:
```markdown
            +-----------------+------------------------+
            | PACK from blaze | PACK from public-inbox |
+-----------+-----------------+------------------------+
| caml-list | ~2.5s           | ~4.1s                  |
+-----------+-----------------+------------------------+
| lkml.15   | ~14.7s          | ~16.3s                 |
+-----------+-----------------+------------------------+
| kvm.0     | ~18s            | ~21s                   |
+-----------+-----------------+------------------------+
```
Our PACK file is verified faster than the one from `public-inbox`. To see why,
we need to understand what each tool is trying to store and how it does it.
When it comes to finding a candidate for a patch, `blaze` relies solely on the
similarity between two objects/emails, whereas `public-inbox`, via Git's
heuristics, prioritizes a patch between emails that follow each other in time
via "versioning".
The implication is that the last two emails may have nothing in common, but
Git/`public-inbox` will still try to patch them together, as one is the
_previous version_ (in terms of time) of the other.
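The two strategies can be contrasted with a toy example in Python. This is a
sketch, not the code of `blaze` or Git: the similarity measure comes from
`difflib`, the emails are made up, and the function names are ours.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Similarity score between 0.0 (nothing shared) and 1.0 (identical)."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def previous_in_time(index: int):
    """Temporal choice: the candidate is the email received just before."""
    return index - 1 if index > 0 else None


def most_similar(emails: list, index: int):
    """Content-based choice: the candidate is the most similar other email."""
    candidates = [i for i in range(len(emails)) if i != index]
    return max(
        candidates,
        key=lambda i: similarity(emails[i], emails[index]),
        default=None,
    )


# Made-up emails, in order of arrival.
emails = [
    "Subject: [PATCH] kvm: fix vcpu reset\n\nThis patch fixes the vcpu reset path.\n",
    "Subject: Meeting notes\n\nLunch is at noon on Friday.\n",
    "Subject: [PATCH v2] kvm: fix vcpu reset\n\nThis patch fixes the vcpu reset path again.\n",
]

print(previous_in_time(2), most_similar(emails, 2))
```

For the third email, the temporal choice picks the unrelated email received
just before, while the content-based choice picks the first version of the
patch.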
Another aspect is that Git sometimes breaks _the patch chain_ so that, when it
comes to extracting an object that is a patch, its source isn't very far away
in the PACK file. Git prefers to patch an object with a source that may be less
good but close to it, rather than keeping the best candidate as the source for
all patches. Here too, `blaze` behaves differently: we try to keep and reuse
the best candidate as much as possible.
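The shape of the patch chain matters at extraction time, because every patch
between an object and its final source must be applied. Here is a small Python
sketch under our own assumptions (the object names and the `chain_depth` helper
are hypothetical) comparing a chain where each email is patched against its
neighbour with a layout where one best candidate is reused for all patches:

```python
def chain_depth(bases: dict, obj: str) -> int:
    """Number of patches to apply before `obj` can be rebuilt."""
    depth = 0
    while bases[obj] is not None:
        obj = bases[obj]
        depth += 1
    return depth


# Each object maps to the source of its patch (None: stored in full).
# "Chained": every email is patched against the next one in time.
chained = {"e1": "e2", "e2": "e3", "e3": "e4", "e4": None}
# "Reused": a single best candidate is the source of all patches.
reused = {"e1": "e4", "e2": "e4", "e3": "e4", "e4": None}

print(chain_depth(chained, "e1"), chain_depth(reused, "e1"))
```

In the first layout, rebuilding `e1` requires three patches. In the second, one
patch is always enough, whichever email is requested.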
A final difference, which may also be important, is the way in which emails are
stored. `blaze` stores emails "split" into their parts, whereas `public-inbox`
stores them as they are. One implication concerns the extraction of an
attachment. With `blaze`, we would just have to extract the _skeleton_ of the
email, search its headers for the desired attachment and extract the attachment
as is, since it is a full-fledged object in our PACK file.
With `public-inbox`, we'd have to extract the email, **parse** it, then search
for the part containing the attachment according to its headers and finally
extract the attachment.
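To give an idea of the work involved when the email is stored as is, here is a
sketch using Python's standard `email` package (this is neither the code of
`public-inbox` nor of `blaze`; the sample message and the function names are
ours). The whole message has to be parsed before the attachment can be found
and decoded:

```python
from email import message_from_bytes, policy
from email.message import EmailMessage


def build_sample() -> bytes:
    """Build a made-up email with one attachment, as raw bytes."""
    msg = EmailMessage()
    msg["From"] = "alice@example.org"
    msg["To"] = "list@example.org"
    msg["Subject"] = "patch attached"
    msg.set_content("See the attached patch.\n")
    msg.add_attachment(
        b"--- a/file\n+++ b/file\n",
        maintype="application",
        subtype="octet-stream",
        filename="fix.patch",
    )
    return msg.as_bytes()


def extract_attachment(raw: bytes, filename: str):
    """Parse the whole email, then look for the requested attachment."""
    msg = message_from_bytes(raw, policy=policy.default)
    for part in msg.iter_attachments():
        if part.get_filename() == filename:
            # Undo the transfer encoding (base64 here) to get the bytes back.
            return part.get_payload(decode=True)
    return None


raw = build_sample()
print(extract_attachment(raw, "fix.patch"))
```

When the attachment is already a separate object, the parsing step disappears:
only the headers of the skeleton are needed to find which object to read.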
### Conclusion
If we had to draw a "meta" conclusion from the differences between `blaze` and
`public-inbox`, it's that our tool focuses on the content of your emails,
whereas `public-inbox` focuses on the history of your emails. As such, and
in the hope of making an OCaml-based email client, we believe our approach
remains interesting.
But these experiments have shown us 2 important things:
- we're capable of handling millions of emails, parsing and storing them. It's
pretty impressive to see our tool handle almost a million emails (`kvm.0`)
without any bugs!
- the second thing is that our initial intuition remains valid. Even if the path
seems subtly different from what `public-inbox` can do, our approach is
clearly the right one and keeps us going.
[patience-diff]: https://opensource.janestreet.com/patdiff/
[eugene-myers-diff]: https://www.nathaniel.ai/myers-diff/
[duff]: https://github.com/mirage/duff
@ -460,3 +598,7 @@ So, if you like what we're doing and want to help, you can make a donation via
[donate-github]: https://github.com/sponsors/robur-coop
[donate-iban]: https://robur.coop/Donate
[miou]: https://github.com/robur-coop/miou
[nojb]: https://discuss.ocaml.org/t/ann-release-of-carton-1-0-0-and-cachet/15953/2?u=dinosaure
[public-inbox]: https://public-inbox.org/README.html
[decompress]: https://github.com/mirage/decompress
[git-heuristics]: https://github.com/git/git/blob/master/Documentation/technical/pack-heuristics.txt