---
date: 2025-01-07
title: Git, Carton and emails
description: A way to store and archive your emails
tags:
- emails
- storage
- Git
author:
name: Romain Calascibetta
email: romain.calascibetta@gmail.com
link: https://blog.osau.re/
breaks: false
---
We are pleased to announce the release of Carton 1.0.0 and Cachet. You can find
an overview of these libraries in our announcement on the OCaml forum. This
article goes into more detail about the PACK format and its use for archiving
your emails.
## Back to Git and patches
In our Carton announcement, we talked about two levels of compression for Git
objects: zlib compression and compression between objects using a patch.
For example, if we have two blobs (two versions of a file), one of which
contains 'A' and the other 'A+B', the second blob will probably be saved in the
form of a patch requiring the contents of the first blob and adding '+B'. At a
higher level, and given the way we use Git, we can see why this second level of
compression is so interesting: in the files of our projects, we generally just
add a few lines (like introducing a new function) or delete some (removing
code).
However, there is a bias between what Git does and what we perceive. When it
comes to patches in the context of Git, we often think of the
[patience diff][patience-diff] or the [Eugene Myers diff][eugene-myers-diff].
While these offer the advantage of readability, in the sense of knowing what
has been added or deleted between two files, they are not necessarily optimal
for producing a _small_ patch.
In reality, when it comes to storing these patches and transmitting them over
the network, what interests us is not their readability but optimality in
identifying what two files have in common and what they do not. It is at this
stage that [duff][duff] comes in.
This is a small library which can generate a patch between two files based on
the series of bytes common to both. We talk about 'series of bytes' here
because the elements common to our two files are not necessarily
human-readable. To find these common series of bytes, we use [Rabin's
fingerprint][rabin] algorithm: [a rolling hash][rolling-hash] used since time
immemorial.
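To give an idea of how such a rolling hash works, here is a simplified
polynomial sketch in OCaml (this is an illustration of the principle, not the
actual Rabin polynomial arithmetic used by `duff`): sliding the window one byte
to the right only requires removing the contribution of the outgoing byte and
adding the incoming one, instead of rehashing the whole window.
```ocaml
let base = 257
let modulus = 1_000_000_007

(* Hash of the window [String.sub s off len], computed from scratch. *)
let hash s off len =
  let h = ref 0 in
  for i = off to off + len - 1 do
    h := (!h * base + Char.code s.[i]) mod modulus
  done;
  !h

(* [pow base (len - 1)], used to remove the outgoing byte. *)
let rec pow b n = if n = 0 then 1 else (b * pow b (n - 1)) mod modulus

(* Slide the window one byte to the right in O(1). *)
let roll ~len h ~outgoing ~incoming =
  let h = (h - Char.code outgoing * pow base (len - 1)) mod modulus in
  let h = if h < 0 then h + modulus else h in
  (h * base + Char.code incoming) mod modulus
```
Rolling the hash gives the same value as rehashing the new window, which is
what makes scanning a whole file for common series of bytes affordable.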
### Patches and emails
So, as far as emails are concerned, it's fairly obvious that there are many
"words" common to all of them. The simple word `From:` should exist in all
your emails.
From this simple idea, we can see the impact: the headers of your emails are
more or less similar and have more or less the same content. The idea of
`duff`, applied to your emails, is to consider all these other emails as
"slightly" different versions of your first email:
1) we can store a single raw email
2) and we build patches of all your other emails from this first one
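To make the idea concrete, a patch in this style can be sketched as a sequence
of instructions that either copy a range of bytes from the source or insert
literal bytes only present in the target (a simplification of the delta
principle, not `duff`'s actual API):
```ocaml
type instruction =
  | Copy of int * int (* offset and length into the source *)
  | Insert of string  (* literal bytes absent from the source *)

type patch = instruction list

(* Rebuild the target from the source and the patch. *)
let apply ~source patch =
  let buf = Buffer.create 0x100 in
  List.iter
    (function
      | Copy (off, len) -> Buffer.add_string buf (String.sub source off len)
      | Insert literal -> Buffer.add_string buf literal)
    patch;
  Buffer.contents buf
```
With the two blobs above, the patch rebuilding 'A+B' from 'A' would simply be
`[ Copy (0, 1); Insert "+B" ]`.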
A fairly concrete example of this compression through patches is two
notification emails from GitHub: these are quite similar, particularly in the
header. Even the content is just as similar: the HTML remains the same, only
the comment differs.
```shell
$ carton diff github.01.eml github.02.eml -o patch.diff
$ du -sb github.01.eml github.02.eml patch.diff
9239 github.01.eml
9288 github.02.eml
5136 patch.diff
```
This example shows that our patch for rebuilding `github.02.eml` from
`github.01.eml` is almost 2 times smaller in size. In this case, with the PACK
format, this patch will also be compressed with zlib (and we can reach ~2900
bytes, so 3 times smaller).
#### Compress and compress!
To put this into perspective, a compression algorithm like zlib can also reach
such a ratio (3 times smaller). But zlib also needs to serialise the Huffman
tree required for compression (in the general case). What can be observed is
that concatenating separately compressed emails makes it difficult to maintain
such a ratio. Worse, concatenating all the emails first and then compressing
the result gives us a better ratio!
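This can be observed with `gzip` alone. The following small script is a
deliberately extreme experiment: two identical 8 KiB "emails" made of random
bytes. Compressed separately, nothing can be shared between them; compressed
together, the second copy falls within the compression window and almost
disappears:
```shell
# Two "emails" with identical content: 8 KiB of the same random bytes.
head -c 8192 /dev/urandom > a.eml
cp a.eml b.eml
# Compressed separately: random bytes are incompressible, so each stream
# stays around 8 KiB and the total is around 16 KiB.
separate=$(( $(gzip -c a.eml | wc -c) + $(gzip -c b.eml | wc -c) ))
# Compressed together: the second copy is encoded as back-references into
# the first, so the total stays around 8 KiB.
together=$(cat a.eml b.eml | gzip -c | wc -c)
echo "separate: $separate bytes, together: $together bytes"
```
Real emails share far less than two identical files, but the mechanism is the
same: compressing them as one stream lets the shared bytes be encoded only
once.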
That's what the PACK file is all about: the aim is to be able to concatenate
these compressed emails and keep an interesting overall compression ratio. This
is the reason for the patch: to reduce the objects even further so that the
impact of the zlib _header_ on all our objects is minimal and, above
all, so that we can access objects **without** having to decompress the
previous ones (as we would have to do for a `*.tar.gz` archive, for example).
The initial intuition about the emails was right, they do indeed share quite a
few elements together and in the end we were able to save ~4000 bytes in our
GitHub notification example.
## Isomorphism, DKIM and ARC
One attribute that we wanted to pay close attention to throughout our
experimentation was "isomorphism". This property is very simple: imagine a
function that takes an email as input and transforms it into another value
using a method (such as compression). Isomorphism ensures that we can 'undo'
this method and obtain exactly the same result again:
```
decode(encode(x)) == x
```
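As a toy illustration of this property (nothing to do with Mr. MIME's actual
codec), here is a pair of functions that hex-encode and hex-decode a string:
the decoder recovers exactly the bytes the encoder was given, for any input.
```ocaml
(* Encode each byte as two hexadecimal characters. *)
let encode s =
  String.to_seq s
  |> Seq.map (fun c -> Printf.sprintf "%02x" (Char.code c))
  |> List.of_seq |> String.concat ""

(* Decode pairs of hexadecimal characters back into bytes. *)
let decode s =
  let n = String.length s / 2 in
  String.init n (fun i ->
    Char.chr (int_of_string ("0x" ^ String.sub s (i * 2) 2)))

(* The isomorphism property: decoding an encoding is the identity. *)
let isomorphic x = decode (encode x) = x
```
The codecs we actually need for emails are much more involved, but they must
satisfy exactly this equation, byte for byte.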
This property is very important for emails because signatures exist in your
email and these signatures result from the content of your email. If the email
changes, these signatures change too.
For instance, the DKIM signature allows you to sign an email and check its
integrity on receipt. ARC (which will be our next objective) also signs your
emails, but goes one step further: all the relays that receive your email and
send it back to the real destination must add a new ARC signature, just like
adding a new block to the Bitcoin blockchain.
So you need to make sure that the way you serialise your email (in a PACK file)
doesn't alter the content in order to keep these signatures valid! It just so
happens that here too we have a lot of experience with Git. Git has the same
constraint with [Merkle-Trees][merkle-tree] and as far as we're concerned, we've
developed a library that allows you to generate an encoder and a decoder from a
description and that respects the isomorphism property _by construction_: the
[encore][encore] library.
We could then store our emails as they are in the PACK file. However, the
advantage of `duff` really comes into play when several objects are similar. In
the case of Git, tree objects are similar to each other but not to commits,
for example. For emails, there is the same kind of distinction: email headers
are similar to each other but not to email contents.
You can therefore try to "split" emails into 2 parts, the header on one side and
the content on the other. We would then have a third value which would tell us
how to reconstruct our complete email (i.e. identify where the header is and
identify where the content is).
However, after years of reading email RFCs, I can tell you that things are much
more complex. Above all, this experience has enabled me to synthesise a
skeleton that all emails share:
```ocaml
(* multipart-body :=
     [preamble CRLF]
     --boundary transport-padding CRLF
     part
     ( CRLF --boundary transport-padding CRLF part )*
     CRLF
     --boundary-- transport-padding
     [CRLF epilogue]

   part := headers ( CRLF body )?
*)

(* whitespace kept after a boundary; preserved for isomorphism *)
type transport_padding = string

type 'octet body =
  | Multipart of 'octet multipart
  | Single of 'octet option
  | Message of 'octet t

and 'octet part = { headers : 'octet; body : 'octet body }

and 'octet multipart =
  { preamble : string
  ; epilogue : string * transport_padding
  ; boundary : string
  ; parts : (transport_padding * 'octet part) list }

and 'octet t = 'octet part
```
As you can see, the distinction is not only between the header and the content
but also between the parts of an email as soon as it has an attachment. You can
also have an email inside an email (and I'm always surprised to see that this
particular case is _frequent_). Finally, there's the annoying _preamble_ and
_epilogue_ of an email with several parts, which is often empty but necessary:
you always have to ensure isomorphism — even for "useless" bytes, they count for
signatures.
We therefore need to serialise this structure, and all we have to do is
transform a `string t` into a `SHA1.t t` so that our structure no longer
contains the actual content of our emails but a unique identifier referring to
this content, which will be available in our PACK file.
```ocaml
module Format : sig
  val t : SHA1.t Encore.t
end

let decode str =
  let parser = Encore.to_angstrom Format.t in
  Angstrom.parse_string ~consume:All parser str

let encode t =
  let emitter = Encore.to_lavoisier Format.t in
  Encore.Lavoisier.emit_string ~chunk:0x7ff t emitter
```
However, we need to check that the isomorphism is respected. You should be
aware that work has already been done on this subject for [Mr. MIME][mrmime]
with the [afl][afl] fuzzer: checking our assertion `x == decode(encode(x))`.
This ability to check isomorphism using afl has enabled us to use the latter to
generate valid random emails. This allows me to reintroduce the
[hamlet][hamlet] project, perhaps the biggest database of valid — but
incomprehensible — emails. So we've checked that our encoder/decoder for
"splitting" our emails respects isomorphism on this million emails.
## Carton, POP3 & mbox, some metrics
We can therefore split an email into several parts and calculate an optimal
patch between two similar pieces of content. So now you can start packaging!
This is where I'm going to reintroduce a tool that hasn't been released yet,
but which allows me to go even further with emails: [blaze][blaze].
This little tool is my _Swiss army knife_ for emails! And it's in this tool
that we're going to have fun deriving Carton so that it can manipulate emails
rather than Git objects. So we've implemented the very basic [POP3][pop3]
protocol (and thanks to [ocaml-tls][tls] for offering a free encrypted
connection) as well as the [mbox][mbox] format.
Neither of the two is recommended. The first is an old protocol, and
interacting with Gmail, for example, is very slow. The second is an old,
non-standardised format for storing your emails — and unfortunately it may be
the format used by your email client. After resolving a few bugs, such as the
unspecified behaviour of pop.gmail.com and the mix of CRLF and LF in the mbox
format, you'll end up with lots of emails that you'll have fun packaging!
```shell
$ mkdir mailbox
$ blaze.fetch pop3://pop.gmail.com -p $(cat password.txt) \
-u recent:romain.calascibetta@gmail.com -f 'mailbox/%s.eml' > mails.lst
$ blaze.pack make -o mailbox.pack mails.lst
$ tar czf mailbox.tar.gz mailbox
$ du -sh mailbox mailbox.pack mailbox.tar.gz
97M mailbox
28M mailbox.pack
23M mailbox.tar.gz
```
In this example, we download the emails from the last 30 days via POP3 and
store them in the `mailbox/` folder. This folder weighs 97M, and if we
compress it with gzip, we end up with 23M. The problem is that we need to
decompress the `mailbox.tar.gz` archive to extract the emails.
This is where the PACK file comes in handy: it weighs only 28M (so we're very
close to what `tar` and `gzip` can do), but we can rebuild our emails without
unpacking everything:
```shell
$ blaze.pack index mailbox.pack
$ blaze.pack list mailbox.pack | head -n1
0000000c 4e9795e268313245f493d9cef1b5ccf30cc92c33
$ blaze.pack get mailbox.idx 4e9795e268313245f493d9cef1b5ccf30cc92c33
Delivered-To: romain.calascibetta@gmail.com
...
```
Like Git, we now associate a hash with our emails and can retrieve them using
this hash. Like Git, we also calculate the `*.idx` file to associate the hash
with the position of the email in our PACK file. Just like Git (with `git show`
or `git cat-file`), we can now access our emails very quickly. So we now have a
database system (read-only) for our emails: we can now archive our emails!
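The role of the `*.idx` file can be sketched as follows (a simplification, not
Carton's actual on-disk layout, which also has a fanout table like Git's): a
table sorted by hash, mapping each object hash to its offset in the PACK file,
so that a lookup is a binary search rather than a scan of the whole PACK:
```ocaml
(* [table] is sorted by hash; return the PACK offset for [hash], if any. *)
let find (table : (string * int) array) hash =
  let rec go lo hi =
    if lo >= hi then None
    else
      let mid = (lo + hi) / 2 in
      let h, offset = table.(mid) in
      match compare hash h with
      | 0 -> Some offset
      | c when c < 0 -> go lo mid
      | _ -> go (mid + 1) hi
  in
  go 0 (Array.length table)
```
This is why `blaze.pack get` needs the `*.idx` file: given a hash, it can jump
straight to the right offset in the PACK file.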
Let's have a closer look at this PACK file. We've developed a tool more or less
similar to `git verify-pack` which lists all the objects in our PACK file and,
above all, gives us information such as the number of bytes needed to store
these objects:
```shell
$ blaze.pack verify mailbox.pack
4e9795e268313245f493d9cef1b5ccf30cc92c33 a 12 186 6257b7d4
...
517ccbc063d27dbd87122380c9cdaaadc9c4a1d8 b 666027 223 10 e8e534a6 cedfaf6dc22f3875ae9d4046ea2a51b9d5c6597a
```
It shows the hash of our object, its type (`a` for the structure of our email,
`b` for the content), its position in the PACK file, the number of bytes used
to store the object (!) and finally, for patched objects, the depth of the
patch, the checksum, and the source of the patch needed to rebuild the object.
Here, our first object is not patched, but the next object is. Note that it
only needs 223 bytes in the PACK file. But what is the real size of this
object?
```shell
$ carton get mailbox.idx 517ccbc063d27dbd87122380c9cdaaadc9c4a1d8 \
--raw --without-metadata | wc -c
2014
```
So we've gone from 2014 bytes to 223 bytes! That's almost a compression ratio of
10! In this case, the object is the content of an email. Guess which one? A
GitHub notification! If we go back to our very first example, we saw that we
could compress with a ratio close to 2. We could go further with zlib: we
compress the patch too. This example allows us to introduce one last feature of
PACK files: the depth.
```shell
$ carton get mailbox.idx 517ccbc063d27dbd87122380c9cdaaadc9c4a1d8
kind: b
length: 2014 byte(s)
depth: 10
cache misses: 586
cache hits: 0
tree: 000026ab
Δ 00007f78
...
Δ 0009ef74
Δ 000a29ab
...
```
In our example, our object requires a source which, in turn, is a patch
requiring another source, and so on (you can see this chain in `tree`).
The length of this patch chain corresponds to the depth of our object. There is
therefore a succession of patches between objects. What Carton tries to do is
to find the best patch from a window of possibilities and keep the best
candidates for reuse. If we unroll this chain of patches, we find a "base"
object (at `0x000026ab`) that is just compressed with zlib, and this base is
itself the content of another GitHub notification email. This shows that Carton
is quite good at finding the best candidate for the patch, which should be
similar content: here, another GitHub notification.
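The reconstruction of a patched object can be sketched like this (an
illustration of the principle, not Carton's implementation): each entry in the
PACK is either a base object or a patch pointing at its source, and
materialising an object means walking the chain down to a base:
```ocaml
type entry =
  | Base of string (* an object that is just zlib-compressed *)
  | Delta of { source : int               (* position of its source *)
             ; rebuild : string -> string (* apply the patch to it *) }

(* Rebuild the object at [pos] by resolving its chain of sources. *)
let rec materialize entries pos =
  match entries.(pos) with
  | Base bytes -> bytes
  | Delta { source; rebuild } -> rebuild (materialize entries source)

(* The depth of an object is the length of that chain. *)
let depth entries pos =
  let rec go acc pos =
    match entries.(pos) with
    | Base _ -> acc
    | Delta { source; _ } -> go (acc + 1) source
  in
  go 0 pos
```
Each extra level of depth is one more patch to apply at read time, which is
exactly the time/space trade-off discussed below.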
The idea is to sacrifice a little computing time (in the reconstruction of
objects via their patches) to gain in compression ratio. It's fair to say that
a very long patch chain can degrade performance. However, there is a limit in
Git and Carton: a chain can't be longer than 50. Another point is the search for
the candidate source for the patch, which is often physically close to the patch
(within a few bytes): reading the PACK file by page (thanks to [Cachet][cachet])
sometimes gives access to 3 or 4 objects, which have a certain chance of being
patched together.
Let's take the example of Carton and a Git object:
```shell
$ carton get pack-*.idx eaafd737886011ebc28e6208e03767860c22e77d
...
cache misses: 62
cache hits: 758
tree: 160720bb
Δ 160ae4bc
Δ 160ae506
Δ 160ae575
Δ 160ae5be
Δ 160ae5fc
Δ 160ae62f
Δ 160ae667
Δ 160ae6a5
Δ 160ae6db
Δ 160ae72a
Δ 160ae766
Δ 160ae799
Δ 160ae81e
Δ 160ae858
Δ 16289943
```
We can see here that we had to load 62 pages, but that we also reused pages
we'd already read 758 times. We can also see that the offsets of the patches
(visible in `tree`) are always close: the objects often follow each
other.
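The page cache at work here can be sketched as follows (a toy version of the
idea behind Cachet, not its actual API): reads go through fixed-size pages, and
a page already in memory is a "hit" while one that must be loaded from disk is
a "miss":
```ocaml
let page_size = 4096

type cache =
  { load : int -> bytes            (* load page [n] from the file *)
  ; pages : (int, bytes) Hashtbl.t (* pages already read *)
  ; mutable hits : int
  ; mutable misses : int }

let make load = { load; pages = Hashtbl.create 0x10; hits = 0; misses = 0 }

(* Return page [n], loading it only on the first access. *)
let page cache n =
  match Hashtbl.find_opt cache.pages n with
  | Some p -> cache.hits <- cache.hits + 1; p
  | None ->
    cache.misses <- cache.misses + 1;
    let p = cache.load n in
    Hashtbl.add cache.pages n p; p

(* Read one byte at an absolute [offset] through the cache. *)
let get_byte cache offset =
  let p = page cache (offset / page_size) in
  Bytes.get p (offset mod page_size)
```
Since a patch's source is usually only a few bytes away, one loaded page often
serves several objects of the same chain, which is where the hits come from.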
### Mbox and real emails
Admittedly, the concrete cases we use here are my own emails. There may be a
fairly simple bias: all these emails have the same destination,
romain.calascibetta@gmail.com. This common point can also have a
significant impact on compression with `duff`. We will therefore try another
corpus, the archives of certain mailing lists relating to OCaml:
[lists.ocaml.org-archive](https://github.com/ocaml/lists.ocaml.org-archive)
```shell
$ blaze.mbox lists.ocaml.org-archive/pipermail/opam-devel.mbox/opam-devel.mbox \
-o opam-devel.pack
$ gzip -c lists.ocaml.org-archive/pipermail/opam-devel.mbox/opam-devel.mbox \
> opam-devel.mbox.gzip
$ du -sh opam-devel.pack opam-devel.mbox.gzip \
lists.ocaml.org-archive/pipermail/opam-devel.mbox/opam-devel.mbox
3.9M opam-devel.pack
2.0M opam-devel.mbox.gzip
10M lists.ocaml.org-archive/pipermail/opam-devel.mbox/opam-devel.mbox
```
The compression ratio is a bit worse than before, but we're still on to
something interesting. Here again we can take an object from our PACK file and
see how the compression between objects reacts:
```shell
$ blaze.pack index opam-devel.pack
...
09bbd28303c8aafafd996b56f9c071a3add7bd92 b 362504 271 10 60793428 412b1fbeb6ee4a05fe8587033c1a1d8ca2ef5b35
$ carton get opam-devel.idx 09bbd28303c8aafafd996b56f9c071a3add7bd92 \
--without-metadata --raw | wc -c
2098
```
Once again, we see a ratio of 10! Here the object corresponds to the header of
an email. This is compressed with other email headers. This is the situation
where the fields are common to all your emails (`From`, `Subject`, etc.).
Carton successfully patches headers together and email content together.
## Next things
All the work done on email archiving is aimed at producing a unikernel (`void`)
that could archive all incoming emails. Finally, this unikernel could send the
archive back (via an email!) to those who want it. This is one of our goals for
implementing a mailing list in OCaml with unikernels.
Another objective is to create a database system for emails in order to offer
two features to the user that we consider important:
- quick and easy access to emails
- save disk space through compression
With this system, we can extend the method of indexing emails with other
information such as the keywords found in the emails. This will enable us,
among other things, to create an email search engine!
## Conclusion
This milestone in our PTT project took quite a long time, as we were very
interested in metrics such as compression ratio and software execution speed.
The experience we'd gained with emails (in particular with Mr. MIME) enabled us
to move a little faster, especially in terms of serializing our emails. Our
experience with ocaml-git also enabled us to identify the benefits of the PACK
file for emails.
But the development of [Miou][miou] was particularly helpful in reaching
satisfactory program execution times, thanks to the ability to parallelise
certain computations quite easily.
The format is still a little rough and not quite ready for the development of a
keyword-based email indexing system, but it provides a good basis for the rest
of our project.
So, if you like what we're doing and want to help, you can make a donation via
[GitHub][donate-github] or using our [IBAN][donate-iban].
[patience-diff]: https://opensource.janestreet.com/patdiff/
[eugene-myers-diff]: https://www.nathaniel.ai/myers-diff/
[duff]: https://github.com/mirage/duff
[rabin]: https://en.wikipedia.org/wiki/Rabin_fingerprint
[rolling-hash]: https://en.wikipedia.org/wiki/Rolling_hash
[merkle-tree]: https://en.wikipedia.org/wiki/Merkle_tree
[encore]: https://github.com/mirage/encore
[mrmime]: https://github.com/mirage/mrmime
[afl]: https://afl-1.readthedocs.io/en/latest/fuzzing.html
[hamlet]: https://github.com/mirage/hamlet
[blaze]: https://github.com/dinosaure/blaze
[pop3]: https://en.wikipedia.org/wiki/Post_Office_Protocol
[tls]: https://github.com/mirleft/ocaml-tls
[mbox]: https://en.wikipedia.org/wiki/Mbox
[donate-github]: https://github.com/sponsors/robur-coop
[donate-iban]: https://robur.coop/Donate
[miou]: https://github.com/robur-coop/miou
[cachet]: https://github.com/robur-coop/cachet