From 086485e904c0689846f8b72c4808f5e64748f674 Mon Sep 17 00:00:00 2001
From: Calascibetta Romain
Date: Fri, 10 Jan 2025 18:27:19 +0100
Subject: [PATCH] Add draft about carton

---
 articles/2025-01-07-carton-and-cachet.md | 385 +++++++++++++++++++++++
 1 file changed, 385 insertions(+)
 create mode 100644 articles/2025-01-07-carton-and-cachet.md

diff --git a/articles/2025-01-07-carton-and-cachet.md b/articles/2025-01-07-carton-and-cachet.md
new file mode 100644
index 0000000..9b86086
--- /dev/null
+++ b/articles/2025-01-07-carton-and-cachet.md
@@ -0,0 +1,385 @@
---
date: 2024-10-29
title: Git, Carton, Mmap and emails
description: A way to store your emails
tags:
  - emails
  - storage
  - Git
author:
  name: Romain Calascibetta
  email: romain.calascibetta@gmail.com
  link: https://blog.osau.re/
breaks: false
---

We are pleased to announce the release of Carton 1.0.0 and Cachet. You can find
an overview of these libraries in our announcement on the OCaml forum. This
article goes into more detail about the PACK format and its use for archiving
your emails.

## Back to Git and patches

In our Carton announcement, we mentioned two levels of compression for Git
objects: zlib compression, and compression between objects using patches.

For example, if we have two blobs (two versions of a file), one containing "A"
and the other containing "A+B", the second blob will probably be stored as a
patch that requires the contents of the first blob and adds "+B". Given the way
we usually use Git, this second level of compression is very effective: most
changes merely add or remove a few keywords in the files of a project.

However, there is a gap between what Git actually does and what we perceive.
When we think about patches in the context of Git, we tend to think of the
[patience diff][patience-diff] or the [Eugene Myers diff][eugene-myers-diff].
While these offer the advantage of readability (it is easy to see what has been
added or deleted between two files), they are not necessarily optimal for
producing a _small_ patch.

In reality, when it comes to storing these patches and transmitting them over
the network, what interests us is not their readability but how precisely we
can identify what is common between two files and what is not. This is where
[duff][duff] comes in.

duff is a small library which can generate a patch between two files based on
the series of bytes common to both. We talk about "series of bytes" here
because these common elements are not necessarily human-readable. To find these
common series of bytes, we use [Rabin's fingerprint][rabin] algorithm:
[a rolling hash][rolling-hash] used since time immemorial (a toy version is
sketched below).

### Patches and emails

As far as emails are concerned, it is fairly obvious that many "words" are
common to all of them. The simple word `From:` should exist in every email you
have.

From this simple observation, we can see the potential: the headers of your
emails are all more or less similar and carry more or less the same content.
The idea of `duff`, applied to your emails, is to consider every other email as
a "slightly" different version of a first one:
1) we store a single email in its raw form,
2) and we build patches for all the other emails from this first one.
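
Before looking at a concrete example on emails, here is a toy sketch of such a
rolling hash. It only illustrates the principle (a polynomial hash over a
sliding window, where moving the window by one byte removes the outgoing byte
and adds the incoming one); it is not the algorithm actually implemented by
`duff`, and the parameters are arbitrary.

```ocaml
(* A toy polynomial rolling hash over a fixed-size window. Sliding the window
   by one byte only requires removing the outgoing byte and adding the
   incoming one, so every window of a file can be fingerprinted in one pass.
   Illustration only: duff uses its own parameters and representation. *)
let window = 16
let base = 257
let modulus = 1_000_000_007

(* weight of the byte leaving the window: base ^ (window - 1) mod modulus *)
let msb_weight =
  let rec go acc n = if n = 0 then acc else go (acc * base mod modulus) (n - 1) in
  go 1 (window - 1)

(* [fingerprints str] returns the (offset, hash) pair of every window of [str]. *)
let fingerprints (str : string) : (int * int) list =
  let len = String.length str in
  if len < window then []
  else begin
    let h = ref 0 in
    for i = 0 to window - 1 do
      h := (!h * base + Char.code str.[i]) mod modulus
    done;
    let acc = ref [ (0, !h) ] in
    for i = window to len - 1 do
      let outgoing = Char.code str.[i - window] in
      let incoming = Char.code str.[i] in
      h := (!h - (outgoing * msb_weight) mod modulus + modulus) mod modulus;
      h := (!h * base + incoming) mod modulus;
      acc := (i - window + 1, !h) :: !acc
    done;
    List.rev !acc
  end
```

Two files that share a long run of bytes also share the fingerprints of the
windows inside that run, which is what lets us decide which byte ranges can be
copied from a source rather than inserted into the patch.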
A concrete example of this compression through patches is a pair of
notification emails from GitHub: they are quite similar, particularly in their
headers. Even their contents are close: the HTML remains the same, only the
comment differs.

```shell
$ carton diff github.01.eml github.02.eml -o patch.diff
$ du -sb github.01.eml github.02.eml patch.diff
9239 github.01.eml
9288 github.02.eml
5136 patch.diff
```

This example shows that the patch rebuilding `github.02.eml` from
`github.01.eml` is almost 2 times smaller than the email itself. With the PACK
format, this patch is also compressed with zlib (which brings it down to ~2900
bytes, so 3 times smaller).

#### Compress and compress!

To put this into perspective, a compression algorithm like zlib can also reach
such a ratio (3 times smaller) on its own. But it also needs to serialise the
Huffman tree required for compression (in the general case). What we observe is
that concatenating separately compressed emails makes it difficult to maintain
such a ratio; in fact, compressing the concatenation of all the emails at once
gives a better ratio!

That is what the PACK file is all about: the aim is to be able to concatenate
these compressed emails and keep an interesting overall compression ratio. This
is the reason for the patches: they reduce the objects even further so that the
impact of the zlib _header_ on all our objects stays minimal and, above all, so
that we can access objects **without** having to decompress the previous ones
(as we would have to do for a `*.tar.gz` archive, for example).

The initial intuition about emails was right: they do share quite a few
elements, and in the end we were able to save ~4000 bytes on our GitHub
notification example.

## Isomorphism, DKIM and ARC

One property we wanted to pay close attention to throughout our experimentation
is "isomorphism". It is very simple: imagine a function that takes an email as
input and transforms it into another value using some method (such as
compression). Isomorphism ensures that we can "undo" this method and obtain
exactly the same email again:

```
 x == decode(encode(x))
```

This property is very important for emails because your emails carry signatures
and these signatures are computed from their content. If the content changes,
the signatures no longer match.

For instance, the DKIM signature allows you to sign an email and check its
integrity on receipt. ARC (which will be our next objective) also signs your
emails, but goes one step further: every relay that receives your email and
forwards it to its real destination must add a new ARC signature, much like
adding a new block to the Bitcoin blockchain.

So you need to make sure that the way you serialise your emails (into a PACK
file) does not alter their content, in order to keep these signatures valid! It
just so happens that here too we have a lot of experience with Git. Git has the
same constraint with [Merkle trees][merkle-tree] and, for our part, we have
developed a library that generates an encoder and a decoder from a single
description and respects the isomorphism property _by construction_: the
[encore][encore] library.
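
To give an idea of what this buys us, here is a minimal sketch of the
round-trip check that the property above describes, written with the two
conversions used later in this article (`Encore.to_angstrom` and
`Encore.to_lavoisier`). The `description` argument stands for any `Encore.t`
description, such as the email skeleton shown below; this is a sketch, not the
actual test code.

```ocaml
(* A minimal sketch of the "isomorphism by construction" property: from a
   single Encore description we derive both a decoder (an Angstrom parser) and
   an encoder, then check that decoding followed by re-encoding gives back
   exactly the same bytes. *)
let roundtrips (description : 'a Encore.t) (input : string) : bool =
  let parser = Encore.to_angstrom description in
  let emitter = Encore.to_lavoisier description in
  match Angstrom.parse_string ~consume:Angstrom.Consume.All parser input with
  | Error _ -> false
  | Ok value ->
    String.equal input (Encore.Lavoisier.emit_string ~chunk:0x7ff value emitter)
```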
We could then store our emails as they are in the PACK file. However, the
advantage of `duff` really comes into play when several objects are similar. In
the case of Git, tree objects resemble one another but they do not resemble
commits, for example. For emails there is a similar distinction: email headers
resemble one another, but they do not resemble email contents.

We can therefore try to "split" emails into two parts, the header on one side
and the content on the other. We would then have a third value telling us how
to reconstruct the complete email (i.e. where the header is and where the
content is).

However, after years of reading email RFCs, I can tell you that things are much
more complex than that. Above all, this experience allowed me to synthesise a
skeleton that all emails share:

```ocaml
(* multipart-body :=
     [preamble CRLF]
     --boundary transport-padding CRLF
     part
     ( CRLF --boundary transport-padding CRLF part )*
     CRLF
     --boundary-- transport-padding
     [CRLF epilogue]

   part := headers ( CRLF body )?
*)

(* the optional whitespace allowed after a boundary *)
type transport_padding = string

type 'octet body =
  | Multipart of 'octet multipart
  | Single of 'octet option
  | Message of 'octet t

and 'octet part = { headers : 'octet; body : 'octet body }

and 'octet multipart =
  { preamble : string
  ; epilogue : string * transport_padding
  ; boundary : string
  ; parts : (transport_padding * 'octet part) list }

and 'octet t = 'octet part
```

As you can see, the distinction is not only between the header and the content,
but also between the parts of an email as soon as it has an attachment. You can
also have an email inside an email (and I am always surprised to see how
_frequent_ this particular case is). Finally, there are the annoying _preamble_
and _epilogue_ of a multipart email, which are often empty but necessary: we
always have to ensure isomorphism, and even "useless" bytes count for the
signatures.

We therefore need to serialise this structure, and all we have to do is
transform a `string t` into a `SHA1.t t` so that the structure no longer
contains the actual contents of our emails but unique identifiers referring to
those contents, which will be available in our PACK file.

```ocaml
(* [Format.t] is the Encore description of the skeleton above. *)
module Format : sig
  val t : SHA1.t Encore.t
end

let decode str =
  let parser = Encore.to_angstrom Format.t in
  Angstrom.parse_string ~consume:Angstrom.Consume.All parser str

let encode t =
  let emitter = Encore.to_lavoisier Format.t in
  Encore.Lavoisier.emit_string ~chunk:0x7ff t emitter
```

However, we need to check that isomorphism is respected. Work on
[Mr. MIME][mrmime] has already been done on this subject with the [afl][afl]
fuzzer: checking our assertion `x == decode(encode(x))`. So we went one step
further and generated random emails from this fuzzer! This allows me to
reintroduce the [hamlet][hamlet] project, perhaps the biggest database of valid
(but incomprehensible) emails. Mr. MIME does more than the encoding/decoding we
want to do here: it also parses email addresses, dates, and so on. Here, the
aim is mainly to split our emails. We therefore took the time to check the
isomorphism on our `hamlet` database: among these 1M emails, not one was
altered!
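
As a recap of this section, here is a rough sketch of the `string t` to
`SHA1.t t` transformation mentioned above: the skeleton keeps its shape, but
the bytes of each part are replaced by the identifier under which those bytes
are stored in the PACK file. The `digest` function is an assumption (for
instance `Digestif.SHA1.digest_string`); this is an illustration, not the
actual code.

```ocaml
(* Turn a [string t] (a skeleton carrying the raw bytes of each part) into a
   [SHA1.t t] (the same skeleton carrying only the identifiers of contents
   stored separately in the PACK file). *)
let rec map_part ~digest { headers; body } =
  { headers = digest headers; body = map_body ~digest body }

and map_body ~digest = function
  | Single None -> Single None
  | Single (Some octets) -> Single (Some (digest octets))
  | Message t -> Message (map_part ~digest t)
  | Multipart { preamble; epilogue; boundary; parts } ->
    let parts =
      List.map (fun (padding, part) -> (padding, map_part ~digest part)) parts
    in
    Multipart { preamble; epilogue; boundary; parts }

(* [digest] is assumed to hash a string into a [SHA1.t]. *)
let split ~digest (email : string t) : SHA1.t t = map_part ~digest email
```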
## Carton, POP3 & mbox, some metrics

We can therefore split an email into several parts and compute an optimal patch
between two similar pieces of content. So now you can start packaging!

This is where I would like to reintroduce a tool that has not been released
yet, but which allows me to go even further with emails: [blaze][blaze].

This little tool is my _Swiss army knife_ for emails! And it is in this tool
that we are going to have fun deriving Carton so that it manipulates emails
rather than Git objects. We have therefore implemented the very basic
[POP3][pop3] protocol (thanks to [ocaml-tls][tls] for offering a free encrypted
connection) as well as the [mbox][mbox] format.

Neither of them is recommended. The first is an old protocol, and interacting
with Gmail through it, for example, is very slow. The second is an old,
non-standardised format for storing your emails, and unfortunately it may be
the format used by your email client. After resolving a few bugs, such as the
unspecified behaviour of pop.gmail.com and the mix of CRLF and LF in the mbox
format, you end up with lots of emails that you can have fun packaging!

```shell
$ mkdir mailbox
$ blaze.fetch pop3://pop.gmail.com -p $(cat password.txt) \
  -u recent:romain.calascibetta@gmail.com -f 'mailbox/%s.eml' > mails.lst
$ blaze.pack make -o mailbox.pack mails.lst
$ tar czf mailbox.tar.gz mailbox
$ du -sh mailbox mailbox.pack mailbox.tar.gz
97M mailbox
28M mailbox.pack
23M mailbox.tar.gz
```

In this example, we download the emails received over the last 30 days via POP3
and store them in the `mailbox/` folder. This folder weighs 97M and, if we
compress it with `tar` and gzip, we end up with 23M. The problem is that we
need to decompress `mailbox.tar.gz` entirely to extract a single email.

This is where the PACK file comes in handy: it weighs only 28M (so we are very
close to what `tar` and gzip can do) but we can rebuild our emails without
unpacking everything:

```shell
$ blaze.pack index mailbox.pack
$ blaze.pack list mailbox.pack | head -n1
0000000c 4e9795e268313245f493d9cef1b5ccf30cc92c33
$ blaze.pack get mailbox.idx 4e9795e268313245f493d9cef1b5ccf30cc92c33
Delivered-To: romain.calascibetta@gmail.com
...
```

Like Git, we now associate a hash with each of our emails and can retrieve them
using this hash. Like Git, we also compute a `*.idx` file to associate a hash
with the position of the email in our PACK file. And just like Git (with
`git show` or `git cat-file`), we can now access our emails very quickly. In
other words, we have a (read-only) database system for our emails: we can
archive them!
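
To explain what this `*.idx` file gives us, here is a minimal model of the
idea: conceptually, the index is a table of (hash, offset) pairs sorted by
hash, so finding an email is a binary search followed by a single read in the
PACK file. The real format also contains a fanout table to narrow the search,
plus checksums; the types and names below are illustrative, not the Carton API.

```ocaml
(* A minimal model of an *.idx file: a table of (uid, offset) pairs sorted by
   uid. Looking up an object is a binary search on the uid, which yields the
   offset at which the object (or its patch) starts in the PACK file. *)
type entry =
  { uid : string      (* the object hash, e.g. a 20-byte SHA-1 *)
  ; offset : int64 }  (* where the object starts in the PACK file *)

let find (index : entry array) (uid : string) : int64 option =
  let rec go lo hi =
    if lo >= hi then None
    else
      let mid = (lo + hi) / 2 in
      let cmp = String.compare uid index.(mid).uid in
      if cmp = 0 then Some index.(mid).offset
      else if cmp < 0 then go lo mid
      else go (mid + 1) hi
  in
  go 0 (Array.length index)
```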
Let's have a closer look at this PACK file. We have developed a tool more or
less similar to `git verify-pack` which lists all the objects in our PACK file
and, above all, gives us information such as the number of bytes needed to
store each of them:

```shell
$ blaze.pack verify mailbox.pack
4e9795e268313245f493d9cef1b5ccf30cc92c33 a 12 186 6257b7d4
...
517ccbc063d27dbd87122380c9cdaaadc9c4a1d8 b 666027 223 10 e8e534a6 cedfaf6dc22f3875ae9d4046ea2a51b9d5c6597a
```

Each line shows the hash of an object, its type (`a` for the structure of an
email, `b` for a content), its position in the PACK file, the number of bytes
used to store it (!), then the depth of the patch when the object is patched,
the checksum, and finally the source needed to rebuild the object.

Here, our first object is not patched, but the second one is. Note that it only
needs 223 bytes in the PACK file. But what is the real size of this object?

```shell
$ carton get mailbox.idx 517ccbc063d27dbd87122380c9cdaaadc9c4a1d8 \
  --raw --without-metadata | wc -c
2014
```

So we have gone from 2014 bytes down to 223 bytes, almost a compression ratio
of 10! In this case, the object is the content of an email. Guess which one? A
GitHub notification! If we go back to our very first example, we saw that we
could compress with a ratio close to 2; zlib takes us further, since the patch
itself is compressed too. This example also allows us to introduce one last
feature of PACK files: the depth.

```shell
$ carton get mailbox.idx 517ccbc063d27dbd87122380c9cdaaadc9c4a1d8
kind: b
length: 2014 byte(s)
depth: 10
cache misses: 586
cache hits: 0
tree: 000026ab
      Δ 00007f78
      ...
      Δ 0009ef74
      Δ 000a29ab
...
```

In our example, our object requires a source which is, in turn, a patch
requiring another source, and so on (you can see this chain in the `tree`
field). The length of this patch chain is the depth of our object. There is
therefore a succession of patches between objects. What Carton tries to do is
to find the best patch from a window of candidates and to keep the best
candidates for reuse. If we unroll this chain of patches, we find a "base"
object (at `0x000026ab`) that is simply compressed with zlib, and it turns out
to be the content of another GitHub notification email: this confirms that our
Rabin's fingerprint algorithm works very well.

### Mbox and real emails

In a way, the concrete corpus used here is my own mailbox. This introduces a
fairly simple bias: all these emails have the same destination,
romain.calascibetta@gmail.com, which is a shared element that can also have a
significant impact on compression with `duff`. So let's try another corpus, the
archives of some mailing lists related to OCaml:
[lists.ocaml.org-archive](https://github.com/ocaml/lists.ocaml.org-archive)

```shell
$ blaze.mbox lists.ocaml.org-archive/pipermail/opam-devel.mbox/opam-devel.mbox \
  -o opam-devel.pack
$ gzip -c lists.ocaml.org-archive/pipermail/opam-devel.mbox/opam-devel.mbox \
  > opam-devel.mbox.gzip
$ du -sh opam-devel.pack opam-devel.mbox.gzip \
  lists.ocaml.org-archive/pipermail/opam-devel.mbox/opam-devel.mbox
3.9M opam-devel.pack
2.0M opam-devel.mbox.gzip
10M lists.ocaml.org-archive/pipermail/opam-devel.mbox/opam-devel.mbox
```

The compression ratio is a bit worse than before, but we are still onto
something interesting. Here again, we can pick an object from our PACK file and
see how the compression between objects behaves:

```shell
$ blaze.pack index opam-devel.pack
...
09bbd28303c8aafafd996b56f9c071a3add7bd92 b 362504 271 10 60793428 412b1fbeb6ee4a05fe8587033c1a1d8ca2ef5b35
$ carton get opam-devel.idx 09bbd28303c8aafafd996b56f9c071a3add7bd92 \
  --without-metadata --raw | wc -c
2098
```

Once again, the object is stored in a fraction of its real size: 271 bytes for
2098 bytes of content. This time the object corresponds to the header of an
email, patched against other email headers. This is exactly the situation where
fields are common to all your emails (`From`, `Subject`, etc.).
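
Before moving on, here is a rough sketch of what reading such a patched object
involves, which is where the depth and the cache misses reported above come
from: we walk up the chain to a base object and then re-apply each patch on the
way back down. The `packed` type and the `read_at`/`apply` helpers are
assumptions for the sake of illustration, not the Carton API.

```ocaml
(* An illustration of how an object at depth [n] is materialised: walk the
   patch chain up to its base object (which is only zlib-compressed), then
   re-apply each patch on the way back down. Caching already-reconstructed
   sources is what turns "cache misses" into "cache hits". *)
type packed =
  | Base of string                                (* inflated payload *)
  | Patch of { source : int64; patch : string }   (* offset of its source + the patch *)

let rec reconstruct
    ~(read_at : int64 -> packed)
    ~(apply : source:string -> string -> string)
    (offset : int64) : string =
  match read_at offset with
  | Base payload -> payload
  | Patch { source; patch } ->
      let source = reconstruct ~read_at ~apply source in
      apply ~source patch
```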
## Next things

All the work done on email archiving is aimed at producing a unikernel
(`void`) that would archive all incoming emails. This unikernel could then send
the archive back (via an email!) to whoever asks for it. This is one of our
goals in implementing a mailing list in OCaml with unikernels.

Another objective is to build a database system for emails that offers two
features we consider important for users:
- quick and easy access to the emails
- disk space savings through compression

With such a system, we can extend the email index with other information, such
as the keywords found in the emails. This will enable us, among other things,
to build an email search engine!

## Conclusion