
---
date: 2025-01-07
title: Git, Carton and emails
description: A way to store and archive your emails
tags:
  - emails
  - storage
  - Git
author:
  name: Romain Calascibetta
  email: romain.calascibetta@gmail.com
  link: https://blog.osau.re/
breaks: false
---

We are pleased to announce the release of Carton 1.0.0 and Cachet. You can have an overview of these libraries in our announcement on the OCaml forum. This article goes into more detail about the PACK format and its use for archiving your emails.

## Back to Git and patches

In our Carton announcement, we talked about two levels of compression for Git objects: zlib compression and compression between objects using a patch.

For instance, if we have two blobs (two versions of a file), one of which contains 'A' and the other 'A+B', the second blob will probably be saved in the form of a patch requiring the contents of the first blob and adding '+B'. At a higher level, and given how we use Git, we can see why this second level of compression is so interesting: we generally just add a few lines (such as introducing a new function) or delete some (removing code) in the files of our project.
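
The intuition behind such a patch can be sketched as a small copy/insert interpreter. This is a hypothetical model of the idea (it mirrors the scheme Git deltas use, not duff's actual API):

```ocaml
(* A patch is a sequence of instructions that either copy a range from
   the source object or insert literal bytes carried by the patch. *)
type instr =
  | Copy of int * int (* offset and length into the source *)
  | Insert of string  (* literal bytes carried by the patch itself *)

let apply ~source patch =
  let buf = Buffer.create (String.length source) in
  List.iter
    (function
      | Copy (off, len) -> Buffer.add_substring buf source off len
      | Insert lit -> Buffer.add_string buf lit)
    patch;
  Buffer.contents buf

(* Rebuilding "A+B" from "A": copy the whole source, then insert "+B". *)
let () = assert (apply ~source:"A" [ Copy (0, 1); Insert "+B" ] = "A+B")
```

The patch only has to carry the literal bytes ('+B') and a few small copy instructions, which is why it can be much smaller than the object it rebuilds.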

However, there is a gap between what Git does and what we perceive. When it comes to patching in Git, we tend to think of the patience diff or the Eugene Myers diff. While these offer the advantage of readability, in the sense of knowing what has been added or deleted between two files, they are not necessarily optimal for producing a small patch.

In reality, what interests us in storing and transmitting these patches over the network is not their readability but how precisely we can identify what is common between two files and what is not. It is at this stage that duff comes in.

This is a small library which can generate a patch between two files based on the series of bytes common to both. We're talking about 'series of bytes' here because the elements common to our two files are not necessarily human-readable. To find these common series of bytes, we use Rabin's fingerprint algorithm: a rolling hash used since time immemorial.
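
The rolling-hash idea can be sketched in a few lines. This is a minimal illustration with toy parameters (an arbitrary base and modulus), not Rabin's actual polynomial arithmetic:

```ocaml
let base = 257
let modulus = 1_000_003

(* Hash the [width]-byte window of [s] starting at [off]. *)
let hash_window s off width =
  let h = ref 0 in
  for i = off to off + width - 1 do
    h := (!h * base + Char.code s.[i]) mod modulus
  done;
  !h

(* Slide the window at [off] one byte to the right in O(1): drop the
   leftmost byte, append the next one. *)
let slide h s off width =
  let pow = ref 1 in
  for _ = 1 to width - 1 do
    pow := !pow * base mod modulus
  done;
  let h = (h - (Char.code s.[off] * !pow) mod modulus + modulus) mod modulus in
  (h * base + Char.code s.[off + width]) mod modulus
```

The point of the rolling hash is the `slide` function: updating the fingerprint costs a constant amount of work per byte, so we can fingerprint every window of a file in linear time and quickly spot series of bytes shared by two files.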

## Patches and emails

So, as far as emails are concerned, it's fairly obvious that there are many "words" common to all your emails. The simple word `From:` should exist in all of them.

From this simple idea, we can understand the impact: the headers of your emails are more or less similar and have more or less the same content. The idea of duff, applied to your emails, is to consider every other email as a "slightly" different version of your first one:

  1. we can store a single raw email
  2. and we build patches of all your other emails from this first one

A fairly concrete example of this compression through patches is a pair of notification emails from GitHub: they are quite similar, particularly in the header. Even the content is just as similar: the HTML remains the same, only the comment differs.

```shell
$ carton diff github.01.eml github.02.eml -o patch.diff
$ du -sb github.01.eml github.02.eml patch.diff
9239	github.01.eml
9288	github.02.eml
5136	patch.diff
```

This example shows that our patch for rebuilding github.02.eml from github.01.eml is almost 2 times smaller than the email itself. In the PACK format, this patch will also be compressed with zlib (and we can reach ~2900 bytes, so 3 times smaller).

## Compress and compress!

To put this into perspective, a compression algorithm like zlib can also achieve such a ratio (3 times smaller). But it also needs to serialise the Huffman tree required for compression (in the general case). What we observe is that concatenating separately compressed emails makes it difficult to maintain such a ratio. In fact, concatenating all the emails first and then compressing the whole gives us a better ratio!

That's what the PACK file is all about: the aim is to be able to concatenate these compressed emails while keeping an interesting overall compression ratio. This is the reason for the patch: to shrink the objects even further so that the impact of the zlib header on all our objects is minimal and, above all, so that we can access objects without having to decompress the previous ones (as we would have to do for a *.tar.gz archive, for example).

The initial intuition about the emails was right: they do indeed share quite a few elements, and in the end we were able to save ~4000 bytes in our GitHub notification example.

## Isomorphism, DKIM and ARC

One property that we wanted to pay close attention to throughout our experimentation was "isomorphism". It is very simple: imagine a function that takes an email as input and transforms it into another value using some method (such as compression). Isomorphism ensures that we can 'undo' this method and obtain exactly the same input again:

```
decode(encode(x)) == x
```

This property is very important for emails because your email contains signatures, and these signatures are derived from its content. If the email changes, the signatures change too.
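
The property itself is easy to state as code. Here is a toy, netstring-style length-prefixed encoding used only to illustrate the check (the real serialisation is the PACK format, not this):

```ocaml
(* A toy encoding: prefix the payload with its length. *)
let encode s = Printf.sprintf "%d:%s" (String.length s) s

(* Decoding reads the length, then extracts exactly that many bytes. *)
let decode s =
  let i = String.index s ':' in
  let len = int_of_string (String.sub s 0 i) in
  String.sub s (i + 1) len

(* The isomorphism property: decoding an encoded email yields the exact
   original bytes, so any signature over those bytes stays valid. *)
let roundtrip x = decode (encode x) = x

let () = assert (roundtrip "From: a@x\r\nSubject: hi\r\n\r\nHello\r\n")
```

Whatever the real encoder/decoder pair is, this `roundtrip` predicate must hold for every input, byte for byte.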

For instance, the DKIM signature allows you to sign an email and check its integrity on receipt. ARC (which will be our next objective) also signs your emails, but goes one step further: all the relays that receive your email and send it back to the real destination must add a new ARC signature, just like adding a new block to the Bitcoin blockchain.

So you need to make sure that the way you serialise your email (into a PACK file) doesn't alter the content, in order to keep these signatures valid! It just so happens that here, too, we have a lot of experience with Git, which has the same constraint with its Merkle tree. For our part, we've developed a library that generates an encoder and a decoder from a single description and that respects the isomorphism property by construction: the encore library.

We could then store our emails as they are in the PACK file. However, the advantage of duff really comes into play when several objects are similar. In the case of Git, tree objects are similar to each other but not to commits, for example. For emails, there is a similar distinction: email headers are similar to each other but not to email contents.

You can therefore try to "split" emails into two parts, the header on one side and the content on the other. We would then have a third value telling us how to reconstruct the complete email (i.e. identifying where the header is and where the content is).

However, after years of reading email RFCs, I can tell you that things are much more complex. Above all, this experience has enabled me to synthesise a skeleton that all emails share:

```ocaml
(* multipart-body :=
      [preamble CRLF]
      --boundary transport-padding CRLF
      part
      ( CRLF --boundary transport-padding CRLF part )*
      CRLF
      --boundary-- transport-padding
      [CRLF epilogue]

   part := headers ( CRLF body )?
*)

type transport_padding = string (* the linear whitespace after a boundary *)

type 'octet body =
  | Multipart of 'octet multipart
  | Single of 'octet option
  | Message of 'octet t

and 'octet part = { headers : 'octet; body : 'octet body }

and 'octet multipart =
  { preamble : string
  ; epilogue : string * transport_padding
  ; boundary : string
  ; parts : (transport_padding * 'octet part) list }

and 'octet t = 'octet part
```
As you can see, the distinction is not only between the header and the content but also between the parts of an email as soon as it has an attachment. You can also have an email inside an email (and I'm always surprised to see how frequent this particular case is). Finally, there's the annoying preamble and epilogue of a multipart email, which are often empty but necessary: you always have to ensure isomorphism, because even "useless" bytes count for signatures.

We'll therefore need to serialise this structure, and all we have to do is transform a `string t` into a `SHA1.t t` so that our structure no longer contains the actual content of our emails but a unique identifier referring to this content, which will be available in our PACK file.

```ocaml
module Format : sig
  val t : SHA1.t Encore.t
end

let decode str =
  let parser = Encore.to_angstrom Format.t in
  Angstrom.parse_string ~consume:All parser str

let encode t =
  let emitter = Encore.to_lavoisier Format.t in
  Encore.Lavoisier.emit_string ~chunk:0x7ff t emitter
```

However, we need to check that the isomorphism is respected. You should be aware that work on Mr. MIME has already been done on this subject with the afl fuzzer: checking our assertion x == decode(encode(x)). This ability to check isomorphism using afl has also enabled us to use it to generate valid random emails. This allows me to reintroduce the hamlet project, perhaps the biggest database of valid (but incomprehensible) emails. So we've checked that our encoder/decoder for "splitting" our emails respects isomorphism on this corpus of a million emails.

## Carton, POP3 & mbox, some metrics

We can therefore split an email into several parts and calculate an optimal patch between two similar pieces of content. So now you can start packaging! This is where I'm going to reintroduce a tool that hasn't been released yet, but which allows me to go even further with emails: blaze.

This little tool is my Swiss army knife for emails! And it's in this tool that we're going to have fun deriving Carton so that it can manipulate emails rather than Git objects. So we've implemented the very basic POP3 protocol (thanks to ocaml-tls for giving us an encrypted connection for free) as well as the mbox format.

Neither is recommended. The first is an old protocol, and interacting with Gmail, for example, is very slow. The second is an old, non-standardised format for storing your emails, and unfortunately it may be the format used by your email client. After resolving a few bugs, such as the unspecified behaviour of pop.gmail.com and the mix of CRLF and LF in the mbox format, you'll end up with lots of emails that you'll have fun packaging!

```shell
$ mkdir mailbox
$ blaze.fetch pop3://pop.gmail.com -p $(cat password.txt) \
  -u recent:romain.calascibetta@gmail.com -f 'mailbox/%s.eml' > mails.lst
$ blaze.pack make -o mailbox.pack mails.lst
$ tar czf mailbox.tar.gz mailbox
$ du -sh mailbox mailbox.pack mailbox.tar.gz
97M     mailbox
28M     mailbox.pack
23M     mailbox.tar.gz
```

In this example, we download the emails received over the last 30 days via POP3 and store them in the mailbox/ folder. This folder weighs 97M, and if we compress it with tar and gzip, we end up with 23M. The problem is that we'd need to decompress the mailbox.tar.gz archive to extract any email.

This is where the PACK file comes in handy: it only weighs 28M (so we're very close to what tar and gzip can do) but we can rebuild our emails without unpacking everything:

```shell
$ blaze.pack index mailbox.pack
$ blaze.pack list mailbox.pack | head -n1
0000000c 4e9795e268313245f493d9cef1b5ccf30cc92c33
$ blaze.pack get mailbox.idx 4e9795e268313245f493d9cef1b5ccf30cc92c33
Delivered-To: romain.calascibetta@gmail.com
...
```

Like Git, we now associate a hash with each email and can retrieve it using this hash. Like Git, we also compute the *.idx file to associate a hash with the position of the email in our PACK file. Just like Git (with git show or git cat-file), we can now access our emails very quickly. So we now have a (read-only) database system for our emails: we can archive them!
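
Conceptually, an *.idx file is a sorted table from hashes to offsets, so a lookup is just a binary search. Here is a hypothetical in-memory model of that lookup (the hashes and offsets below are illustrative, not the real file layout):

```ocaml
(* [entries] must be sorted by hash; [lookup] returns the offset of the
   object in the PACK file, if the hash is present. *)
let lookup (entries : (string * int) array) hash =
  let rec go lo hi =
    if lo >= hi then None
    else
      let mid = (lo + hi) / 2 in
      let h, off = entries.(mid) in
      if hash = h then Some off
      else if hash < h then go lo mid
      else go (mid + 1) hi
  in
  go 0 (Array.length entries)

let () =
  (* Hypothetical index: abbreviated hashes mapped to PACK offsets. *)
  let idx = [| ("4e97", 12); ("517c", 666027); ("e8e5", 901200) |] in
  assert (lookup idx "517c" = Some 666027);
  assert (lookup idx "ffff" = None)
```

This is why access is fast: finding an object among millions costs a handful of comparisons, then a single seek into the PACK file.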

Let's have a closer look at this PACK file. We've developed a tool more or less similar to git verify-pack which lists all the objects in our PACK file and, above all, gives us information such as the number of bytes needed to store these objects:

```shell
$ blaze.pack verify mailbox.pack
4e9795e268313245f493d9cef1b5ccf30cc92c33 a       12     186    6257b7d4
...
517ccbc063d27dbd87122380c9cdaaadc9c4a1d8 b   666027     223 10 e8e534a6 cedfaf6dc22f3875ae9d4046ea2a51b9d5c6597a
```

It shows the hash of our object, its type (`a` for the structure of our email, `b` for the content), its position in the PACK file, the number of bytes used to store the object (!), then, for patched objects, the depth of the patch, followed by the checksum and the source needed to rebuild the object.

Here, our first object is not patched, but the next object is. Note that it only needs 223 bytes in the PACK file. But what is the real size of this object?

```shell
$ carton get mailbox.idx 517ccbc063d27dbd87122380c9cdaaadc9c4a1d8 \
  --raw --without-metadata | wc -c
2014
```

So we've gone from 2014 bytes to 223 bytes: a compression ratio of almost 10! In this case, the object is the content of an email. Guess which one? A GitHub notification! In our very first example, we saw that we could compress with a ratio close to 2, and go further with zlib by compressing the patch too. This example lets us introduce one last feature of PACK files: the depth.

```shell
$ carton get mailbox.idx 517ccbc063d27dbd87122380c9cdaaadc9c4a1d8
kind:         b
length:       2014 byte(s)
depth:        10
cache misses: 586
cache hits:   0
tree:           000026ab
              Δ 00007f78
              ...
              Δ 0009ef74
              Δ 000a29ab
...
```

In our example, our object requires a source which, in turn, is a patch requiring another source, and so on (you can see this chain in the tree). The length of this patch chain corresponds to the depth of our object. There is therefore a succession of patches between objects. What Carton tries to do is find the best patch from a window of candidates and keep the best ones for reuse. If we unroll this chain of patches, we find a "base" object (at 0x000026ab) that is simply zlib-compressed, and it, too, is the content of a GitHub notification email. This shows that Carton does a good job of finding the best candidate for a patch: similar content, in this case another GitHub notification.
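
The chain can be modelled in a few lines. This is a hypothetical sketch of how a delta chain relates to depth and reconstruction, not Carton's actual representation:

```ocaml
(* An object is either a base (stored zlib-compressed in the PACK file)
   or a patch over another object; its depth is the number of patches
   that must be unrolled to rebuild it. *)
type obj =
  | Base of string
  | Delta of obj * (string -> string)

let rec rebuild = function
  | Base s -> s
  | Delta (src, patch) -> patch (rebuild src)

let rec depth = function
  | Base _ -> 0
  | Delta (src, _) -> 1 + depth src

(* A chain of depth 2: each object is a patch over the previous one. *)
let chain =
  Delta (Delta (Base "notification #1", fun s -> s ^ " #2"), fun s -> s ^ " #3")

let () =
  assert (depth chain = 2);
  assert (rebuild chain = "notification #1 #2 #3")
```

This also explains the cache misses reported above: rebuilding a deep object means rebuilding every source on the way down to the base, unless those intermediate results are already cached.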

## Mbox and real emails

In a way, the concrete cases used here are my own emails. There may be a fairly simple bias: all these emails have the same destination, romain.calascibetta@gmail.com. This common point can also have a significant impact on compression with duff. We will therefore try another corpus, the archives of some mailing lists relating to OCaml: lists.ocaml.org-archive

```shell
$ blaze.mbox lists.ocaml.org-archive/pipermail/opam-devel.mbox/opam-devel.mbox \
  -o opam-devel.pack
$ gzip -c lists.ocaml.org-archive/pipermail/opam-devel.mbox/opam-devel.mbox \
  > opam-devel.mbox.gzip
$ du -sh opam-devel.pack opam-devel.mbox.gzip \
  lists.ocaml.org-archive/pipermail/opam-devel.mbox/opam-devel.mbox
3.9M    opam-devel.pack
2.0M    opam-devel.mbox.gzip
10M     lists.ocaml.org-archive/pipermail/opam-devel.mbox/opam-devel.mbox
```

The compression ratio is a bit worse than before, but we're still onto something interesting. Here again, we can take an object from our PACK file and see how compression between objects behaves:

```shell
$ blaze.pack index opam-devel.pack
...
09bbd28303c8aafafd996b56f9c071a3add7bd92 b  362504   271 10 60793428 412b1fbeb6ee4a05fe8587033c1a1d8ca2ef5b35
$ carton get opam-devel.idx 09bbd28303c8aafafd996b56f9c071a3add7bd92 \
  --without-metadata --raw | wc -c
2098
```

Once again, we see an excellent ratio (2098 bytes stored in 271)! Here the object corresponds to the header of an email, patched against other email headers. This is the situation where fields are common to all your emails (From, Subject, etc.). Carton successfully patches headers together and email contents together.

## Next things

All the work done on email archiving is aimed at producing a unikernel (void) that could archive all incoming emails. This unikernel could then send the archive back (via an email!) to anyone who wants it. This is one of our goals for implementing a mailing list in OCaml with unikernels.

Another objective is to create a database system for emails in order to offer two features to the user that we consider important:

- quick and easy access to emails
- save disk space through compression

With this system, we can extend the method of indexing emails with other information such as the keywords found in the emails. This will enable us, among other things, to create an email search engine!

## Conclusion

This milestone in our PTT project took quite a long time, as we were very interested in metrics such as compression ratio and execution speed.

The experience we'd gained with emails (in particular with Mr. MIME) enabled us to move a little faster, especially in terms of serializing our emails. Our experience with ocaml-git also enabled us to identify the benefits of the PACK file for emails.

But the development of Miou was particularly helpful with regard to program execution time, thanks to the ability to parallelise certain computations quite easily.

The format is still a little rough and not quite ready for the development of a keyword-based e-mail indexing system, but it provides a good basis for the rest of our project.

So, if you like what we're doing and want to help, you can make a donation via GitHub or using our IBAN.