# Git, Carton and emails
We are pleased to announce the release of Carton 1.0.0 and Cachet. You can find an overview of these libraries in our announcement on the OCaml forum. This article goes into more detail about the PACK format and its use for archiving your emails.
## Back to Git and patches
In our Carton announcement, we talked about two levels of compression for Git objects: zlib compression, and compression between objects using a patch.
For instance, if we have two blobs (two versions of a file), one containing 'A' and the other containing 'A+B', the second blob will probably be saved as a patch requiring the contents of the first blob and adding '+B'. At a higher level, and given how we use Git, we can see why this second level of compression is so interesting: in our project files, we generally just add a few lines (introducing a new function) or delete some (removing code).
However, there is a gap between what Git does and what we perceive. When it comes to patching in Git, we tend to think of the patience diff or the Eugene Myers diff. While these offer the advantage of readability, in terms of knowing what has been added or deleted between two files, they are not necessarily optimal for producing a small patch.
In reality, what interests us for storing these patches and transmitting them over the network is not their readability, but optimality in deciding what can be considered common between two files and what cannot. This is where `duff` comes in.
This is a small library which can generate a patch between two files according to the series of bytes common to both. We're talking about 'series of bytes' here because the elements common to our two files are not necessarily human-readable. To find these common series of bytes, we use Rabin's fingerprint algorithm: a rolling hash used since time immemorial.
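To give a feel for how such fingerprints work, here is a minimal sketch of a polynomial rolling hash in OCaml. The window size, base and modulus are illustrative choices for this sketch, not `duff`'s actual parameters:

```ocaml
(* A minimal polynomial rolling hash, in the spirit of the Rabin
   fingerprints used to find series of bytes common to two files.
   [base] and [modulus] are illustrative, not duff's parameters. *)
let base = 257
let modulus = 1_000_000_007

(* Fingerprint every [w]-byte window of [s]: each window's hash is
   derived from the previous one in O(1), instead of rehashing the
   whole window. Returns (position, fingerprint) pairs. *)
let rolling_hashes ~w s =
  let n = String.length s in
  if n < w then []
  else begin
    (* base^(w-1) mod modulus: weight of the byte leaving the window. *)
    let top = ref 1 in
    for _ = 1 to w - 1 do top := !top * base mod modulus done;
    let h = ref 0 in
    for i = 0 to w - 1 do
      h := (!h * base + Char.code s.[i]) mod modulus
    done;
    let acc = ref [ (0, !h) ] in
    for i = w to n - 1 do
      (* slide the window: drop s.[i-w], append s.[i] *)
      h := (!h - (Char.code s.[i - w] * !top mod modulus) + modulus) mod modulus;
      h := (!h * base + Char.code s.[i]) mod modulus;
      acc := (i - w + 1, !h) :: !acc
    done;
    List.rev !acc
  end
```

Two files that share a window of bytes produce the same fingerprint at the corresponding positions, which is how candidate common series are detected before being verified byte by byte.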
## Patches and emails
As far as emails are concerned, it's fairly obvious that many "words" are common to all of them. The simple word `From:` should exist in every one of your emails. From this simple idea, we can understand the impact: the headers of your emails are more or less similar and have more or less the same content. The idea of `duff`, applied to your emails, is to consider the other emails as "slightly" different versions of your first email:
- we can store a single raw email
- and build patches for all your other emails from this first one
A fairly concrete example of this compression through patches is two notification emails from GitHub: they are quite similar, particularly in the header. Even the content is just as similar: the HTML remains the same, only the comment differs.
```sh
$ carton diff github.01.eml github.02.eml -o patch.diff
$ du -sb github.01.eml github.02.eml patch.diff
9239 github.01.eml
9288 github.02.eml
5136 patch.diff
```
This example shows that our patch for rebuilding `github.02.eml` from `github.01.eml` is almost 2 times smaller. In this case, with the PACK format, this patch will also be compressed with zlib (and we can reach ~2900 bytes, so 3 times smaller).
## Compress and compress!
To put this into perspective, a compression algorithm like zlib can also reach such a ratio (3 times smaller). But zlib also needs to serialise the Huffman tree required for compression (in the general case). What can be observed is that concatenating separately compressed emails makes it difficult to maintain such a ratio. In fact, concatenating all the emails first and then compressing the whole gives us a better ratio!
That's what the PACK file is all about: the aim is to be able to concatenate these compressed emails and keep an interesting overall compression ratio. This is the reason for the patch: to reduce the objects even further, so that the impact of the zlib header on all our objects is minimal and, above all, so that we can access objects without having to decompress the previous ones (as we would have to do for a `*.tar.gz` archive, for example).
The initial intuition about emails was right: they do indeed share quite a few elements, and in the end we were able to save ~4000 bytes in our GitHub notification example.
## Isomorphism, DKIM and ARC
One attribute that we wanted to pay close attention to throughout our experimentation was "isomorphism". This property is very simple: imagine a function that takes an email as input and transforms it into another value using some method (such as compression). Isomorphism ensures that we can 'undo' this method and obtain exactly the same input again:

```
decode(encode(x)) == x
```
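To see why this is stricter than merely round-tripping the parsed value, here is a toy counter-example in OCaml, using the stdlib's integer parsing (nothing email-specific): decoding loses a "useless" leading zero, so re-encoding cannot restore the original bytes.

```ocaml
(* Isomorphism check: decoding then re-encoding must restore the input
   byte for byte. [decode] and [encode] are stand-ins for any codec. *)
let iso ~decode ~encode input =
  match decode input with
  | None -> true (* inputs the decoder rejects are out of scope *)
  | Some v -> String.equal (encode v) input

let () =
  (* the *value* round-trips in both cases... *)
  assert (iso ~decode:int_of_string_opt ~encode:string_of_int "42");
  (* ...but "007" decodes to 7, which re-encodes as "7": bytes lost *)
  assert (not (iso ~decode:int_of_string_opt ~encode:string_of_int "007"))
```

An email codec has to avoid exactly this kind of normalisation: every lost byte would change the content that signatures are computed over.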
This property is very important for emails because your email contains signatures, and these signatures are derived from the content of your email. If the email changes, these signatures change too.
For instance, the DKIM signature allows you to sign an email and check its integrity on receipt. ARC (which will be our next objective) also signs your emails, but goes one step further: every relay that receives your email and forwards it to the real destination must add a new ARC signature, just like adding a new block to the Bitcoin blockchain.
So you need to make sure that the way you serialise your email (in a PACK file) doesn't alter the content, in order to keep these signatures valid! It just so happens that here, too, we have a lot of experience with Git. Git has the same constraint with Merkle trees and, for our part, we've developed a library that can generate an encoder and a decoder from a single description, respecting the isomorphism property by construction: the `encore` library.
We could then store our emails as they are in the PACK file. However, the advantage of `duff` really comes into play when several objects are similar. In the case of Git, tree objects are similar to each other, but not to commits, for example. For emails, there is a similar distinction: email headers are similar to each other, but not to email contents.
You can therefore try to "split" emails into two parts: the header on one side and the content on the other. We would then have a third value telling us how to reconstruct the complete email (i.e. where the header is and where the content is).
However, after years of reading email RFCs, I can tell you that things are much more complex. Above all, this experience has enabled me to synthesise a skeleton that all emails share:
```ocaml
(* multipart-body :=
     [preamble CRLF]
     --boundary transport-padding CRLF
     part
     ( CRLF --boundary transport-padding CRLF part )*
     CRLF
     --boundary-- transport-padding
     [CRLF epilogue]

   part := headers ( CRLF body )? *)

(* The raw whitespace allowed after a boundary marker; it must be kept
   verbatim to preserve isomorphism. *)
type transport_padding = string

type 'octet body =
  | Multipart of 'octet multipart
  | Single of 'octet option
  | Message of 'octet t

and 'octet part = { headers : 'octet; body : 'octet body }

and 'octet multipart =
  { preamble : string
  ; epilogue : string * transport_padding
  ; boundary : string
  ; parts : (transport_padding * 'octet part) list }

and 'octet t = 'octet part
```
As you can see, the distinction is not only between the header and the content, but also between the parts of an email as soon as it has an attachment. You can also have an email inside an email (and I'm always surprised to see how frequent this particular case is). Finally, there are the annoying preamble and epilogue of a multipart email, which are often empty but necessary: you always have to ensure isomorphism, because even "useless" bytes count for the signatures.
We'll therefore need to serialise this structure, and all we have to do is transform a `string t` into a `SHA1.t t`, so that our structure no longer contains the actual content of our emails but a unique identifier referring to this content, which will be available in our PACK file.
```ocaml
module Format : sig
  val t : SHA1.t Encore.t
end

let decode str =
  let parser = Encore.to_angstrom Format.t in
  Angstrom.parse_string ~consume:All parser str

let encode t =
  let emitter = Encore.to_lavoisier Format.t in
  Encore.Lavoisier.emit_string ~chunk:0x7ff t emitter
```
However, we need to check that the isomorphism is respected. You should be aware that work has already been done on this subject for Mr. MIME with the afl fuzzer, checking our assertion `x == decode(encode(x))`. This ability to check isomorphism using afl has also enabled us to use it to generate valid random emails. This allows me to reintroduce the hamlet project, perhaps the biggest database of valid (but incomprehensible) emails. So we've checked that our encoder/decoder for "splitting" our emails respects isomorphism on this million emails.
## Carton, POP3 & mbox, some metrics
We can therefore split an email into several parts and calculate an optimal patch between two similar pieces of content. So now we can start packing! This is where I'm going to reintroduce a tool that hasn't been released yet, but which allows me to go even further with emails: blaze.
This little tool is my Swiss army knife for emails! And it's in this tool that we're going to have fun specialising Carton so that it manipulates emails rather than Git objects. So we've implemented the very basic POP3 protocol (thanks to ocaml-tls for offering an encrypted connection for free) as well as the mbox format.
Neither is really recommended. The first is an old protocol, and interacting with Gmail, for example, is very slow. The second is an old, non-standardised format for storing your emails, and unfortunately it may be the format used by your email client. After resolving a few bugs, such as the unspecified behaviour of pop.gmail.com and the mix of CRLF and LF in the mbox format, you'll end up with lots of emails that you can have fun packing!
```sh
$ mkdir mailbox
$ blaze.fetch pop3://pop.gmail.com -p $(cat password.txt) \
  -u recent:romain.calascibetta@gmail.com -f 'mailbox/%s.eml' > mails.lst
$ blaze.pack make -o mailbox.pack mails.lst
$ tar czf mailbox.tar.gz mailbox
$ du -sh mailbox mailbox.pack mailbox.tar.gz
97M mailbox
28M mailbox.pack
23M mailbox.tar.gz
```
In this example, we download the emails received over the last 30 days via POP3 and store them in the `mailbox/` folder. This folder weighs 97M and, if we compress it with gzip, we end up with 23M. The problem is that we need to decompress the `mailbox.tar.gz` archive to extract the emails. This is where the PACK file comes in handy: it weighs only 28M (so we're very close to what `tar` and `gzip` can do), but we can rebuild our emails without unpacking everything:
```sh
$ blaze.pack index mailbox.pack
$ blaze.pack list mailbox.pack | head -n1
0000000c 4e9795e268313245f493d9cef1b5ccf30cc92c33
$ blaze.pack get mailbox.idx 4e9795e268313245f493d9cef1b5ccf30cc92c33
Delivered-To: romain.calascibetta@gmail.com
...
```
Like Git, we now associate a hash with each of our emails and can retrieve them using this hash. Like Git, we also compute the `*.idx` file to associate each hash with the position of the email in our PACK file. Just like Git (with `git show` or `git cat-file`), we can now access our emails very quickly. So we now have a (read-only) database system for our emails: we can archive them!
Let's have a closer look at this PACK file. We've developed a tool, more or less similar to `git verify-pack`, which lists all the objects in our PACK file and, above all, gives us information such as the number of bytes needed to store these objects:
```sh
$ blaze.pack verify mailbox.pack
4e9795e268313245f493d9cef1b5ccf30cc92c33 a 12 186 6257b7d4
...
517ccbc063d27dbd87122380c9cdaaadc9c4a1d8 b 666027 223 10 e8e534a6 cedfaf6dc22f3875ae9d4046ea2a51b9d5c6597a
```
It shows the hash of our object, its type (`a` for the structure of our email, `b` for a piece of content), its position in the PACK file, the number of bytes used to store the object (!) and finally, for patched objects, the depth of the patch, the checksum, and the source needed to rebuild the object.

Here, our first object is not patched, but the second one is. Note that it only needs 223 bytes in the PACK file. But what is the real size of this object?
```sh
$ carton get mailbox.idx 517ccbc063d27dbd87122380c9cdaaadc9c4a1d8 \
  --raw --without-metadata | wc -c
2014
```
So we've gone from 2014 bytes to 223 bytes: almost a compression ratio of 10! In this case, the object is the content of an email. Guess which one? A GitHub notification! If we go back to our very first example, we saw that we could compress with a ratio close to 2, and that we could go further with zlib by compressing the patch too. This example allows us to introduce one last feature of PACK files: the depth.
```sh
$ carton get mailbox.idx 517ccbc063d27dbd87122380c9cdaaadc9c4a1d8
kind: b
length: 2014 byte(s)
depth: 10
cache misses: 586
cache hits: 0
tree: 000026ab
  Δ 00007f78
  ...
  Δ 0009ef74
  Δ 000a29ab
...
```
In our example, our object requires a source which, in turn, is a patch requiring another source, and so on (you can see this chain in the `tree` output). The length of this patch chain corresponds to the depth of our object. There is therefore a succession of patches between objects. What Carton tries to do is find the best patch from a window of possibilities and keep the best candidates for reuse. If we unroll this chain of patches, we find a "base" object (at `0x000026ab`) that is simply compressed with zlib, and this base turns out to be the content of yet another GitHub notification email. This shows that Carton is good at finding the best candidate for a patch: similar content, in this case another GitHub notification.
The idea is to sacrifice a little computing time (reconstructing objects via their patches) to gain in compression ratio. It's fair to say that a very long patch chain can degrade performance; this is why Git and Carton limit a chain to a length of 50. Another point is the search for the candidate source of a patch, which is often physically close to the patched object (within a few bytes): reading the PACK file page by page (thanks to Cachet) sometimes gives access to 3 or 4 objects at once, which have a good chance of being patched together.
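The reconstruction walk just described can be sketched as follows. Here, `load` and `apply` are hypothetical stand-ins for Carton's real primitives (page-cached reads and `duff` patch application); only the shape of the recursion and the depth limit reflect the text above:

```ocaml
(* A sketch of rebuilding a patched object: follow sources down to a
   zlib-only "base", then apply the patches on the way back up.
   [load] and [apply] stand in for Carton's real machinery. *)
type obj =
  | Base of string                          (* stored with zlib only *)
  | Patch of { source : int; patch : string }

let max_depth = 50 (* chain-length limit shared by Git and Carton *)

let rec reconstruct ?(depth = 0) ~load ~apply offset =
  if depth > max_depth then failwith "patch chain too deep"
  else
    match load offset with
    | Base raw -> raw
    | Patch { source; patch } ->
        let src = reconstruct ~depth:(depth + 1) ~load ~apply source in
        apply ~source:src patch
```

With a toy `apply` that simply appends the patch, a chain `Base "A"` → `"+B"` → `"+C"` unrolls to `"A+B+C"`, echoing the 'A'/'A+B' blob example from the beginning of the article.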
Let's take the example of Carton and a Git object:
```sh
$ carton get pack-*.idx eaafd737886011ebc28e6208e03767860c22e77d
...
cache misses: 62
cache hits: 758
tree: 160720bb
  Δ 160ae4bc
  Δ 160ae506
  Δ 160ae575
  Δ 160ae5be
  Δ 160ae5fc
  Δ 160ae62f
  Δ 160ae667
  Δ 160ae6a5
  Δ 160ae6db
  Δ 160ae72a
  Δ 160ae766
  Δ 160ae799
  Δ 160ae81e
  Δ 160ae858
  Δ 16289943
```
We can see here that we had to load 62 pages, but that we also reused already-read pages 758 times. We can also see that the offsets of the patches (visible in the `tree` output) are always close: the objects often follow each other.
## Mbox and real emails
In a way, the concrete cases used here are my own emails. There may be a fairly simple bias, which is that all these emails have the same destination: romain.calascibetta@gmail.com. This common point can also have a significant impact on compression with `duff`. We will therefore try another corpus: the archives of certain mailing lists relating to OCaml, from lists.ocaml.org-archive.
```sh
$ blaze.mbox lists.ocaml.org-archive/pipermail/opam-devel.mbox/opam-devel.mbox \
  -o opam-devel.pack
$ gzip -c lists.ocaml.org-archive/pipermail/opam-devel.mbox/opam-devel.mbox \
  > opam-devel.mbox.gzip
$ du -sh opam-devel.pack opam-devel.mbox.gzip \
  lists.ocaml.org-archive/pipermail/opam-devel.mbox/opam-devel.mbox
3.9M opam-devel.pack
2.0M opam-devel.mbox.gzip
10M lists.ocaml.org-archive/pipermail/opam-devel.mbox/opam-devel.mbox
```
The compression ratio is a bit worse than before, but we're still onto something interesting. Here again, we can take an object from our PACK file and see how the compression between objects behaves:
```sh
$ blaze.pack index opam-devel.pack
...
09bbd28303c8aafafd996b56f9c071a3add7bd92 b 362504 271 10 60793428 412b1fbeb6ee4a05fe8587033c1a1d8ca2ef5b35
$ carton get opam-devel.idx 09bbd28303c8aafafd996b56f9c071a3add7bd92 \
  --without-metadata --raw | wc -c
2098
```
Once again, we see a ratio of 10! Here the object corresponds to the header of an email, compressed against other email headers. This is the situation where fields are common to all your emails (`From`, `Subject`, etc.). Carton successfully patches headers together and email contents together.
## Next things
All the work done on email archiving is aimed at producing a unikernel (`void`) that could archive all incoming emails. Eventually, this unikernel could send the archive back (via an email!) to those who want it. This is one of our goals for implementing a mailing list in OCaml with unikernels.
Another objective is to create a database system for emails, in order to offer two features to users that we consider important:
- quick and easy access to emails
- saving disk space through compression
With this system, we can extend the indexing of emails with other information, such as the keywords found in them. This will enable us, among other things, to create an email search engine!
## Conclusion
This milestone in our PTT project took quite a long time, as we paid close attention to metrics such as compression ratio and software execution speed.
The experience we'd gained with emails (in particular with Mr. MIME) enabled us to move a little faster, especially in terms of serialising our emails. Our experience with ocaml-git also enabled us to identify the benefits of the PACK format for emails.
The development of Miou was also particularly helpful in keeping execution times satisfying, thanks to the ability to parallelise certain computations quite easily.
The format is still a little rough and not quite ready for the development of a keyword-based email indexing system, but it provides a good basis for the rest of our project.
So, if you like what we're doing and want to help, you can make a donation via GitHub or using our IBAN.