---
date: 2024-10-29
title: Git, Carton, Mmap and emails
description: A way to store your emails
tags:
- emails
- storage
- Git
author:
  name: Romain Calascibetta
  email: romain.calascibetta@gmail.com
  link: https://blog.osau.re/
breaks: false
---

We are pleased to announce the release of Carton 1.0.0 and Cachet. You can find an overview of these libraries in our announcement on the OCaml forum. This article goes into more detail about the PACK format and its use for archiving your emails.

## Back to Git and patches

In our Carton announcement, we talked about 2 levels of compression for Git objects: zlib compression and compression between objects using a patch.

For instance, if we have 2 blobs (2 versions of a file), one of which contains ‘A’ and the other ‘A+B’, the second blob will probably be saved in the form of a patch requiring the contents of the first blob and adding ‘+B’. At a higher level, and given the way we use Git, we understand why this second level of compression is so interesting: we generally just add or remove a few keywords in the files of our project.
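
To make this idea of a patch concrete, here is a minimal sketch of what such a patch boils down to: a list of instructions that either copy a range of bytes from the source blob or insert new bytes. The types and names are illustrative, not duff's actual API.

```ocaml
(* A patch is a list of instructions against a source blob: either copy a
   range of bytes from the source, or insert literal bytes. This mirrors the
   spirit of duff's hunks, but the types and names here are illustrative. *)
type instruction =
  | Copy of { off : int; len : int } (* take [len] bytes of the source at [off] *)
  | Insert of string                 (* append these literal bytes *)

let apply ~source patch =
  let buf = Buffer.create (String.length source) in
  List.iter
    (function
      | Copy { off; len } -> Buffer.add_substring buf source off len
      | Insert str -> Buffer.add_string buf str)
    patch;
  Buffer.contents buf

(* Rebuilding ‘A+B’ from ‘A’: copy the whole source, then insert "+B". *)
let () =
  let target = apply ~source:"A" [ Copy { off= 0; len= 1 }; Insert "+B" ] in
  assert (target = "A+B")
```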

However, there is a bias between what Git does and what we perceive. When it comes to patching in Git, we often think of the [patience diff][patience-diff] or the [Eugene Myers diff][eugene-myers-diff]. While these offer the advantage of readability, in terms of knowing what has been added or deleted between two files, they are not necessarily optimal for producing a _small_ patch.

In reality, what interests us for the storage and transmission of these patches over the network is not the readability of these patches but optimality in identifying what is common between two files and what is not. It is at this stage that [duff][duff] is introduced.

This is a small library which can generate a patch between two files according to the series of bytes common to both files. We're talking about ‘series of bytes’ here because the elements common to our two files are not necessarily human-readable. To find these common series of bytes, we use [Rabin's fingerprint][rabin] algorithm: [a rolling hash][rolling-hash] used since time immemorial.
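
To give an idea of what a rolling hash looks like, here is a minimal polynomial rolling hash in OCaml. It only sketches the principle: duff's real implementation uses Rabin's fingerprint over a fixed window, with different constants and a lookup table.

```ocaml
(* A rolling hash over a window of [w] bytes: once the window is full, the
   hash of the next window is derived from the previous one in O(1) by
   removing the oldest byte and adding the new one. The constants below are
   arbitrary; they are not the ones used by duff. *)
let w = 16            (* window size *)
let b = 257           (* base of the polynomial *)
let m = 1_000_000_007 (* modulus *)

(* b^(w-1) mod m, used to remove the contribution of the outgoing byte *)
let bw =
  let rec go acc n = if n = 0 then acc else go (acc * b mod m) (n - 1) in
  go 1 (w - 1)

let roll h ~outgoing ~incoming =
  let h = (h - (Char.code outgoing * bw) mod m + m) mod m in
  ((h * b) + Char.code incoming) mod m

(* Hash every window of [s] and return the list of (offset, hash) pairs. *)
let fingerprints s =
  let n = String.length s in
  if n < w then []
  else begin
    let h = ref 0 in
    for i = 0 to w - 1 do h := ((!h * b) + Char.code s.[i]) mod m done;
    let acc = ref [ (0, !h) ] in
    for i = w to n - 1 do
      h := roll !h ~outgoing:s.[i - w] ~incoming:s.[i];
      acc := (i - w + 1, !h) :: !acc
    done;
    List.rev !acc
  end
```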

### Patches and emails

So, as far as emails are concerned, it's fairly obvious that there are many "words" common to all your emails. The simple word `From:` should exist in all your emails.

From this simple idea, we can understand the impact: the headers of your emails are more or less similar and have more or less the same content. The idea of `duff`, applied to your emails, is to consider these other emails as "slightly" different versions of your first email:
1) we can store a single raw email
2) and we build patches for all your other emails from this first one

A fairly concrete example of this compression through patches and emails is 2 notification emails from GitHub: these are quite similar, particularly in the header. Even the content is just as similar: the HTML remains the same, only the comment differs.

```shell
$ carton diff github.01.eml github.02.eml -o patch.diff
$ du -sb github.01.eml github.02.eml patch.diff
9239 github.01.eml
9288 github.02.eml
5136 patch.diff
```

This example shows that our patch for rebuilding `github.02.eml` from `github.01.eml` is almost 2 times smaller. In this case, with the PACK format, this patch will also be compressed with zlib (and we can reach ~2900 bytes, so 3 times smaller).

#### Compress and compress!

To put this into perspective, a compression algorithm like zlib can also reach such a ratio (3 times smaller) on its own. But it also needs to serialise the Huffman tree required for compression (in the general case). What can be observed is that concatenating separately compressed emails makes it difficult to maintain such a ratio. In fact, concatenating all the emails first and compressing the result gives a better ratio!

That's what the PACK file is all about: the aim is to be able to concatenate these compressed emails and keep an interesting overall compression ratio. This is the reason for the patch: to reduce the objects even further so that the impact of the zlib _header_ on all our objects is minimal and, above all, so that we can access objects **without** having to decompress the previous ones (as we would have to do for a `*.tar.gz` archive, for example).
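
As a rough mental model (this is not the real PACK layout, only the principle), a pack is an append-only sequence of independently stored entries plus an index mapping each identifier to the offset of its entry, so reading one entry never requires reading the previous ones:

```ocaml
(* A toy "pack": entries are stored one after the other in a single buffer and
   an index maps an identifier to the offset and length of its entry. Reading
   an entry only touches that entry, unlike a .tar.gz which must be inflated
   from the start. The real PACK format stores zlib-compressed (and possibly
   patched) entries; compression is left out to keep the sketch short. *)
module Toy_pack = struct
  type t = { buf : Buffer.t; mutable index : (string * (int * int)) list }

  let create () = { buf = Buffer.create 0x1000; index = [] }

  let add t ~uid payload =
    let off = Buffer.length t.buf in
    Buffer.add_string t.buf payload;
    t.index <- (uid, (off, String.length payload)) :: t.index

  (* random access: only the requested entry is read *)
  let get t ~uid =
    let off, len = List.assoc uid t.index in
    Buffer.sub t.buf off len
end

let () =
  let pack = Toy_pack.create () in
  Toy_pack.add pack ~uid:"mail-1" "From: a@example\r\n...";
  Toy_pack.add pack ~uid:"mail-2" "From: b@example\r\n...";
  assert (Toy_pack.get pack ~uid:"mail-2" = "From: b@example\r\n...")
```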

The initial intuition about the emails was right: they do indeed share quite a few elements and, in the end, we were able to save ~4000 bytes in our GitHub notification example.

## Isomorphism, DKIM and ARC

One attribute that we wanted to pay close attention to throughout our experimentation was "isomorphism". This property is very simple: imagine a function that takes an email as input and transforms it into another value using a method (such as compression). Isomorphism ensures that we can ‘undo’ this method and obtain exactly the same result again:

```
x == decode(encode(x))
```

This property is very important for emails because signatures exist in your email and these signatures result from the content of your email. If the email changes, these signatures change too.

For instance, the DKIM signature allows you to sign an email and check its integrity on receipt. ARC (which will be our next objective) also signs your emails, but goes one step further: every relay that receives your email and forwards it to the real destination must add a new ARC signature, just like adding a new block to the Bitcoin blockchain.

So you need to make sure that the way you serialise your email (in a PACK file) doesn't alter the content, in order to keep these signatures valid! It just so happens that here too we have a lot of experience with Git. Git has the same constraint with [Merkle trees][merkle-tree] and, as far as we're concerned, we've developed a library that allows you to generate an encoder and a decoder from a single description, and which respects the isomorphism property _by construction_: the [encore][encore] library.
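
To illustrate the idea behind such a library (this is not encore's actual API, only the principle), a description can carry a parser and a printer at the same time, so that whatever we build by composing descriptions keeps the round-trip property:

```ocaml
(* The idea behind an "isomorphic" encoder/decoder: a description carries both
   directions at once, so composed descriptions can both parse and print, and
   printing then parsing gives back the original value. A simplified sketch of
   the principle, not encore's real combinators. *)
type 'a descr = { parse : string -> ('a * string) option; print : 'a -> string }

(* a literal token: parses itself, prints itself *)
let token s =
  { parse =
      (fun input ->
        let n = String.length s in
        if String.length input >= n && String.sub input 0 n = s
        then Some ((), String.sub input n (String.length input - n))
        else None)
  ; print = (fun () -> s) }

(* sequence two descriptions *)
let ( <*> ) a b =
  { parse =
      (fun input ->
        match a.parse input with
        | None -> None
        | Some (x, rest) -> (
            match b.parse rest with
            | None -> None
            | Some (y, rest) -> Some ((x, y), rest)))
  ; print = (fun (x, y) -> a.print x ^ b.print y) }

(* the round-trip property holds by construction for what we can describe *)
let () =
  let d = token "From: " <*> token "romain@example\r\n" in
  let printed = d.print ((), ()) in
  assert (d.parse printed = Some (((), ()), ""))
```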

We could then store our emails as they are in the PACK file. However, the advantage of `duff` really comes into play when several objects are similar. In the case of Git, tree objects are similar to each other but they are not similar to commits, for example. For emails, there is a similar distinction: the email headers are similar to each other but they are not similar to the email content.

We can therefore try to "split" emails into 2 parts, the header on one side and the content on the other. We would then have a third value which would tell us how to reconstruct the complete email (i.e. identify where the header is and where the content is).

However, after years of reading email RFCs, I can say that things are much more complex. Above all, this experience has enabled me to synthesise a skeleton that all emails share:

```ocaml
(* multipart-body :=
     [preamble CRLF]
     --boundary transport-padding CRLF
     part
     ( CRLF --boundary transport-padding CRLF part )*
     CRLF
     --boundary-- transport-padding
     [CRLF epilogue]

   part := headers ( CRLF body )?
*)

(* the (usually empty) whitespace allowed after a boundary *)
type transport_padding = string

type 'octet body =
  | Multipart of 'octet multipart
  | Single of 'octet option
  | Message of 'octet t

and 'octet part = { headers : 'octet; body : 'octet body }

and 'octet multipart =
  { preamble : string
  ; epilogue : string * transport_padding
  ; boundary : string
  ; parts : (transport_padding * 'octet part) list }

and 'octet t = 'octet part
```

As you can see, the distinction is not only between the header and the content but also between the parts of an email as soon as it has an attachment. You can also have an email inside an email (and I'm always surprised to see how _frequent_ this particular case is). Finally, there's the annoying _preamble_ and _epilogue_ of an email with several parts, which are often empty but necessary: you always have to ensure isomorphism, and even "useless" bytes count for the signatures.
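
For example, a small two-part email can be described with the types above as follows (all the header and body strings here are made up for the illustration):

```ocaml
(* A minimal multipart email expressed as a [string t]: two parts sharing one
   boundary, with empty preamble, epilogue and transport padding. *)
let example : string t =
  { headers = "MIME-Version: 1.0\r\nContent-Type: multipart/alternative; boundary=bnd\r\n"
  ; body =
      Multipart
        { preamble = ""
        ; epilogue = ("", "")
        ; boundary = "bnd"
        ; parts =
            [ ("", { headers = "Content-Type: text/plain\r\n"
                   ; body = Single (Some "Hello!\r\n") })
            ; ("", { headers = "Content-Type: text/html\r\n"
                   ; body = Single (Some "<p>Hello!</p>\r\n") }) ] } }
```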

We'll therefore need to serialise this structure, and all we have to do is transform a `string t` into a `SHA1.t t` so that our structure no longer contains the actual content of our emails but a unique identifier referring to this content, which will be available in our PACK file.
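
This transformation is just a map over the `'octet` parameter. A sketch of it could look like this, assuming a `SHA1` module in the style of `Digestif.SHA1` (the function names are ours, for the illustration):

```ocaml
(* Replace every piece of content by its identifier, leaving the structure
   untouched. [SHA1.digest_string] is assumed to come from a module such as
   Digestif.SHA1. *)
let rec map_part f { headers; body } =
  { headers = f headers; body = map_body f body }

and map_body f = function
  | Single None -> Single None
  | Single (Some content) -> Single (Some (f content))
  | Message t -> Message (map_part f t)
  | Multipart m ->
    Multipart
      { preamble = m.preamble
      ; epilogue = m.epilogue
      ; boundary = m.boundary
      ; parts = List.map (fun (pad, part) -> (pad, map_part f part)) m.parts }

let to_identifiers (mail : string t) : SHA1.t t =
  map_part SHA1.digest_string mail
```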

```ocaml
module Format : sig
  val t : SHA1.t Encore.t
end

let decode str =
  let parser = Encore.to_angstrom Format.t in
  Angstrom.parse_string ~consume:Angstrom.Consume.All parser str

let encode t =
  let emitter = Encore.to_lavoisier Format.t in
  Encore.Lavoisier.emit_string ~chunk:0x7ff t emitter
```

However, we need to check that the isomorphism is respected. You should be aware that work has already been done on this subject with [Mr. MIME][mrmime] and the [afl][afl] fuzzer: checking our assertion `x == decode(encode(x))`. So we went one step further and generated random emails from this fuzzer! This allows me to reintroduce the [hamlet][hamlet] project, perhaps the biggest database of valid (but incomprehensible) emails. Mr. MIME does more than we want to do here with encode/decode: it also parses email addresses, dates, and so on. Here, the aim is mainly to split our emails. We therefore took the time to check the isomorphism on our `hamlet` database: the result is that, among these 1M emails, not one has been altered!
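
Concretely, the check boils down to something like the following sketch, reusing the hypothetical `decode`/`encode` above over a directory of `.eml` files (the directory layout and function names are assumptions for the illustration):

```ocaml
(* Read every .eml file of a corpus and check that decoding then re-encoding
   gives back exactly the original bytes. [decode] and [encode] are the
   hypothetical functions sketched above. *)
let read_file path =
  let ic = open_in_bin path in
  let len = in_channel_length ic in
  let str = really_input_string ic len in
  close_in ic; str

let check_isomorphism dir =
  Sys.readdir dir |> Array.to_list
  |> List.filter (fun name -> Filename.check_suffix name ".eml")
  |> List.iter (fun name ->
         let str = read_file (Filename.concat dir name) in
         match decode str with
         | Error err -> Printf.eprintf "%s: invalid email (%s)\n%!" name err
         | Ok t -> assert (String.equal (encode t) str))

let () = check_isomorphism "hamlet/mails"
```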

## Carton, POP3 & mbox, some metrics

We can therefore split an email into several parts and calculate an optimal patch between two similar pieces of content. So now we can start packing! This is where I'm going to reintroduce a tool that hasn't been released yet, but which allows me to go even further with emails: [blaze][blaze].

This little tool is my _Swiss army knife_ for emails! And it's in this tool that we're going to have fun deriving Carton so that it can manipulate emails rather than Git objects. So we've implemented the very basic [POP3][pop3] protocol (and thanks to [ocaml-tls][tls] for offering a free encrypted connection) as well as the [mbox][mbox] format.

Both are **not** recommended. The first is an old protocol, and interacting with Gmail, for example, is very slow. The second is an old, non-standardised format for storing your emails, and unfortunately it may be the format used by your email client. After resolving a few bugs, such as the unspecified behaviour of pop.gmail.com and the mix of CRLF and LF in the mbox format, you'll end up with lots of emails that you can have fun packing!

```shell
$ mkdir mailbox
$ blaze.fetch pop3://pop.gmail.com -p $(cat password.txt) \
    -u recent:romain.calascibetta@gmail.com -f 'mailbox/%s.eml' > mails.lst
$ blaze.pack make -o mailbox.pack mails.lst
$ tar czf mailbox.tar.gz mailbox
$ du -sh mailbox mailbox.pack mailbox.tar.gz
97M mailbox
28M mailbox.pack
23M mailbox.tar.gz
```

In this example, we download the latest emails from the last 30 days via POP3 and store them in the `mailbox/` folder. This folder weighs 97M and, if we compress it with gzip, we end up with 23M. The problem is that we need to decompress the `mailbox.tar.gz` archive to extract the emails.

This is where the PACK file comes in handy: it only weighs 28M (so we're very close to what `tar` and `gzip` can do) but we can rebuild our emails without unpacking everything:

```shell
$ blaze.pack index mailbox.pack
$ blaze.pack list mailbox.pack | head -n1
0000000c 4e9795e268313245f493d9cef1b5ccf30cc92c33
$ blaze.pack get mailbox.idx 4e9795e268313245f493d9cef1b5ccf30cc92c33
Delivered-To: romain.calascibetta@gmail.com
...
```

Like Git, we now associate a hash with each of our emails and can retrieve them using this hash. Like Git, we also compute the `*.idx` file to associate the hash with the position of the email in our PACK file. Just like Git (with `git show` or `git cat-file`), we can now access our emails very quickly. So we now have a (read-only) database system for our emails: we can archive our emails!
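
The `*.idx` file is essentially a table from hashes to offsets. As a simplified illustration (this is not the real IDX layout), looking up an email then amounts to a binary search over the sorted hashes, followed by a read of the PACK file at the returned offset:

```ocaml
(* A simplified index: an array of (hash, offset) pairs sorted by hash. The
   real *.idx format adds a fanout table and checksums, but the lookup is the
   same idea: binary search on the hash, then read the PACK file at the
   returned offset. The entries below are made up. *)
type idx = (string * int) array (* (hex hash, offset in the PACK file) *)

let lookup (idx : idx) hash =
  let rec go lo hi =
    if lo >= hi then None
    else
      let mid = (lo + hi) / 2 in
      let h, off = idx.(mid) in
      match compare hash h with
      | 0 -> Some off
      | c when c < 0 -> go lo mid
      | _ -> go (mid + 1) hi
  in
  go 0 (Array.length idx)

let () =
  let idx = [| ("aaaa", 12); ("bbbb", 4242); ("cccc", 9000) |] in
  assert (lookup idx "bbbb" = Some 4242)
```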

Let's have a closer look at this PACK file. We've developed a tool more or less similar to `git verify-pack` which lists all the objects in our PACK file and, above all, gives us information such as the number of bytes needed to store these objects:

```shell
$ blaze.pack verify mailbox.pack
4e9795e268313245f493d9cef1b5ccf30cc92c33 a 12 186 6257b7d4
...
517ccbc063d27dbd87122380c9cdaaadc9c4a1d8 b 666027 223 10 e8e534a6 cedfaf6dc22f3875ae9d4046ea2a51b9d5c6597a
```

It shows the hash of our object, its type (`a` for the structure of our email, `b` for the content), its position in the PACK file, the number of bytes used to store the object (!) and, finally, the depth of the patch, the checksum and the source of the patch needed to rebuild the object.

Here, our first object is not patched, but the next one is. Note that it only needs 223 bytes in the PACK file. But what is the real size of this object?

```shell
$ carton get mailbox.idx 517ccbc063d27dbd87122380c9cdaaadc9c4a1d8 \
    --raw --without-metadata | wc -c
2014
```

So we've gone from 2014 bytes to 223 bytes: that's almost a compression ratio of 10! In this case, the object is the content of an email. Guess which one? A GitHub notification! If we go back to our very first example, we saw that we could compress with a ratio close to 2. Here we go even further because, in the PACK file, zlib also compresses the patch. This example allows us to introduce one last feature of PACK files: the depth.

```shell
$ carton get mailbox.idx 517ccbc063d27dbd87122380c9cdaaadc9c4a1d8
kind: b
length: 2014 byte(s)
depth: 10
cache misses: 586
cache hits: 0
tree: 000026ab
      Δ 00007f78
      ...
      Δ 0009ef74
      Δ 000a29ab
      ...
```

In our example, our object requires a source which, in turn, is a patch requiring another source, and so on (you can see this chain in the `tree` field). The length of this patch chain corresponds to the depth of our object. There is therefore a succession of patches between objects. What Carton tries to do is to find the best patch from a window of candidates and keep the best candidates for reuse. If we unroll this chain of patches, we find a "base" object (at `0x000026ab`) that is just compressed with zlib, and this base is also the content of another GitHub notification email: this confirms that Rabin's fingerprint algorithm works very well.
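
Rebuilding a deeply patched object therefore means resolving its whole chain: first reconstruct the source (recursively), then apply the patch on top of it. A sketch of that recursion, reusing the `instruction` type and `apply` function from the patch example above, could look like this (the `entry` type is an assumption, not Carton's representation):

```ocaml
(* An entry of the pack is either a "base" (plain, zlib-compressed in the real
   format) or a patch against another entry, identified here by its offset.
   [entries] maps an offset to its entry; this is an illustration, not
   Carton's actual data structures. *)
type entry =
  | Base of string                                        (* the raw object *)
  | Patch of { source : int; patch : instruction list }

let rec reconstruct entries ~offset =
  match List.assoc offset entries with
  | Base raw -> raw
  | Patch { source; patch } ->
      (* the depth is the number of times we recurse before reaching a Base *)
      let source = reconstruct entries ~offset:source in
      apply ~source patch
```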

### Mbox and real emails

So far, the concrete examples used here are my own emails. There may be a fairly simple bias, which is that all these emails have the same destination: romain.calascibetta@gmail.com. This is a common point which can also have a significant impact on compression with `duff`. We will therefore try another corpus, the archives of certain mailing lists relating to OCaml: [lists.ocaml.org-archive](https://github.com/ocaml/lists.ocaml.org-archive)

```shell
$ blaze.mbox lists.ocaml.org-archive/pipermail/opam-devel.mbox/opam-devel.mbox \
    -o opam-devel.pack
$ gzip -c lists.ocaml.org-archive/pipermail/opam-devel.mbox/opam-devel.mbox \
    > opam-devel.mbox.gzip
$ du -sh opam-devel.pack opam-devel.mbox.gzip \
    lists.ocaml.org-archive/pipermail/opam-devel.mbox/opam-devel.mbox
3.9M opam-devel.pack
2.0M opam-devel.mbox.gzip
10M lists.ocaml.org-archive/pipermail/opam-devel.mbox/opam-devel.mbox
```

The compression ratio is a bit worse than before, but we're still on to something interesting. Here again, we can take an object from our PACK file and see how the compression between objects behaves:

```shell
$ blaze.pack index opam-devel.pack
...
09bbd28303c8aafafd996b56f9c071a3add7bd92 b 362504 271 10 60793428 412b1fbeb6ee4a05fe8587033c1a1d8ca2ef5b35
$ carton get opam-devel.idx 09bbd28303c8aafafd996b56f9c071a3add7bd92 \
    --without-metadata --raw | wc -c
2098
```

Once again, we see a significant ratio: 271 bytes in the PACK file instead of 2098! Here, the object corresponds to the header of an email, patched against other email headers. This is the situation where the fields are common to all your emails (`From`, `Subject`, etc.).

## Next things

All the work done on email archiving is aimed at producing a unikernel (`void`) that could archive all incoming emails. Finally, this unikernel could send the archive back (via an email!) to those who want it. This is one of our goals in implementing a mailing list in OCaml with unikernels.

Another objective is to create a database system for emails in order to offer two features to the user that we consider important:
- quick and easy access to emails
- saving disk space through compression

With this system, we can extend the method of indexing emails with other information, such as the keywords found in the emails. This will enable us, among other things, to create an email search engine!

## Conclusion