<!doctype html>
<html lang="en">
<head>
<meta charset="utf-8">
<meta http-equiv="x-ua-compatible" content="ie=edge">
<meta name="viewport" content="width=device-width, initial-scale=1">
<title>
Robur's blog - Git, Carton and emails
</title>
<meta name="description" content="A way to store and archive your emails">
<link type="text/css" rel="stylesheet" href="/css/hl.css">
<link type="text/css" rel="stylesheet" href="/css/style.css">
<script src="/js/hl.js"></script>
<link rel="alternate" type="application/rss+xml" href="/feed.xml" title="blog.robur.coop">
</head>
<body>
<header>
<h1>blog.robur.coop</h1>
<blockquote>
The <strong>Robur</strong> cooperative blog.
</blockquote>
</header>
<main><a href="/index.html">Back to index</a>

<article>
<h1>Git, Carton and emails</h1>
<ul class="tags-list"><li><a href="/tags.html#tag-emails">emails</a></li><li><a href="/tags.html#tag-storage">storage</a></li><li><a href="/tags.html#tag-Git">Git</a></li></ul><p>We are pleased to announce the release of Carton 1.0.0 and Cachet. You can have
|
||||||
|
an overview of these libraries in our announcement on the OCaml forum. This
|
||||||
|
article goes into more detail about the PACK format and its use for archiving
|
||||||
|
your emails.</p>
|
||||||
|
<h2 id="back-to-git-and-patches"><a class="anchor" aria-hidden="true" href="#back-to-git-and-patches"></a>Back to Git and patches</h2>
|
||||||
|
<p>In our Carton announcement, we talked about two levels of compression for Git
objects: zlib compression, and compression between objects using patches.</p>
<p>For example, if we have two blobs (two versions of a file), one containing
'A' and the other containing 'A+B', the second blob will probably be saved as a
patch that requires the contents of the first blob and adds '+B'. At a higher
level, and given the way we use Git, we can see why this second level of
compression is so interesting: in a project, we generally only add a few lines
to our files (introducing a new function) or delete a few (removing code).</p>
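<p>To make this idea concrete, here is a small OCaml sketch (purely illustrative, not
the actual representation used by Git or Carton) of a patch expressed as "copy" and
"insert" instructions, which is the spirit of the deltas stored in a PACK file. The
<code>instruction</code> type and the <code>apply</code> function are assumptions made
for this example.</p>
<pre><code class="language-ocaml">(* Illustrative only: a patch replays ranges of a source object and
   inserts the bytes that the source does not contain. *)
type instruction =
  | Copy of { off : int; len : int }  (* take [len] bytes of the source at [off] *)
  | Insert of string                  (* new bytes, absent from the source *)

let apply source patch =
  List.map
    (function
      | Copy { off; len } -> String.sub source off len
      | Insert s -> s)
    patch
  |> String.concat ""

(* Rebuilding "A+B" from the source "A": copy the source, then insert "+B". *)
let () = assert (apply "A" [ Copy { off = 0; len = 1 }; Insert "+B" ] = "A+B")
</code></pre>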
<p>However, there is a gap between what Git actually does and what we perceive.
When we think of patches in the context of Git, we usually think of the
<a href="https://opensource.janestreet.com/patdiff/">patience diff</a> or the <a href="https://www.nathaniel.ai/myers-diff/">Eugene Myers diff</a>.
While these offer the advantage of readability, in the sense of knowing what has
been added or deleted between two files, they are not necessarily optimal for
producing a <em>small</em> patch.</p>
<p>In reality, when it comes to storing these patches and transmitting them over
the network, what matters is not their readability but how precisely they capture
what two files have in common and what they do not. This is where
<a href="https://github.com/mirage/duff">duff</a> comes in.</p>
<p>This is a small library which can generate a patch between two files based on
the series of bytes common to both. We talk about 'series of bytes' here because
the elements common to our two files are not necessarily human-readable. To find
these common series of bytes, we use the <a href="https://en.wikipedia.org/wiki/Rabin_fingerprint">Rabin
fingerprint</a> algorithm: <a href="https://en.wikipedia.org/wiki/Rolling_hash">a rolling hash</a> used since time
immemorial.</p>
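<p>To give an idea of how such a rolling hash works, here is a minimal OCaml sketch
(a simplified polynomial rolling hash, not duff's actual implementation): once a
window of bytes has been hashed, sliding the window one byte to the right only
requires the byte that leaves and the byte that enters, which is what makes it
cheap to scan a whole file for windows already seen in the source.</p>
<pre><code class="language-ocaml">(* Simplified polynomial rolling hash over a window of [len] bytes (not
   duff's implementation). *)
let base = 257
let modulus = 1_000_000_007

(* Hash of the window starting at [off]. *)
let hash s ~off ~len =
  let h = ref 0 in
  for i = off to off + len - 1 do
    h := (!h * base + Char.code s.[i]) mod modulus
  done;
  !h

(* Slide the window one byte to the right: [out] leaves, [in_] enters.
   [pow] is base^(len-1) mod modulus, precomputed once for the window size. *)
let roll ~pow h ~out ~in_ =
  let h = (h - (Char.code out * pow) mod modulus + modulus) mod modulus in
  (h * base + Char.code in_) mod modulus
</code></pre>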
<h3 id="patches-and-emails"><a class="anchor" aria-hidden="true" href="#patches-and-emails"></a>Patches and emails</h3>
|
||||||
|
<p>As far as emails are concerned, it's fairly obvious that many "words" are
common to all of them. The simple word <code>From:</code> should appear in every
single email.</p>
<p>From this simple idea, we can see the impact: the headers of your emails are
broadly similar and carry more or less the same content. The idea of
<code>duff</code>, applied to your emails, is to treat every other email as a
"slightly" different version of your first one:</p>
<ol>
<li>we store a single raw email</li>
<li>and we build all your other emails as patches derived from this first one</li>
</ol>
<p>A fairly concrete example of this patch-based compression is two GitHub
notification emails: they are quite similar, particularly in their headers. Even
the content is just as similar: the HTML remains the same, only the comment
differs.</p>
<pre><code class="language-shell">$ carton diff github.01.eml github.02.eml -o patch.diff
|
||||||
|
$ du -sb github.01.eml github.02.eml patch.diff
|
||||||
|
9239 github.01.eml
|
||||||
|
9288 github.02.eml
|
||||||
|
5136 patch.diff
|
||||||
|
</code></pre>
|
||||||
|
<p>This example shows that the patch for rebuilding <code>github.02.eml</code> from
<code>github.01.eml</code> is almost half the size of the email itself. With the PACK
format, this patch will also be compressed with zlib (and we can reach ~2900
bytes, so roughly three times smaller).</p>
<h4 id="compress-and-compress"><a class="anchor" aria-hidden="true" href="#compress-and-compress"></a>Compress and compress!</h4>
|
||||||
|
<p>To put this into perspective, a compression algorithm like zlib can also reach
such a ratio (three times smaller) on its own. But zlib also needs to serialise
the Huffman tree required for compression (in the general case). What we observe
is that concatenating separately compressed emails makes it difficult to maintain
such a ratio; concatenating all the emails first and compressing the whole gives a
better ratio!</p>
<p>That's what the PACK file is all about: the aim is to be able to concatenate
these compressed emails and keep an interesting overall compression ratio. This
is the reason for the patches: they reduce the objects even further so that the
impact of the zlib <em>header</em> on all our objects is minimal and, above
all, so that we can access objects <strong>without</strong> having to decompress the
previous ones (as we would have to do for a <code>*.tar.gz</code> archive, for example).</p>
<p>The initial intuition about emails was right: they do indeed share quite a few
elements, and in the end we were able to save ~4000 bytes in our GitHub
notification example.</p>
<h2 id="isomorphism-dkim-and-arc"><a class="anchor" aria-hidden="true" href="#isomorphism-dkim-and-arc"></a>Isomorphism, DKIM and ARC</h2>
|
||||||
|
<p>One attribute that we wanted to pay close attention to throughout our
experiments was "isomorphism". This property is very simple: imagine a function
that takes an email as input and transforms it into another value using some
method (such as compression). Isomorphism ensures that we can 'undo' this method
and obtain exactly the same email again:</p>
<pre><code> decode(encode(x)) == x
</code></pre>
<p>This property is very important for emails because emails contain signatures,
and these signatures are derived from the email's content. If the email changes,
the signatures change too.</p>
<p>For instance, the DKIM signature allows you to sign an email and check its
integrity on receipt. ARC (which will be our next objective) also signs your
emails, but goes one step further: every relay that receives your email and
forwards it to its real destination must add a new ARC signature, just like
adding a new block to the Bitcoin blockchain.</p>
<p>So you need to make sure that the way you serialise your email (in a PACK file)
doesn't alter its content, in order to keep these signatures valid! It just so
happens that here too we have a lot of experience with Git. Git has the same
constraint with <a href="https://en.wikipedia.org/wiki/Merkle_tree">Merkle trees</a> and, for our part, we've
developed a library that generates an encoder and a decoder from a single
description and that respects the isomorphism property <em>by construction</em>: the
<a href="https://github.com/mirage/encore">encore</a> library.</p>
<p>We could then store our emails as they are in the PACK file. However, the
advantage of <code>duff</code> really comes into play when several objects are similar. In
the case of Git, tree objects resemble other trees but not commits, for example.
Emails have a similar distinction: email headers resemble other headers, but not
email contents.</p>
<p>You can therefore try to "split" emails into two parts, the header on one side
and the content on the other. We would then have a third value telling us how to
reconstruct the complete email (i.e. identifying where the header is and where
the content is).</p>
<p>However, after years of reading email RFCs, I can tell you that things are much
more complex. Above all, this experience has enabled me to synthesise a skeleton
that all emails share:</p>
<pre><code class="language-ocaml">(* multipart-body :=
|
||||||
|
[preamble CRLF]
|
||||||
|
--boundary transport-padding CRLF
|
||||||
|
part
|
||||||
|
( CRLF --boundary transport-padding CRLF part )*
|
||||||
|
CRLF
|
||||||
|
--boundary-- transport-padding
|
||||||
|
[CRLF epilogue]
|
||||||
|
|
||||||
|
part := headers ( CRLF body )?
|
||||||
|
*)
|
||||||
|
|
||||||
|
type 'octet body =
|
||||||
|
| Multipart of 'octet multipart
|
||||||
|
| Single of 'octet option
|
||||||
|
| Message of 'octet t
|
||||||
|
|
||||||
|
and 'octet part = { headers : 'octet; body : 'octet body }
|
||||||
|
|
||||||
|
and 'octet multipart =
|
||||||
|
{ preamble : string
|
||||||
|
; epilogue : string * transport_padding;
|
||||||
|
; boundary : string
|
||||||
|
; parts : (transport_padding * 'octet part) list }
|
||||||
|
|
||||||
|
and 'octet t = 'octet part
|
||||||
|
</code></pre>
|
||||||
|
<p>As you can see, the distinction is not only between the header and the content
but also between the parts of an email as soon as it has an attachment. You can
also have an email inside an email (and I'm always surprised to see how
<em>frequent</em> this particular case is). Finally, there are the annoying
<em>preamble</em> and <em>epilogue</em> of a multipart email, which are often empty but
still necessary: you always have to ensure isomorphism, because even "useless"
bytes count for the signatures.</p>
<p>We'll therefore need to serialise this structure, and all we have to do is
transform a <code>string t</code> into a <code>SHA1.t t</code> so that our structure no longer
contains the actual contents of our emails but unique identifiers referring to
those contents, which will be available in our PACK file.</p>
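<p>As an illustration (assuming the skeleton above and <code>module SHA1 = Digestif.SHA1</code>),
the same email can be represented once with its raw contents and once with hashes
pointing into the PACK file; the hypothetical values below only differ in the type of
their leaves:</p>
<pre><code class="language-ocaml">(* Hypothetical values, for illustration: the structure is identical,
   only the type of the leaves changes. *)
let raw : string part =
  { headers = "From: robur@robur.coop\r\nSubject: Hello\r\n"
  ; body = Single (Some "Hello World!\r\n") }

let packed : SHA1.t part =
  { headers = SHA1.digest_string "From: robur@robur.coop\r\nSubject: Hello\r\n"
  ; body = Single (Some (SHA1.digest_string "Hello World!\r\n")) }
</code></pre>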
<pre><code class="language-ocaml">module Format : sig
|
||||||
|
val t : SHA1.t Encore.t
|
||||||
|
end
|
||||||
|
|
||||||
|
let decode =
|
||||||
|
let parser = Encore.to_angstrom Format.t in
|
||||||
|
Angstrom.parse_string ~consume:All parser str
|
||||||
|
|
||||||
|
let encode =
|
||||||
|
let emitter = Encore.to_lavoisier Format.t in
|
||||||
|
Encore.Lavoisier.emit_string ~chunk:0x7ff t emitter
|
||||||
|
</code></pre>
|
||||||
|
<p>However, we need to check that the isomorphism is respected. You should be
aware that work on <a href="https://github.com/mirage/mrmime">Mr. MIME</a> has already been done on this subject with
the <a href="https://afl-1.readthedocs.io/en/latest/fuzzing.html">afl</a> fuzzer: checking our assertion <code>x == decode(encode(x))</code>. This
ability to check isomorphism using afl has also enabled us to use it to generate
valid random emails. This allows me to reintroduce the
<a href="https://github.com/mirage/hamlet">hamlet</a> project, perhaps the biggest database of valid (but
incomprehensible) emails. So we've checked that our encoder/decoder for
"splitting" our emails respects isomorphism on this corpus of a million emails.</p>
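<p>As a minimal sketch of what this check looks like (reusing the <code>decode</code> and
<code>encode</code> functions above; the helpers below are ours, not part of Mr. MIME or
encore), the round-trip property can be asserted over a corpus like this:</p>
<pre><code class="language-ocaml">(* Round-trip check: re-encoding a decoded email must give back the
   exact original bytes, otherwise the signatures would break. *)
let isomorphic str =
  match decode str with
  | Ok t -> String.equal (encode t) str
  | Error _ -> false

let check_corpus files =
  List.for_all
    (fun file -> isomorphic (In_channel.with_open_bin file In_channel.input_all))
    files
</code></pre>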
<h2 id="carton-pop3--mbox-some-metrics"><a class="anchor" aria-hidden="true" href="#carton-pop3--mbox-some-metrics"></a>Carton, POP3 & mbox, some metrics</h2>
|
||||||
|
<p>We can therefore split an email into several parts and calculate an optimal
patch between two similar pieces of content. So now we can start building PACK
files! This is where I reintroduce a tool that hasn't been released yet, but
which lets me go even further with emails: <a href="https://github.com/dinosaure/blaze">blaze</a>.</p>
<p>This little tool is my <em>Swiss army knife</em> for emails! And it's in this tool
that we're going to have fun adapting Carton so that it manipulates emails
rather than Git objects. So we've implemented the very basic <a href="https://en.wikipedia.org/wiki/Post_Office_Protocol">POP3</a>
protocol (thanks to <a href="https://github.com/mirleft/ocaml-tls">ocaml-tls</a> for offering encrypted
connections for free) as well as the <a href="https://en.wikipedia.org/wiki/Mbox">mbox</a> format.</p>
<p>Both are <strong>not</strong> recommended. The first is an old protocol, and interacting with
Gmail, for example, is very slow. The second is an old, non-standardised format
for storing your emails, and unfortunately it may well be the format used by your
email client. After working around a few quirks, such as the unspecified
behaviour of pop.gmail.com and the mix of CRLF and LF in the mbox format, you end
up with lots of emails that you can have fun packing!</p>
<pre><code class="language-shell">$ mkdir mailbox
|
||||||
|
$ blaze.fetch pop3://pop.gmail.com -p $(cat password.txt) \
|
||||||
|
-u recent:romain.calascibetta@gmail.com -f 'mailbox/%s.eml' > mails.lst
|
||||||
|
$ blaze.pack make -o mailbox.pack mails.lst
|
||||||
|
$ tar czf mailbox.tar.gz mailbox
|
||||||
|
$ du -sh mailbox mailbox.pack mailbox.tar.gz
|
||||||
|
97M mailbox
|
||||||
|
28M mailbox.pack
|
||||||
|
23M mailbox.tar.gz
|
||||||
|
</code></pre>
|
||||||
|
<p>In this example, we download the emails received over the last 30 days via POP3
and store them in the <code>mailbox/</code> folder. This folder weighs 97M and, if we
compress it with gzip, we end up with 23M. The problem is that we need to
decompress the whole <code>mailbox.tar.gz</code> archive to extract the emails.</p>
<p>This is where the PACK file comes in handy: it only weighs 28M (so we're very
close to what <code>tar</code> and <code>gzip</code> can do) but we can rebuild our emails without
unpacking everything:</p>
<pre><code class="language-shell">$ blaze.pack index mailbox.pack
|
||||||
|
$ blaze.pack list mailbox.pack | head -n1
|
||||||
|
0000000c 4e9795e268313245f493d9cef1b5ccf30cc92c33
|
||||||
|
$ blaze.pack get mailbox.idx 4e9795e268313245f493d9cef1b5ccf30cc92c33
|
||||||
|
Delivered-To: romain.calascibetta@gmail.com
|
||||||
|
...
|
||||||
|
</code></pre>
|
||||||
|
<p>Like Git, we now associate a hash with each of our emails and can retrieve them
using this hash. Like Git, we also compute a <code>*.idx</code> file to associate each hash
with the position of the email in our PACK file. Just like Git (with <code>git show</code>
or <code>git cat-file</code>), we can now access our emails very quickly. So we have a
(read-only) database system for our emails: we can now archive them!</p>
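<p>Conceptually (and ignoring the fanout table that speeds up the real lookup), an
<code>*.idx</code> file boils down to a sorted array of hashes with their offsets in the PACK
file, and finding an email is a binary search. Here is a rough, illustrative sketch:</p>
<pre><code class="language-ocaml">(* Illustrative sketch of an index lookup: [idx] is sorted by hash and
   maps every hash to the offset of the object inside the PACK file. *)
let find (idx : (string * int) array) hash =
  let rec go lo hi =
    if lo >= hi then None
    else
      let mid = (lo + hi) / 2 in
      let h, offset = idx.(mid) in
      let c = String.compare hash h in
      if c = 0 then Some offset
      else if c > 0 then go (mid + 1) hi
      else go lo mid
  in
  go 0 (Array.length idx)
</code></pre>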
<p>Let's have a closer look at this PACK file. We've developed a tool more or less
similar to <code>git verify-pack</code> which lists all the objects in our PACK file and,
above all, gives us information such as the number of bytes needed to store
these objects:</p>
<pre><code class="language-shell">$ blaze.pack verify mailbox.pack
|
||||||
|
4e9795e268313245f493d9cef1b5ccf30cc92c33 a 12 186 6257b7d4
|
||||||
|
...
|
||||||
|
517ccbc063d27dbd87122380c9cdaaadc9c4a1d8 b 666027 223 10 e8e534a6 cedfaf6dc22f3875ae9d4046ea2a51b9d5c6597a
|
||||||
|
</code></pre>
|
||||||
|
<p>It shows the hash of our object, its type (<code>a</code> for the structure of our email,
<code>b</code> for a piece of content), its position in the PACK file, the number of bytes
needed to store the object (!) and finally the depth of the patch, the checksum,
and the source needed to rebuild the patched object.</p>
<p>Here, our first object is not patched, but the next object is. Note that it
only needs 223 bytes in the PACK file. But what is the real size of this
object?</p>
<pre><code class="language-shell">$ carton get mailbox.idx 517ccbc063d27dbd87122380c9cdaaadc9c4a1d8 \
|
||||||
|
--raw --without-metadata | wc -c
|
||||||
|
2014
|
||||||
|
</code></pre>
|
||||||
|
<p>So we've gone from 2014 bytes to 223 bytes: that's almost a compression ratio
of 10! In this case, the object is the content of an email. Guess which one? A
GitHub notification! If we go back to our very first example, we saw that we
could compress with a ratio close to 2, and that we could go further with zlib
by compressing the patch too. This example lets us introduce one last feature of
PACK files: the depth.</p>
<pre><code class="language-shell">$ carton get mailbox.idx 517ccbc063d27dbd87122380c9cdaaadc9c4a1d8
|
||||||
|
kind: b
|
||||||
|
length: 2014 byte(s)
|
||||||
|
depth: 10
|
||||||
|
cache misses: 586
|
||||||
|
cache hits: 0
|
||||||
|
tree: 000026ab
|
||||||
|
Δ 00007f78
|
||||||
|
...
|
||||||
|
Δ 0009ef74
|
||||||
|
Δ 000a29ab
|
||||||
|
...
|
||||||
|
</code></pre>
|
||||||
|
<p>In our example, our object requires a source which, in turn, is a patch
requiring another source, and so on (you can see this chain in the <code>tree</code>
output). The length of this patch chain corresponds to the depth of our object.
There is therefore a succession of patches between objects. What Carton tries to
do is find the best patch from a window of candidates and keep the best
candidates for reuse. If we unroll this chain of patches, we find a "base"
object (at <code>0x000026ab</code>) that is simply compressed with zlib, and this base is
itself the content of another GitHub notification email. This shows that Carton
does a good job of finding the best candidate for a patch: similar content, in
this case another GitHub notification.</p>
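<p>One way to picture the depth (a sketch with made-up types, not Carton's API):
rebuilding an object means walking down its patch chain until a base object is
found, then replaying each patch on the way back up. The deeper the chain, the more
work each access requires.</p>
<pre><code class="language-ocaml">(* Illustrative only: an object in the PACK file is either a base
   (plain zlib-compressed data) or a patch against another offset. *)
type entry =
  | Base of string
  | Patch of { source : int; patch : string }

(* [load] reads the entry at an offset, [apply] replays a patch on its
   source; the recursion depth is exactly the "depth" reported above. *)
let rec rebuild ~load ~apply offset =
  match load offset with
  | Base raw -> raw
  | Patch { source; patch } -> apply (rebuild ~load ~apply source) patch
</code></pre>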
<p>The idea is to sacrifice a little computing time (in the reconstruction of
objects via their patches) to gain in compression ratio. It's fair to say that
a very long patch chain can degrade performance; however, there is a limit in
both Git and Carton: a chain can't be longer than 50. Another point is the search
for the candidate source of a patch, which is often physically close to the patch
(within a few bytes): reading the PACK file page by page (thanks to Cachet)
sometimes gives access to 3 or 4 objects at once, which have a good chance of
being patched together.</p>
<p>Let's take the example of Carton and a Git object:</p>
<pre><code class="language-shell">$ carton get pack-*.idx eaafd737886011ebc28e6208e03767860c22e77d
|
||||||
|
...
|
||||||
|
cache misses: 62
|
||||||
|
cache hits: 758
|
||||||
|
tree: 160720bb
|
||||||
|
Δ 160ae4bc
|
||||||
|
Δ 160ae506
|
||||||
|
Δ 160ae575
|
||||||
|
Δ 160ae5be
|
||||||
|
Δ 160ae5fc
|
||||||
|
Δ 160ae62f
|
||||||
|
Δ 160ae667
|
||||||
|
Δ 160ae6a5
|
||||||
|
Δ 160ae6db
|
||||||
|
Δ 160ae72a
|
||||||
|
Δ 160ae766
|
||||||
|
Δ 160ae799
|
||||||
|
Δ 160ae81e
|
||||||
|
Δ 160ae858
|
||||||
|
Δ 16289943
|
||||||
|
</code></pre>
|
||||||
|
<p>We can see here that we had to load 62 pages, but also that we reused pages we
had already read 758 times. We can also see that the offsets of the patches
(visible in the <code>tree</code> output) are always close together: the objects often
follow one another.</p>
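<p>To explain these counters, here is an illustrative page cache in OCaml (a sketch
under our own assumptions, not Cachet's actual API): the PACK file is read by
fixed-size pages, recently used pages are kept around, and objects that sit close
together end up being served from pages that are already loaded.</p>
<pre><code class="language-ocaml">(* Illustrative page cache: a hit means the page was already loaded,
   a miss means we had to read it from the PACK file. *)
let page_size = 4096

module Cache = Map.Make (Int)

type 'fd t =
  { fd : 'fd
  ; read_page : 'fd -> int -> string  (* reads [page_size] bytes at a page index *)
  ; mutable pages : string Cache.t
  ; mutable hits : int
  ; mutable misses : int }

let load t ~offset =
  let page = offset / page_size in
  match Cache.find_opt page t.pages with
  | Some bytes -> t.hits <- t.hits + 1; bytes
  | None ->
      t.misses <- t.misses + 1;
      let bytes = t.read_page t.fd page in
      t.pages <- Cache.add page bytes t.pages;
      bytes
</code></pre>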
<h3 id="mbox-and-real-emails"><a class="anchor" aria-hidden="true" href="#mbox-and-real-emails"></a>Mbox and real emails</h3>
|
||||||
|
<p>In a way, the concrete cases we've used so far are my own emails. There may be a
fairly simple bias here, which is that all these emails have the same destination:
romain.calascibetta@gmail.com. This common point can also have a significant
impact on compression with <code>duff</code>. We will therefore try another corpus, the
archives of some mailing lists relating to OCaml:
<a href="https://github.com/ocaml/lists.ocaml.org-archive">lists.ocaml.org-archive</a></p>
<pre><code class="language-shell">$ blaze.mbox lists.ocaml.org-archive/pipermail/opam-devel.mbox/opam-devel.mbox \
|
||||||
|
-o opam-devel.pack
|
||||||
|
$ gzip -c lists.ocaml.org-archive/pipermail/opam-devel.mbox/opam-devel.mbox \
|
||||||
|
> opam-devel.mbox.gzip
|
||||||
|
$ du -sh opam-devel.pack opam-devel.mbox.gzip \
|
||||||
|
lists.ocaml.org-archive/pipermail/opam-devel.mbox/opam-devel.mbox
|
||||||
|
3.9M opam-devel.pack
|
||||||
|
2.0M opam-devel.mbox.gzip
|
||||||
|
10M lists.ocaml.org-archive/pipermail/opam-devel.mbox/opam-devel.mbox
|
||||||
|
</code></pre>
|
||||||
|
<p>The compression ratio is a bit worse than before, but we're still onto
something interesting. Here again we can take an object from our PACK file and
see how the compression between objects behaves:</p>
<pre><code class="language-shell">$ blaze.pack index opam-devel.pack
|
||||||
|
...
|
||||||
|
09bbd28303c8aafafd996b56f9c071a3add7bd92 b 362504 271 10 60793428 412b1fbeb6ee4a05fe8587033c1a1d8ca2ef5b35
|
||||||
|
$ carton get opam-devel.idx 09bbd28303c8aafafd996b56f9c071a3add7bd92 \
|
||||||
|
--without-metadata --raw | wc -c
|
||||||
|
2098
|
||||||
|
</code></pre>
|
||||||
|
<p>Once again, we get a strong ratio: 2098 bytes stored in just 271 bytes. Here
the object corresponds to the header of an email, which is patched against other
email headers. This is the situation where fields are common to all your emails
(<code>From</code>, <code>Subject</code>, etc.). Carton successfully patches headers with headers and
email contents with contents.</p>
<h2 id="next-things"><a class="anchor" aria-hidden="true" href="#next-things"></a>Next things</h2>
|
||||||
|
<p>All the work done on email archiving is aimed at producing a unikernel (<code>void</code>)
that can archive all incoming emails. This unikernel could then send the archive
back (via an email!) to whoever wants it. This is one of our goals for
implementing a mailing list in OCaml with unikernels.</p>
<p>Another objective is to create a database system for emails in order to offer
two features to the user that we consider important:</p>
<ul>
<li>quick and easy access to emails</li>
<li>save disk space through compression</li>
</ul>
<p>With this system, we can extend the way emails are indexed with other
information, such as the keywords found in them. This will enable us, among
other things, to build an email search engine!</p>
<h2 id="conclusion"><a class="anchor" aria-hidden="true" href="#conclusion"></a>Conclusion</h2>
|
||||||
|
<p>This milestone in our PTT project took quite a long time, as we paid close
attention to metrics such as compression ratio and execution speed.</p>
<p>The experience we'd gained with emails (in particular with Mr. MIME) enabled us
to move a little faster, especially when it came to serialising our emails. Our
experience with ocaml-git also enabled us to identify the benefits of the PACK
file for emails.</p>
<p>The development of <a href="https://github.com/robur-coop/miou">Miou</a> was also particularly helpful with regard to
execution time, thanks to the ability to parallelise certain computations quite
easily.</p>
<p>The format is still a little rough and not quite ready for the development of a
keyword-based email indexing system, but it provides a good basis for the rest
of our project.</p>
<p>So, if you like what we're doing and want to help, you can make a donation via
<a href="https://github.com/sponsors/robur-coop">GitHub</a> or using our <a href="https://robur.coop/Donate">IBAN</a>.</p>
</article>

</main>
<footer>
<a href="https://github.com/xhtmlboi/yocaml">Powered by <strong>YOCaml</strong></a>
<br />
</footer>
<script>hljs.highlightAll();</script>
</body>
</html>