Finish it

Calascibetta Romain 2025-01-13 12:55:00 +01:00
parent 086485e904
commit 3ebe667432


@ -1,7 +1,7 @@
---
date: 2024-10-29
title: Git, Carton, Mmap and emails
description: A way to store your emails
date: 2025-01-07
title: Git, Carton and emails
description: A way to store and archive your emails
tags:
- emails
- storage
@ -99,7 +99,7 @@ previous ones (as we would have to do for a `*.tar.gz` archive, for example).
The initial intuition about the emails was right, they do indeed share quite a
few elements together and in the end we were able to save ~4000 bytes in our
example GitHub notification example.
GitHub notification example.
## Isomorphism, DKIM and ARC
@ -204,14 +204,12 @@ let encode =
However, we need to check that the isomorphism is respected. You should be
aware that work on [Mr. MIME][mrmime] has already been done on this subject with
the [afl][afl] fuzzer: check our assertion `x == decode(encode(x))`. So we went
one step further and generated random emails from this fuzzer! This allows me to
reintroduce you to the [hamlet][hamlet] project, perhaps the biggest database of
valid — but incomprehensible — emails. In this case, Mr. MIME does more than just
encode/decode as we want to do here, it also parses email addresses, dates, and
so on. Here, the aim is mainly to split our emails. We therefore took the time
to check the isomorphism on our `hamlet` database: the result is that among
these 1M emails, not one has been altered!
the [afl][afl] fuzzer: check our assertion `x == decode(encode(x))`. This
ability to check isomorphism using afl has enabled us to use the latter to
generate valid random emails. This allows me to reintroduce you to the
[hamlet][hamlet] project, perhaps the biggest database of valid — but
incomprehensible — emails. So we've checked that our encoder/decoder for
“splitting” our emails respects the isomorphism on all one million of these emails.
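
As an illustration of the property `x == decode(encode(x))`, here is a minimal sketch. It assumes a toy codec that only splits a message at the first blank line. The names `split`, `find_sub`, `encode`, `decode`, `roundtrips` and `random_input` are hypothetical and are not the Mr. MIME or Carton API, and the random generator is a plain stand-in for what afl does far more cleverly.

```ocaml
(* Hypothetical toy codec: a message is split at the first "\r\n\r\n"
   into a header part and an optional body part. *)
type split = { header : string; body : string option }

(* Index of the first occurrence of [sub] in [s], if any. *)
let find_sub s sub =
  let n = String.length s and m = String.length sub in
  let rec go i =
    if i + m > n then None
    else if String.sub s i m = sub then Some i
    else go (i + 1)
  in
  go 0

let encode raw =
  match find_sub raw "\r\n\r\n" with
  | None -> { header = raw; body = None }
  | Some i ->
      let header = String.sub raw 0 i in
      let body = String.sub raw (i + 4) (String.length raw - i - 4) in
      { header; body = Some body }

let decode { header; body } =
  match body with
  | None -> header
  | Some body -> header ^ "\r\n\r\n" ^ body

(* The isomorphism we want: decoding an encoded message gives it back
   byte for byte. *)
let roundtrips x = String.equal x (decode (encode x))

(* Naive random inputs built from fragments likely to hit edge cases. *)
let random_input () =
  let alphabet = [| "a"; ":"; " "; "\r"; "\n"; "\r\n\r\n" |] in
  let pick _ = alphabet.(Random.int (Array.length alphabet)) in
  String.concat "" (List.init (Random.int 20) pick)

let () =
  Random.init 42;
  for _i = 1 to 10_000 do
    let x = random_input () in
    if not (roundtrips x) then failwith (Printf.sprintf "altered: %S" x)
  done
```

The same loop applies to a fixed corpus: read each stored email instead of calling `random_input` and report any message for which `roundtrips` returns `false`.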
## Carton, POP3 & mbox, some metrics
@ -325,8 +323,9 @@ therefore a succession of patches between objects. What Carton tries to do is
to find the best patch from a window of possibilities and keep the best
candidates for reuse. If we unroll this chain of patches, we find a "base"
object (at `0x000026ab`) that is just compressed with zlib and the latter is
also the content of another GitHub notification email: this checks that our
Rabin's fingerprinting algorithm works very well.
also the content of another GitHub notification email. This shows that Carton
does well at finding the best candidate for the patch: the candidate should have
similar content and, here, it is indeed another GitHub notification.
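
To make the idea of "unrolling" a chain of patches concrete, here is a minimal sketch. It assumes a simplified patch made of copy and insert instructions, in the spirit of Git's delta encoding. The types and the functions `apply` and `unroll` are hypothetical and do not reflect Carton's actual representation or API.

```ocaml
(* Hypothetical, simplified patch: either copy [len] bytes found at [off]
   in the source object, or insert literal bytes. *)
type op = Copy of int * int | Insert of string

(* Rebuild a target object from its source object and a patch. *)
let apply source patch =
  let run = function
    | Copy (off, len) -> String.sub source off len
    | Insert s -> s
  in
  String.concat "" (List.map run patch)

(* Unroll a chain: start from the base object and apply each patch to the
   object produced by the previous one. *)
let unroll base chain = List.fold_left apply base chain
```

For example, with `"hello world"` as the base, the patch `[ Copy (0, 6); Insert "there" ]` yields `"hello there"`, and a second patch `[ Copy (6, 5); Insert "!" ]` applied to that result yields `"there!"`. The closer the source is to the target, the more of the patch is made of cheap `Copy` instructions.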
### Mbox and real emails
@ -365,6 +364,7 @@ $ carton get opam-devel.idx 09bbd28303c8aafafd996b56f9c071a3add7bd92 \
Once again, we see a ratio of 10! Here the object corresponds to the header of
an email. This is compressed with other email headers. This is the situation
where the fields are common to all your emails (`From`, `Subject`, etc.).
Carton successfully patches headers together and email content together.
## Next things
@ -383,3 +383,22 @@ information such as the keywords found in the emails. This will enable us,
among other things, to create an email search engine!
## Conclusion
This milestone in our PTT project was quite long, as we were very interested in
metrics such as compression ratio and software execution speed.
The experience we'd gained with emails (in particular with Mr. MIME) enabled us
to move a little faster, especially in terms of serializing our emails. Our
experience with ocaml-git also enabled us to identify the benefits of the PACK
file for emails.
The development of Miou was also particularly helpful in reaching satisfactory
program execution times, thanks to the ability to parallelize certain
calculations quite easily.
The format is still a little rough and not quite ready for the development of a
keyword-based e-mail indexing system, but it provides a good basis for the rest
of our project.
So, if you like what we're doing and want to help, you can make a donation via
[GitHub][donate-github] or using our [IBAN][donate-iban].