Finish it
parent
086485e904
commit
3ebe667432
1 changed files with 33 additions and 14 deletions
|
@@ -1,7 +1,7 @@
|
|||
---
|
||||
date: 2024-10-29
|
||||
title: Git, Carton, Mmap and emails
|
||||
description: A way to store your emails
|
||||
date: 2025-01-07
|
||||
title: Git, Carton and emails
|
||||
description: A way to store and archive your emails
|
||||
tags:
|
||||
- emails
|
||||
- storage
|
||||
|
@@ -99,7 +99,7 @@ previous ones (as we would have to do for a `*.tar.gz` archive, for example).
|
|||
|
||||
The initial intuition about the emails was right, they do indeed share quite a
|
||||
few elements together and in the end we were able to save ~4000 bytes in our
|
||||
example GitHub notification example.
|
||||
GitHub notification example.
|
||||
|
||||
## Isomorphism, DKIM and ARC
|
||||
|
||||
|
@@ -204,14 +204,12 @@ let encode =
|
|||
|
||||
However, we need to check that the isomorphism is respected. You should be
|
||||
aware that work on [Mr. MIME][mrmime] has already been done on this subject with
|
||||
the [afl][afl] fuzzer: check our assertion `x == decode(encode(x))`. So we went
|
||||
one step further and generated random emails from this fuzzer! This allows me to
|
||||
reintroduce you to the [hamlet][hamlet] project, perhaps the biggest database of
|
||||
valid — but incomprehensible — emails. In this case, Mr. MIME does more than just
|
||||
encode/decode as we want to do here, it also parses email addresses, dates, and
|
||||
so on. Here, the aim is mainly to split our emails. We therefore took the time
|
||||
to check the isomorphism on our `hamlet` database: the result is that among
|
||||
these 1M emails, not one has been altered!
|
||||
the [afl][afl] fuzzer: check our assertion `x == decode(encode(x))`. This
|
||||
ability to check isomorphism using afl has enabled us to use the latter to
|
||||
generate valid random emails. This allows me to reintroduce you to the
|
||||
[hamlet][hamlet] project, perhaps the biggest database of valid — but
|
||||
incomprehensible — emails. So we've checked that our encoder/decoder for
|
||||
“splitting” our emails respects isomorphism on these one million emails.
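
The assertion `x == decode(encode(x))` can be sketched as a small round-trip check. This is only an illustration under stated assumptions: `encode` and `decode` below are hypothetical stand-ins that split an email at its first empty line and glue it back together, not the project's actual encoder, and `find_sub`, `roundtrip` and `failures` are names invented for this example.

```ocaml
(* Hypothetical helper: index of the first occurrence of [sub] in [s]. *)
let find_sub s sub =
  let n = String.length s and m = String.length sub in
  let rec go i =
    if i + m > n then None
    else if String.sub s i m = sub then Some i
    else go (i + 1)
  in
  go 0

(* Hypothetical stand-in for the real encoder: split a raw email into its
   header (including the empty line) and its body. *)
let encode raw =
  match find_sub raw "\r\n\r\n" with
  | Some i ->
      let cut = i + 4 in
      (String.sub raw 0 cut, String.sub raw cut (String.length raw - cut))
  | None -> (raw, "")

(* Hypothetical stand-in for the real decoder: glue both parts back. *)
let decode (header, body) = header ^ body

(* The property itself: a round trip must not lose or alter a single byte. *)
let roundtrip raw = String.equal raw (decode (encode raw))

(* Inputs that break the property; an empty list means the corpus passes. *)
let failures corpus = List.filter (fun raw -> not (roundtrip raw)) corpus
```

Running `failures` over a corpus (for instance emails read from disk) and expecting an empty list is the same kind of check as the one described above, only with a fixed corpus instead of fuzzer-generated inputs.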
|
||||
|
||||
## Carton, POP3 & mbox, some metrics
|
||||
|
||||
|
@@ -325,8 +323,9 @@ therefore a succession of patches between objects. What Carton tries to do is
|
|||
to find the best patch from a window of possibilities and keep the best
|
||||
candidates for reuse. If we unroll this chain of patches, we find a "base"
|
||||
object (at `0x000026ab`) that is just compressed with zlib and the latter is
|
||||
also the content of another GitHub notification email: this checks that our
|
||||
Rabin's fingerprinting algorithm works very well.
|
||||
also the content of another GitHub notification email. This shows that Carton
|
||||
does find a good candidate for the patch: similar content, in this case
|
||||
another GitHub notification.
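
The idea of picking the best patch source from a window of candidates can be sketched as follows. This is a toy model, not Carton's actual algorithm or API: the similarity score (number of distinct lines in common) and the names `score` and `best_candidate` are assumptions made for the illustration.

```ocaml
module S = Set.Make (String)

(* The set of distinct lines of an object. *)
let lines s = S.of_list (String.split_on_char '\n' s)

(* Toy similarity (an assumption, not Carton's metric): the number of
   distinct lines two objects have in common. *)
let score a b = S.cardinal (S.inter (lines a) (lines b))

(* Pick the best-scoring source for [target] among the [window] candidates.
   Returns [None] when the window is empty or when no candidate shares a line
   with [target]; such an object would be stored as a "base". On a tie, the
   earliest candidate in the window wins. *)
let best_candidate ~window target =
  List.fold_left
    (fun best candidate ->
      let s = score candidate target in
      match best with
      | Some (_, s') when s' >= s -> best
      | _ -> if s > 0 then Some (candidate, s) else best)
    None window
```

With a window holding two GitHub notifications and one unrelated email, a third notification would select one of the first two as its source, which matches the behaviour observed in the chain of patches above.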
|
||||
|
||||
### Mbox and real emails
|
||||
|
||||
|
@@ -365,6 +364,7 @@ $ carton get opam-devel.idx 09bbd28303c8aafafd996b56f9c071a3add7bd92 \
|
|||
Once again, we see a ratio of 10! Here the object corresponds to the header of
|
||||
an email. This is compressed with other email headers. This is the situation
|
||||
where the fields are common to all your emails (`From`, `Subject`, etc.).
|
||||
Carton successfully patches headers together and email content together.
|
||||
|
||||
## Next things
|
||||
|
||||
|
@@ -383,3 +383,22 @@ information such as the keywords found in the emails. This will enable us,
|
|||
among other things, to create an email search engine!
|
||||
|
||||
## Conclusion
|
||||
|
||||
This milestone in our PTT project was quite long, as we were very interested in
|
||||
metrics such as compression ratio and software execution speed.
|
||||
|
||||
The experience we'd gained with emails (in particular with Mr. MIME) enabled us
|
||||
to move a little faster, especially in terms of serializing our emails. Our
|
||||
experience with ocaml-git also enabled us to identify the benefits of the PACK
|
||||
file for emails.
|
||||
|
||||
The development of Miou was particularly helpful in reaching a satisfying
|
||||
program execution time, thanks to the ability to parallelize certain
|
||||
calculations quite easily.
|
||||
|
||||
The format is still a little rough and not quite ready for the development of a
|
||||
keyword-based e-mail indexing system, but it provides a good basis for the rest
|
||||
of our project.
|
||||
|
||||
So, if you like what we're doing and want to help, you can make a donation via
|
||||
[GitHub][donate-github] or using our [IBAN][donate-iban].
|
||||
|
|