forked from robur/blog.robur.coop
Finish it
parent 086485e904
commit 3ebe667432
1 changed files with 33 additions and 14 deletions
@@ -1,7 +1,7 @@
 ---
-date: 2024-10-29
-title: Git, Carton, Mmap and emails
-description: A way to store your emails
+date: 2025-01-07
+title: Git, Carton and emails
+description: A way to store and archive your emails
 tags:
 - emails
 - storage
@@ -99,7 +99,7 @@ previous ones (as we would have to do for a `*.tar.gz` archive, for example).
 
 The initial intuition about the emails was right, they do indeed share quite a
 few elements together and in the end we were able to save ~4000 bytes in our
-example GitHub notification example.
+GitHub notification example.
 
 ## Isomorphism, DKIM and ARC
 
@@ -204,14 +204,12 @@ let encode =
 
 However, we need to check that the isomorphism is respected. You should be
 aware that work on [Mr. MIME][mrmime] has already been done on this subject with
-the [afl][afl] fuzzer: check our assertion `x == decode(encode(x))`. So we went
-one step further and generated random emails from this fuzzer! This allows me to
-reintroduce you to the [hamlet][hamlet] project, perhaps the biggest database of
-valid — but incomprehensible — emails. In this case, Mr. MIME does more than just
-encode/decode as we want to do here, it also parses email addresses, dates, and
-so on. Here, the aim is mainly to split our emails. We therefore took the time
-to check the isomorphism on our `hamlet` database: the result is that among
-these 1M emails, not one has been altered!
+the [afl][afl] fuzzer: check our assertion `x == decode(encode(x))`. This
+ability to check isomorphism using afl has enabled us to use the latter to
+generate valid random emails. This allows me to reintroduce you to the
+[hamlet][hamlet] project, perhaps the biggest database of valid — but
+incomprehensible — emails. So we've checked that our encoder/decoder for
+“splitting” our emails respects isomorphism on these million emails.
 
 ## Carton, POP3 & mbox, some metrics
 
@@ -325,8 +323,9 @@ therefore a succession of patches between objects. What Carton tries to do is
 to find the best patch from a window of possibilities and keep the best
 candidates for reuse. If we unroll this chain of patches, we find a "base"
 object (at `0x000026ab`) that is just compressed with zlib and the latter is
-also the content of another GitHub notification email: this checks that our
-Rabin's fingerprinting algorithm works very well.
+also the content of another GitHub notification email. This shows that Carton
+is well on its way to finding the best candidate for the patch, which should be
+similar content: in this case, another GitHub notification.
 
 ### Mbox and real emails
 
@@ -365,6 +364,7 @@ $ carton get opam-devel.idx 09bbd28303c8aafafd996b56f9c071a3add7bd92 \
 Once again, we see a ratio of 10! Here the object corresponds to the header of
 an email. This is compressed with other email headers. This is the situation
 where the fields are common to all your emails (`From`, `Subject`, etc.).
+Carton successfully patches headers together and email content together.
 
 ## Next things
 
@@ -383,3 +383,22 @@ information such as the keywords found in the emails. This will enable us,
 among other things, to create an email search engine!
 
 ## Conclusion
+
+This milestone in our PTT project was quite long, as we were very interested in
+metrics such as compression ratio and software execution speed.
+
+The experience we'd gained with emails (in particular with Mr. MIME) enabled us
+to move a little faster, especially in terms of serializing our emails. Our
+experience with ocaml-git also enabled us to identify the benefits of the PACK
+file for emails.
+
+But the development of Miou was particularly helpful in satisfying us in terms
+of program execution time, thanks to the ability to parallelize certain
+calculations quite easily.
+
+The format is still a little rough and not quite ready for the development of a
+keyword-based e-mail indexing system, but it provides a good basis for the rest
+of our project.
+
+So, if you like what we're doing and want to help, you can make a donation via
+[GitHub][donate-github] or using our [IBAN][donate-iban].