From 3ebe667432143064c4c5290a76d596c6bbd34a41 Mon Sep 17 00:00:00 2001
From: Calascibetta Romain
Date: Mon, 13 Jan 2025 12:55:00 +0100
Subject: [PATCH] Finish it

---
 articles/2025-01-07-carton-and-cachet.md | 47 +++++++++++++++++-------
 1 file changed, 33 insertions(+), 14 deletions(-)

diff --git a/articles/2025-01-07-carton-and-cachet.md b/articles/2025-01-07-carton-and-cachet.md
index 9b86086..cf8c541 100644
--- a/articles/2025-01-07-carton-and-cachet.md
+++ b/articles/2025-01-07-carton-and-cachet.md
@@ -1,7 +1,7 @@
 ---
-date: 2024-10-29
-title: Git, Carton, Mmap and emails
-description: A way to store your emails
+date: 2025-01-07
+title: Git, Carton and emails
+description: A way to store and archive your emails
 tags:
   - emails
   - storage
@@ -99,7 +99,7 @@ previous ones (as we would have to do for a `*.tar.gz` archive, for example).
 
 The initial intuition about the emails was right, they do indeed share quite a
 few elements together and in the end we were able to save ~4000 bytes in our
-example GitHub notification example.
+GitHub notification example.
 
 ## Isomorphism, DKIM and ARC
 
@@ -204,14 +204,12 @@ let encode =
 
 However, we need to check that the isomorphism is respected. You should be
 aware that work on [Mr. MIME][mrmime] has already been done on this subject with
-the [afl][afl] fuzzer: check our assertion `x == decode(encode(x))`. So we went
-one step further and generated random emails from this fuzzer! This allows me to
-reintroduce you to the [hamlet][hamlet] project, perhaps the biggest database of
-valid — but incomprehensible — emails. In this case, Mr. MIME does more than just
-encode/decode as we want to do here, it also parses email addresses, dates, and
-so on. Here, the aim is mainly to split our emails. We therefore took the time
-to check the isomorphism on our `hamlet` database: the result is that among
-these 1M emails, not one has been altered!
+the [afl][afl] fuzzer: check our assertion `x == decode(encode(x))`. This
+ability to check isomorphism using afl has enabled us to use the latter to
+generate valid random emails. This allows me to reintroduce you to the
+[hamlet][hamlet] project, perhaps the biggest database of valid — but
+incomprehensible — emails. So we've checked that our encoder/decoder for
+“splitting” our emails respects isomorphism on these million emails.
 
 ## Carton, POP3 & mbox, some metrics
 
@@ -325,8 +323,9 @@ therefore a succession of patches between objects. What Carton tries to do is
 to find the best patch from a window of possibilities and keep the best
 candidates for reuse. If we unroll this chain of patches, we find a "base"
 object (at `0x000026ab`) that is just compressed with zlib and the latter is
-also the content of another GitHub notification email: this checks that our
-Rabin's fingerprinting algorithm works very well.
+also the content of another GitHub notification email. This shows that Carton
+is well on its way to finding the best candidate for the patch, which should be
+similar content, moreover, another GitHub notification.
 
 ### Mbox and real emails
 
@@ -365,6 +364,7 @@ $ carton get opam-devel.idx 09bbd28303c8aafafd996b56f9c071a3add7bd92 \
 Once again, we see a ratio of 10! Here the object corresponds to the header of
 an email. This is compressed with other email headers. This is the situation
 where the fields are common to all your emails (`From`, `Subject`, etc.).
+Carton successfully patches headers together and email content together.
 
 ## Next things
 
@@ -383,3 +383,22 @@ information such as the keywords found in the emails. This will enable us, among
 other things, to create an email search engine!
 
 ## Conclusion
+
+This milestone in our PTT project was quite long, as we were very interested in
+metrics such as compression ratio and software execution speed.
+
+The experience we'd gained with emails (in particular with Mr. MIME) enabled us
+to move a little faster, especially in terms of serializing our emails. Our
+experience with ocaml-git also enabled us to identify the benefits of the PACK
+file for emails.
+
+But the development of Miou was particularly helpful in satisfying us in terms
+of program execution time, thanks to the ability to parallelize certain
+calculations quite easily.
+
+The format is still a little rough and not quite ready for the development of a
+keyword-based e-mail indexing system, but it provides a good basis for the rest
+of our project.
+
+So, if you like what we're doing and want to help, you can make a donation via
+[GitHub][donate-github] or using our [IBAN][donate-iban].