Git to/of_octets and startup #20

Open

opened 2024-11-13 12:35:00 +00:00 by hannes · 5 comments

hannes commented

2024-11-13 12:35:00 +00:00

Owner

As remarked by @dinosaure in #18 (https://git.robur.coop/robur/opam-mirror/issues/18#issuecomment-418), we don't need to check the PACK checksums.

But I wonder whether the "we store the git repo to disk" approach is the path forward, given that all we need for operation is the tarball.

So, an alternative approach would be to:

- preserve the index.tar.gz (together with the repo file) to disk once built
- at startup read that file, and we can already provide http service
- we also update the index.tar.gz by starting a git clone

If we go down this path, I guess our time from boot to service would be much shorter, since the tarball is much smaller than the git repository.
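The startup order proposed above can be sketched as follows. This is a minimal model using only the OCaml standard library, not code from opam-mirror: `state`, `step`, `startup`, `read_cached` and `clone` are all hypothetical names, with `read_cached` standing in for reading the dumped index.tar.gz from disk and `clone` for a fresh git clone that rebuilds it.

```ocaml
(* Hypothetical model of the proposed startup order. None of these names
   exist in opam-mirror; they only illustrate the sequence of steps. *)
type state = {
  mutable index : string option; (* tarball currently served over HTTP *)
  mutable log : string list;     (* steps performed, newest first *)
}

let step st msg = st.log <- msg :: st.log

let startup ~read_cached ~clone =
  let st = { index = None; log = [] } in
  (* 1. Restore the cached tarball, so HTTP service can start right away. *)
  (match read_cached () with
   | Some tarball ->
     st.index <- Some tarball;
     step st "serving cached index"
   | None -> step st "no cached index, waiting for clone");
  (* 2. Only then clone and replace the served tarball with a fresh one. *)
  st.index <- Some (clone ());
  step st "serving fresh index";
  (st.index, List.rev st.log)
```

In a real unikernel the clone would presumably run concurrently with the HTTP service; the sketch runs it sequentially only to make the ordering visible.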

hannes commented

2024-11-13 12:45:09 +00:00

Author

Owner

I guess the question I have in mind is whether a git_kv pull from scratch (without an existing git_kv) is bandwidth-wise much more expensive than if we have a repository already.

An alternative path would be:

- store index.tar.gz as well as the opam repository git archive
- on startup, restore index.tar.gz first, and then read and restore the git archive to do a pull

What do you think?

hannes commented

2024-11-14 15:40:58 +00:00

Author

Owner

what is still unclear to me is whether we should only dump the index.tar.gz or also the git repository?

and if the answer is yes to the latter, should we store it uncompressed? the issue is that we'd need a bigger git partition then.

hannes commented

2024-11-20 10:41:35 +00:00

Author

Owner

From further discussion, it became clear to me:

- revise partitions: add space for index.tar.gz etc.
- also increase the git dump partition and play around with the compression level

dinosaure commented

2024-11-20 10:48:23 +00:00

Owner

> I guess the question I have in mind is whether a git_kv pull from scratch (without an existing git_kv) is bandwidth-wise much more expensive than if we have a repository already.

A `Git_kv.pull` from scratch probably costs the same bandwidth or less, because we ask only for the last commit, see:

   ( match t.head with
   | None -> Lwt.return (`Depth 1)
   | Some head ->
     Store.read_exn t.store head >>= fun value ->
     let[@warning "-8"] Git.Value.Commit commit = value in
     (* TODO(dinosaure): we should handle correctly [tz] and re-calculate the timestamp. *)
     let { Git.User.date= (timestamp, _tz); _ } = Store.Value.Commit.author commit in
     Lwt.return (`Timestamp timestamp) ) >>= fun deepen ->
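
The decision made by the snippet above can be restated as a small standalone sketch, without Lwt or a Git store. The names `commit`, `deepen` and `deepen_of_head` are illustrative and not part of git-kv's API, and representing the author date as `int64` seconds is an assumption.

```ocaml
(* Simplified stand-in for the quoted decision: no Lwt, no Git store.
   [commit] and [deepen_of_head] are illustrative names, not git-kv's API. *)
type commit = { author_date : int64 } (* assumed: seconds since the epoch *)

type deepen = [ `Depth of int | `Timestamp of int64 ]

let deepen_of_head : commit option -> deepen = function
  (* No local head: ask only for the last commit, like --depth=1. *)
  | None -> `Depth 1
  (* Existing head: deepen by the author date of the commit we have. *)
  | Some c -> `Timestamp c.author_date
```

For example, `deepen_of_head None` evaluates to `` `Depth 1 ``, which is the "from scratch" case discussed in this issue.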

Source: https://git.robur.coop/robur/git-kv/src/commit/bc190bd0547566996d11d6be3de86fa794f82fa8/src/git_kv.ml#L73-L80

dinosaure commented

2024-11-20 10:54:54 +00:00

Owner

If you do a `Git_kv.pull` from an existing `git-kv`, we must take into account the diffs between your current snapshot and what we have on `opam-repository`. So we ask for a few commits to be able to generate the diff. But from scratch, we only ask for the last commit: it's really like a `git clone --depth=1 ...`.

This may be costly because the server and the unikernel have nothing in common (the unikernel starts with no objects), so we can't find _common ancestors_ that would let the server produce a "thin" PACK file (lighter, but it requires the unikernel to already have Git objects).

Intuitively, this should have little impact on the time needed to check the PACK file, so I would not expect a big speed-up.
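
To illustrate the point about common ancestors, here is a rough model of what the server has to send in each case. It is a deliberate simplification and an assumption on my side: objects are plain string identifiers, and the server sends whatever is wanted but not already held. Real thin PACK files go further by delta-encoding against objects the receiver already has. `Ids`, `objects_to_send` and the object names are made up for the example.

```ocaml
module Ids = Set.Make (String)

(* [wanted]: objects reachable from the commit we ask for.
   [have]: objects already stored on the unikernel side.
   The server only needs to send what is wanted and not yet held. *)
let objects_to_send ~wanted ~have = Ids.diff wanted have

let wanted = Ids.of_list [ "commit2"; "tree2"; "blob-a"; "blob-b" ]

(* From scratch: nothing in common, so everything is sent. *)
let from_scratch = objects_to_send ~wanted ~have:Ids.empty

(* Existing repository: shared objects (here "blob-a") are skipped. *)
let incremental =
  objects_to_send ~wanted ~have:(Ids.of_list [ "commit1"; "tree1"; "blob-a" ])
```

With these made-up sets, `from_scratch` contains all four objects while `incremental` contains three.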
