Git to/of_octets and startup #20

Open
opened 2024-11-13 12:35:00 +00:00 by hannes · 5 comments
Owner

As remarked by @dinosaure in #18 (https://git.robur.coop/robur/opam-mirror/issues/18#issuecomment-418), we don't need to check the PACK checksums.

But I wonder whether storing the git repo on disk is the path forward, given that all we need for operation is the tarball.

So, an alternative approach would be to:

  • persist the index.tar.gz (together with the repo file) on disk once it is built
  • at startup, read that file, so we can already provide HTTP service
  • in parallel, update the index.tar.gz by starting a git clone

If we go this path, I guess our time from boot to service would be much shorter, since the tarball is much smaller than the git repository.
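The startup order proposed above can be sketched roughly as follows. This is a minimal, sequential sketch using only the OCaml standard library; `state`, `restore_from_disk`, `refresh`, `startup`, and the `read_dump`, `clone_and_build` and `write_dump` callbacks are hypothetical names for illustration, not opam-mirror's actual API. In the unikernel the refresh step would presumably run concurrently with the HTTP service rather than after it.

```ocaml
(* Hypothetical sketch: all names and types here are illustrative. *)

type state = {
  tarball : string option;  (* contents of index.tar.gz currently served *)
  log : string list;        (* startup events, most recent first *)
}

(* Step 1: restore the tarball saved by a previous run, if any.
   After this step the HTTP service could already answer requests. *)
let restore_from_disk ~read_dump state =
  match read_dump () with
  | Some data ->
    { tarball = Some data; log = "serving cached tarball" :: state.log }
  | None -> { state with log = "no cached tarball" :: state.log }

(* Step 2: clone upstream, rebuild the tarball, and persist it so the
   next boot can skip straight to serving. *)
let refresh ~clone_and_build ~write_dump state =
  let data = clone_and_build () in
  write_dump data;
  { tarball = Some data; log = "serving fresh tarball" :: state.log }

let startup ~read_dump ~clone_and_build ~write_dump =
  { tarball = None; log = [] }
  |> restore_from_disk ~read_dump
  |> refresh ~clone_and_build ~write_dump

(* Usage with an in-memory stand-in for the disk partition. *)
let () =
  let disk = ref (Some "old-index") in
  let final =
    startup
      ~read_dump:(fun () -> !disk)
      ~clone_and_build:(fun () -> "new-index")
      ~write_dump:(fun data -> disk := Some data)
  in
  List.iter print_endline (List.rev final.log)
```

The point of the ordering is that the cached tarball is available before any network traffic happens; the git clone only determines how quickly the served content becomes fresh.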

Author
Owner

I guess the question I have in mind is whether a git_kv pull from scratch (without an existing git_kv) is bandwidth-wise much more expensive than if we have a repository already.

An alternative path would be:

  • store index.tar.gz as well as opam repository git archive
  • on startup, restore index.tar.gz first, and then read and restore the git archive to do a pull

What do you think?

Author
Owner

What is still unclear to me is whether we should dump only the index.tar.gz or also the git repository.

And if we also dump the git repository, should we store it uncompressed? The issue is that we'd need a bigger git partition then.

Author
Owner

From further discussion, it became clear to me:

  • revise partitions: add space for index.tar.gz etc.
  • also increase the git dump partition and play around with the compression level
Owner

I guess the question I have in mind is whether a git_kv pull from scratch (without an existing git_kv) is bandwidth-wise much more expensive than if we have a repository already.

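A Git_kv.pull from scratch probably uses the same bandwidth or less, because we ask only for the last commit, see https://git.robur.coop/robur/git-kv/src/commit/bc190bd0547566996d11d6be3de86fa794f82fa8/src/git_kv.ml#L73-L80: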

( match t.head with
  | None -> Lwt.return (`Depth 1)
  | Some head ->
    Store.read_exn t.store head >>= fun value ->
    let[@warning "-8"] Git.Value.Commit commit = value in
    (* TODO(dinosaure): we should handle correctly [tz] and re-calculate the timestamp. *)
    let { Git.User.date= (timestamp, _tz); _ } = Store.Value.Commit.author commit in
    Lwt.return (`Timestamp timestamp) ) >>= fun deepen ->
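The excerpt above depends on the surrounding git-kv code and Lwt, so it does not run on its own. The decision it makes can be isolated in a small standalone sketch; `deepen_of_head` and `describe` are hypothetical helpers, and representing the author date as a bare `int64` is an assumption made for illustration.

```ocaml
(* Simplified stand-in for the choice made in the excerpt above.
   Not git-kv code: the names and the int64 timestamp are assumptions. *)
type deepen = [ `Depth of int | `Timestamp of int64 ]

(* [head_author_date] is the author timestamp of the local HEAD commit,
   or [None] when the local store is empty (a pull from scratch). *)
let deepen_of_head (head_author_date : int64 option) : deepen =
  match head_author_date with
  | None -> `Depth 1
  | Some timestamp -> `Timestamp timestamp

(* Human-readable description of what would be requested. *)
let describe = function
  | `Depth n -> Printf.sprintf "fetch the last %d commit(s)" n
  | `Timestamp ts -> Printf.sprintf "fetch commits newer than %Ld" ts

let () =
  print_endline (describe (deepen_of_head None));
  print_endline (describe (deepen_of_head (Some 1700000000L)))
```

With an empty store the request is bounded to a single commit, which is why a pull from scratch does not fetch the whole history.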

Owner

If you do a Git_kv.pull from an existing git-kv, we must take into account the diff between your current snapshot and what is on opam-repository, so we ask for a few commits to be able to generate that diff. But from scratch, we only ask for the last commit: it's really like a git clone --depth=1 ....

This may be costly because the server and the unikernel have nothing in common (the unikernel has no objects at all), so we can't find common ancestors that could produce a "thin" PACK file (lighter, but requiring the unikernel to already have some Git objects).

Intuitively, this should have only a small impact on the time needed to check the PACK file, so I would not expect a big speed-up.
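To illustrate why the lack of common objects matters, here is a toy model of a thin PACK. It is not the Git protocol and not ocaml-git code: the `obj` record, `pack_size`, and all byte counts are invented for the example. The only idea it keeps is that an object can be sent as a small delta when its base is either in the PACK itself or already held by the receiver.

```ocaml
module S = Set.Make (String)

(* Toy object description; every field and number is hypothetical. *)
type obj = {
  id : string;
  full_size : int;                (* bytes when sent whole *)
  delta : (string * int) option;  (* (base id, bytes when sent as a delta) *)
}

(* Bytes needed to send [objs] to a receiver that already holds [have].
   A delta is usable when its base is in the pack or, for a thin pack,
   already on the receiver. Otherwise the object is sent whole. *)
let pack_size ~have objs =
  let in_pack = S.of_list (List.map (fun o -> o.id) objs) in
  List.fold_left
    (fun acc o ->
      match o.delta with
      | Some (base, size) when S.mem base in_pack || S.mem base have ->
        acc + size
      | _ -> acc + o.full_size)
    0 objs

let () =
  let objs =
    [ { id = "tree2"; full_size = 1000; delta = Some ("tree1", 50) };
      { id = "commit2"; full_size = 200; delta = None } ]
  in
  (* From scratch: the receiver has nothing, so tree2 is sent whole. *)
  Printf.printf "from scratch: %d bytes\n" (pack_size ~have:S.empty objs);
  (* Existing store: tree1 is known, so tree2 can be sent as a delta. *)
  Printf.printf "with tree1:   %d bytes\n"
    (pack_size ~have:(S.singleton "tree1") objs)
```

In this model the from-scratch transfer is larger per object, while the shallow request keeps the number of objects small; which effect dominates for opam-repository would need to be measured.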

Reference: robur/opam-mirror#20