revise startup, address urls pointing to same sha256 and support mirrors (upstream and in opam file) #24

Merged
hannes merged 7 commits from startup into main 2024-11-20 10:38:22 +00:00
Owner

startup as proposed in #18:

  • recover git (from disk or download)
  • make index.tar.gz
  • start web service
  • check disk (unless skip-verify-sha256)
  • dump git state
  • start downloads
  • enable update job, and hook

also address #5 #13 #21 (url stuff, deduplicating based on sha256, adding support for mirrors and upstream-caches)

startup as proposed in #18: - recover git (from disk or download) - make index.tar.gz - start web service - check disk (unless skip-verify-sha256) - dump git state - start downloads - enable update job, and hook also address #5 #13 #21 (url stuff, deduplicating based on sha256, adding support for mirrors and upstream-caches)
hannes added 1 commit 2024-11-14 15:38:27 +00:00
- recover git (from disk or download)
- make index.tar.gz
- start web service
- check disk (unless skip-verify-sha256)
- dump git state
- start downloads
- enable update job, and hook
hannes added 1 commit 2024-11-14 15:58:49 +00:00
Author
Owner

I added a second commit which makes the contents available as soon as they're checked. And once the check has been completed, the (mutable) field (checked) is set to None.

I added a second commit which makes the contents available as soon as they're checked. And once the check has been completed, the (mutable) field (`checked`) is set to None.
hannes added 2 commits 2024-11-14 18:25:38 +00:00
This is done by introducing a set of alternative download locations.
this reduces 19139 urls down to 18563 urls
hannes changed title from revise startup (as proposed in #18): to revise startup, address urls pointing to same sha256 and support mirrors (upstream and in opam file) 2024-11-14 18:31:48 +00:00
hannes reviewed 2024-11-14 18:33:22 +00:00
@ -134,1 +142,4 @@
let sha256s = Hashtbl.create 13
let empty () = Hashtbl.clear sha256s
Author
Owner

this code, using a global hashtable, is ugly and should be revised.

this code, using a global hashtable, is ugly and should be revised.
hannes reviewed 2024-11-14 18:34:37 +00:00
@ -152,2 +193,3 @@
then
Some (HM.union (fun _h v _v' -> Some v) csums csums')
Some (HM.union (fun _h v _v' -> Some v) csums csums',
SSet.union mirrors mirrors')
Author
Owner

not doing upstream csums here since we did that for the previous entry already. downside is if there's an artifact that has some hashes in some file, and more hashes in another file, we won't get the other hashes.

not doing `upstream csums` here since we did that for the previous entry already. downside is if there's an artifact that has some hashes in some file, and more hashes in another file, we won't get the other hashes.
hannes added 1 commit 2024-11-14 18:39:45 +00:00
reynir reviewed 2024-11-19 12:11:05 +00:00
reynir left a comment
Owner

This looks interesting. I will do a closer review later.

I like that you use sha256 for comparing duplicate URLs. We could possibly also check the sha512 values (but for simplicity maybe let's not). For md5 I think the risk of collision is too high.

I think it is interesting with skip-download. With #25 an an option to set "archive-mirrors:" we could split the opam repository functionality from the cache functionality. I think that could be interesting for some deployments while for others it's possibly nicer to only have to deploy a single unikernel.

This looks interesting. I will do a closer review later. I like that you use sha256 for comparing duplicate URLs. We could possibly also check the sha512 values (but for simplicity maybe let's not). For md5 I think the risk of collision is too high. I think it is interesting with `skip-download`. With #25 an an option to set "archive-mirrors:" we could split the opam repository functionality from the cache functionality. I think that could be interesting for some deployments while for others it's possibly nicer to only have to deploy a single unikernel.
@ -44,0 +50,4 @@
| Some { pelem = Variable (_, { pelem = String url ; _ }) ; _ } -> [ url ]
| Some { pelem = Variable (_, { pelem = List { pelem = urls ; _ } ; _ }) } ->
List.fold_left (fun acc -> function
| { pelem = String url ; _ } -> url :: acc
Owner

I don't think the order is significant, but it's worth noting we reverse the order here.

I don't think the order is significant, but it's worth noting we reverse the order here.
@ -80,3 +99,3 @@
in
(match url, csum with
| Ok url, Ok csum -> Ok (url, csum)
| Ok url, Ok csum -> Ok (url, csum, mirrors)
Owner

The opam source code does url :: mirrors FWIW. Just an observation; not a request for change.

The opam source code does `url :: mirrors` FWIW. Just an observation; not a request for change.
Author
Owner

We keep the url in the map as key, and just carry around a set of other urls for the same artifact.

Not sure whether it is worth to have the url as key in the map anymore, though we discovered some opam packages that used the same archive with different checksums...

We keep the url in the map as key, and just carry around a set of other urls for the same artifact. Not sure whether it is worth to have the url as key in the map anymore, though we discovered some opam packages that used the same archive with different checksums...
@ -13,0 +14,4 @@
Mirage_runtime.register_arg Arg.(value & flag doc)
let upstream_caches =
let doc = Arg.info ~doc:"Upstream caches (e.g. https://opam.ocaml.org/cache)" ["upstream-cache"] in
Owner

"Upstream caches to use internally (e.g. https://opam.ocaml.org/cache). This makes opam-mirror try the cache(s) before going to the source and mirrors. This does not change the published "archive-mirrors:" value in the /repo endpoint."

"Upstream caches to use internally (e.g. https://opam.ocaml.org/cache). This makes opam-mirror try the cache(s) before going to the source and mirrors. This does *not* change the published "archive-mirrors:" value in the /repo endpoint."
Author
Owner

actually, if I'm not misguided, we first go to the source, and only thereafter to mirror(s). Is this a good semantics?

actually, if I'm not misguided, we first go to the source, and only thereafter to mirror(s). Is this a good semantics?
Author
Owner

And please feel free to push the documentation updates directly to the branch.

And please feel free to push the documentation updates directly to the branch.
Owner

Hm it turns out the exact semantics is a bit complicated to explain. You are right that we first to go the source! Then we go to the mirrors or the cache depending on how the URLs sort as they both go into a string set.

I will give this some thought and suggest a change.

Hm it turns out the exact semantics is a bit complicated to explain. You are right that we first to go the source! Then we go to the mirrors or the cache **depending on how the URLs sort** as they both go into a string set. I will give this some thought and suggest a change.
Owner

Latest commit splits mirror URLs from upstream cache URLs so the mirror URLs are tried first before the upstream caches, and adds a longer description of the option.

Latest commit splits mirror URLs from upstream cache URLs so the mirror URLs are tried first before the upstream caches, and adds a longer description of the option.
hannes marked this conversation as resolved
@ -139,2 +151,3 @@
List.iter (fun (`Msg msg) -> add_parse_error path msg) errs;
List.fold_left (fun acc (url, csums) ->
let upstream hm =
List.fold_left (fun set (hash, hash_value) ->
Owner

Maybe worth using HM.fold

Maybe worth using `HM.fold`
Author
Owner

Indeed, currently it is a List.fold_left of "HM.bindings hm". So yes, this could be nicer code. :)

Indeed, currently it is a List.fold_left of "HM.bindings hm". So yes, this could be nicer code. :)
hannes marked this conversation as resolved
reynir added 1 commit 2024-11-19 15:04:57 +00:00
Also expand on the semantics of --upstream-cache.
reynir added 1 commit 2024-11-19 15:26:27 +00:00
hannes merged commit 2edc311a33 into main 2024-11-20 10:38:22 +00:00
hannes deleted branch startup 2024-11-20 10:38:24 +00:00
Sign in to join this conversation.
No reviewers
No labels
No milestone
No project
No assignees
2 participants
Notifications
Due date
The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Dependencies

No dependencies set.

Reference: robur/opam-mirror#24
No description provided.