reproducible tarball index.tar.gz? same as on opam.ocaml.org? #11

Open
opened 2024-10-18 09:22:29 +00:00 by hannes · 3 comments
Owner

we should strive for having a reproducible tarball. we may need to adjust whatever opam.ocaml.org uses to get a reproducible output. if I remember correctly, https://github.com/ocaml-opam/opam2web/ is used for generating the opam.ocaml.org tarball (and does a lot of other things).

Author
Owner

The command used is actually in opam (make_index_tar_gz in opamHTTP.ml, used by opam admin):

  OpamFilename.in_dir repo_root (fun () ->
    let to_include = [ "version"; "packages"; "repo" ] in
    match List.filter Sys.file_exists to_include with
    | [] -> ()
    | d  -> OpamSystem.command ("tar" :: "czhf" :: "index.tar.gz" :: "--exclude=.git*" :: d)
  )

so, [c]reate, [z]ip (gzip), [h] (dereference, i.e. follow symlinks; a synonym for [L] in bsdtar), [f]ile. hard to reproduce due to the timestamps of the files (or are files pulled via git timestamped with the git commit they appear(ed) in?)

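On the timestamp question: git does not store file modification times, so a checkout gets the time the file was written to disk rather than the commit time. A minimal sketch to check this locally, assuming git and GNU coreutils (`stat -c %Y`) are available; the directory name, file content and commit date are made up for the demonstration:

```shell
#!/bin/sh
# Sketch: compare a commit's timestamp with the mtime of the checked-out file.
# Assumes git and GNU stat; all names and dates below are illustrative.
set -e

work=$(mktemp -d)
trap 'rm -rf "$work"' EXIT
cd "$work"

git init -q demo
cd demo
echo 'opam-version: "2.0"' > repo

# Commit with a fixed date in the past (2020-01-01 00:00:00 UTC).
fixed='2020-01-01T00:00:00+00:00'
git add repo
GIT_AUTHOR_DATE=$fixed GIT_COMMITTER_DATE=$fixed \
  git -c user.name=demo -c user.email=demo@example.org commit -q -m 'add repo'

# Remove the file and let git write it again, as a fresh clone would.
rm repo
git checkout -q -- repo

commit_time=$(git log -1 --format=%ct -- repo)  # committer date, epoch seconds
file_mtime=$(stat -c %Y repo)                   # on-disk mtime, epoch seconds

echo "commit time: $commit_time"
echo "file mtime:  $file_mtime"
```

The two values differ: the mtime is the moment of the checkout, which is why the archive timestamps vary between builds unless tar is told otherwise.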
Owner

I downloaded https://opam.ocaml.org/index.tar.gz and I found all timestamps were the same except for repo:

$ tar -tvf index.tar.gz  | awk '{ print $4,$5 }' | sort | uniq -c
  69832 2024-10-18 19:24
      1 2024-10-18 19:38

My suspicion is that the timestamp is when opam2web was built, and repo was built about 15 minutes later for some reason.

While writing this I noticed that the listing above only shows minutes, whereas the tar format stores seconds:

$ tar --full-time -tvf index.tar.gz  | awk '{ print $4, $5 }' | sort | uniq -c
   1170 2024-10-18 19:24:52
  10151 2024-10-18 19:24:53
  11069 2024-10-18 19:24:54
  11850 2024-10-18 19:24:55
  11794 2024-10-18 19:24:56
  11633 2024-10-18 19:24:57
  12165 2024-10-18 19:24:58
      1 2024-10-18 19:38:32

Perhaps we can ask the maintainers of opam to set the mtime to the timestamp of the commit? With GNU tar that would be tar czhf index.tar.gz --mtime $COMMIT_TIMESTAMP --exclude=.git* .... Then I read man bsdtar and there is no such option /o\
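A small sketch of the `--mtime` idea, assuming GNU tar and GNU touch. The directory layout and the value of `COMMIT_TIMESTAMP` are placeholders; in a real checkout the value could come from something like `git log -1 --format=%ct`:

```shell
#!/bin/sh
# Sketch: clamp every archive member to one timestamp with GNU tar's --mtime.
set -e

work=$(mktemp -d)
trap 'rm -rf "$work"' EXIT
cd "$work"

# A tiny stand-in for the repository layout; names are illustrative only.
mkdir -p packages/foo/foo.1.0
echo 'opam-version: "2.0"' > packages/foo/foo.1.0/opam
echo 'opam-version: "2.0"' > repo
touch -d '2001-02-03 04:05:06 UTC' repo   # give one file a different on-disk mtime

# Hypothetical value (2024-10-18 00:00:00 UTC, in epoch seconds).
COMMIT_TIMESTAMP=1729209600

tar czhf index.tar.gz --mtime="@$COMMIT_TIMESTAMP" --exclude='.git*' packages repo

# List the distinct timestamps stored in the archive, at second resolution.
stamps=$(TZ=UTC tar --full-time -tvf index.tar.gz | awk '{ print $4, $5 }' | sort -u)
echo "$stamps"
```

With this, the listing collapses to a single timestamp regardless of when the files were written to disk. It only addresses mtimes; member order, ownership and the gzip header can still differ between builds.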
Author
Owner

I was looking into this issue a bit more as well. Thanks for your investigations.

There are some paths forward for opam:

- https://www.gnu.org/software/tar/manual/html_node/Reproducibility.html
  - Basically: have a mtime that corresponds to the latest git commit, and include that
  - set owner/group and permissions uniformly
  - disable headers not needed
  - sort by name using a C locale
- Use what topkg pioneered https://erratique.ch/software/topkg/doc/Topkg_care/Archive/index.html (130 lines of code)
  - use only gzip as external command

I prefer the second option, since it is more portable. I plan to go ahead and PR that to the opam developers, who seem to be keen to have index.tar.gz cached via last-modified / etag (see https://github.com/ocaml/opam/issues/5553).
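For comparison, the first option (the GNU tar reproducibility recipe) could look roughly like the sketch below. It assumes GNU tar 1.28 or later (for `--sort=name`) and gzip; the package names and `SOURCE_EPOCH` are invented for the demonstration, and this is not what opam currently does:

```shell
#!/bin/sh
# Sketch: build the archive twice with different on-disk mtimes and compare.
set -e

work=$(mktemp -d)
trap 'rm -rf "$work"' EXIT
cd "$work"

# Illustrative stand-in for the repository contents.
mkdir -p packages/foo/foo.1.0 packages/bar/bar.0.1
echo 'opam-version: "2.0"' > packages/foo/foo.1.0/opam
echo 'opam-version: "2.0"' > packages/bar/bar.0.1/opam
echo 'opam-version: "2.0"' > repo

# Hypothetical commit time (epoch seconds); would come from the git history.
SOURCE_EPOCH=1729209600

build() {
  # Sorted member names, fixed mtime, uniform owner/group/permissions,
  # no atime/ctime pax headers, and no name or timestamp in the gzip header.
  LC_ALL=C tar -chf - \
    --sort=name --format=posix \
    --pax-option=exthdr.name=%d/PaxHeaders/%f \
    --pax-option=delete=atime,delete=ctime \
    --mtime="@$SOURCE_EPOCH" \
    --numeric-owner --owner=0 --group=0 \
    --mode=go+u,go-w \
    --exclude='.git*' \
    packages repo | gzip --no-name > "$1"
}

build first.tar.gz
# Change every on-disk mtime, as a fresh checkout would.
find packages repo -exec touch -d '2001-02-03 04:05:06 UTC' {} +
build second.tar.gz

if cmp -s first.tar.gz second.tar.gz; then result=identical; else result=different; fi
echo "$result"
```

Most of these flags are GNU-specific, which is the portability concern behind preferring the second option.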
Reference: robur/opam-mirror#11