file size and parallel downloads #10

Closed
opened 2024-10-18 09:20:34 +00:00 by hannes · 4 comments
Owner

If I remember correctly, we do parallel downloads and (try to) reserve the needed bytes in the tar file by writing a proper header there, and then do stream writes.

Now, we have encountered that we don't get the file size in a lot of circumstances. I was curious whether we could use an HTTP HEAD request first and hopefully get a content length back from that. We should test with some archive(s) where we don't get the content length via the usual GET request.
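For reference, a minimal sketch of such a probe, using cohttp-lwt-unix purely for illustration (not necessarily the HTTP client used here); `content_length_via_head` is a made-up name:

```ocaml
(* Sketch: ask for the size with a HEAD request before the real GET.
   Uses cohttp-lwt-unix for illustration only. *)
let content_length_via_head uri =
  let open Lwt.Syntax in
  let* resp = Cohttp_lwt_unix.Client.head uri in
  Lwt.return (Cohttp.Header.get (Cohttp.Response.headers resp) "content-length")

let () =
  Lwt_main.run
    (let open Lwt.Syntax in
     let* len =
       content_length_via_head (Uri.of_string "https://example.org/archive.tgz")
     in
     (match len with
      | Some l -> print_endline ("content-length: " ^ l)
      | None -> print_endline "no content-length in HEAD response");
     Lwt.return_unit)
```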

Author
Owner

It turns out that e.g. GitHub is very sporadic when it comes to content-length headers, and HEAD requests don't change a thing here. See https://github.com/orgs/community/discussions/76604 -- it may help to use HTTP/1.1 instead of HTTP/2 (for GitHub).

The issue with a missing content-length header is that we then keep the file in memory while downloading and only dump it to disk at the last step -- i.e. we fill quite some memory before writing anything to disk.

But we could do better: stream-dump to a separate place on our disk, and once the download is done, stream-read from there and write into the tar. So I'll leave this open, since I think it makes sense to have a bunch of spare dump areas on our block device for temporarily storing downloads without a content-length. WDYT?
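A rough sketch of that two-phase idea, written with plain file I/O for illustration (the real thing would go to the block device); `receive_chunk`, `write_tar_header` and `write_tar_data` are hypothetical stand-ins for the actual download and tar-writing code:

```ocaml
(* Sketch of the two-phase approach for downloads without a content-length:
   spill the body to a temporary file first, then, once the size is known,
   stream it from disk into the tar behind a correct header. *)
let spill_then_archive ~receive_chunk ~write_tar_header ~write_tar_data ~tmp_path =
  (* phase 1: stream the download to the spill file as chunks arrive *)
  let oc = open_out_bin tmp_path in
  let rec drain () =
    match receive_chunk () with
    | Some chunk -> output_string oc chunk; drain ()
    | None -> ()
  in
  drain ();
  close_out oc;
  (* phase 2: the size is known now, so a proper tar header can be written
     before the payload is streamed back out of the spill file *)
  let size = (Unix.stat tmp_path).Unix.st_size in
  write_tar_header ~size;
  let ic = open_in_bin tmp_path in
  let buf = Bytes.create 65536 in
  let rec copy () =
    let n = input ic buf 0 (Bytes.length buf) in
    if n > 0 then (write_tar_data (Bytes.sub_string buf 0 n); copy ())
  in
  copy ();
  close_in ic;
  Sys.remove tmp_path
```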

Owner

Yes, I have had it in mind that it could be interesting to have temporary dumps to block devices. Since there may be several tasks that want to access this swap-like partition, I think we would need to track block allocations in memory.
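Something like the following could be the in-memory bookkeeping, assuming a fixed block size; all names and sizes here are made up:

```ocaml
(* Sketch: split the swap region into fixed-size blocks and keep the free
   block indices in memory. *)
module Swap_alloc = struct
  type t = {
    block_size : int;
    mutable free : int list;  (* indices of currently free blocks *)
  }

  let create ~block_size ~blocks =
    { block_size; free = List.init blocks Fun.id }

  (* take enough blocks to hold [bytes]; None if the region is too full *)
  let alloc t ~bytes =
    let needed = (bytes + t.block_size - 1) / t.block_size in
    if List.length t.free < needed then None
    else
      let rec take n acc rest =
        if n = 0 then (List.rev acc, rest)
        else match rest with
          | [] -> (List.rev acc, [])
          | x :: tl -> take (n - 1) (x :: acc) tl
      in
      let taken, rest = take needed [] t.free in
      t.free <- rest;
      Some taken

  (* give blocks back once the download has been copied into the tar *)
  let release t blocks = t.free <- blocks @ t.free
end
```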

Author
Owner

> Yes, I have had it in mind that it could be interesting to have temporary dumps to block devices. Since there may be several tasks that want to access this swap-like partition, I think we would need to track block allocations in memory.

...or have one swap space that grows from the beginning of the caches back towards the end of the tar data, and have all other downloads with unknown content-length wait until the current one has completed. I thought we could have multiple "swap spaces", but then we'd need to decide on a size for each, and I'm not sure what a good size would be - given that some artifacts in opam are ~500MB.
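A sketch of that serialization, assuming Lwt: a single mutex guards the one swap space, and downloads with a known content-length bypass it entirely; `spill_then_archive` and `stream_into_tar` are hypothetical:

```ocaml
(* Sketch: serialize downloads without a known size over the single swap
   space, while sized downloads stream straight into the tar. *)
let swap_lock = Lwt_mutex.create ()

let download ~content_length ~spill_then_archive ~stream_into_tar =
  match content_length with
  | Some size ->
    (* size known: reserve the bytes in the tar and stream directly *)
    stream_into_tar ~size
  | None ->
    (* size unknown: wait for exclusive access to the single swap space *)
    Lwt_mutex.with_lock swap_lock spill_then_archive
```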

From my experiments above, retrying a GitHub GET request multiple times can eventually yield a response that includes a content-length. This is strange, but that's how the Internet out there is... This was with curl -v and HTTP/2 being used.

Author
Owner

done in #16

Reference: robur/opam-mirror#10