file size and parallel downloads #10

Closed
opened 2024-10-18 09:20:34 +00:00 by hannes · 4 comments
Owner

If I remember correctly, we do parallel downloads and (try to) reserve the needed bytes in the tar file by writing a proper header there, and then do stream writes.

Now, we have encountered that we don't get the file size in a lot of circumstances. I was curious whether we could use an HTTP HEAD request first and hopefully get a content length back from that. We should test with some archive(s) where we don't get the content length via the usual GET request.
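For reference, a minimal sketch of such a probe, using cohttp-lwt-unix purely for illustration (not necessarily the HTTP client used here); `content_length_via_head` is a made-up name:

```ocaml
(* Sketch: ask for the size with a HEAD request before the real GET.
   Uses cohttp-lwt-unix for illustration only. *)
let content_length_via_head uri =
  let open Lwt.Syntax in
  let* resp = Cohttp_lwt_unix.Client.head uri in
  Lwt.return (Cohttp.Header.get (Cohttp.Response.headers resp) "content-length")

let () =
  Lwt_main.run
    (let open Lwt.Syntax in
     let* len =
       content_length_via_head (Uri.of_string "https://example.org/archive.tgz")
     in
     (match len with
      | Some l -> print_endline ("content-length: " ^ l)
      | None -> print_endline "no content-length in HEAD response");
     Lwt.return_unit)
```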

Author
Owner

It turns out that e.g. GitHub is very sporadic when it comes to content-length headers, and HEAD requests don't change a thing here. See https://github.com/orgs/community/discussions/76604 -- it may help to use HTTP/1.1 instead of HTTP/2 (for GitHub).

The issue with a missing content-length header is that we then keep the file in memory while downloading and only dump it to disk at the last step -- i.e. we fill quite some memory before writing anything to disk.

But we could do better: stream-dump to a separate place on our disk, and once the download is done, stream-read from there and write into the tar. So I'll leave this open, since I think it makes sense to have a bunch of spare dump areas on our block device for temporarily storing downloads without a content-length. WDYT?
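A rough sketch of that two-phase idea, written with plain file I/O for illustration (the real thing would go to the block device); `receive_chunk`, `write_tar_header` and `write_tar_data` are hypothetical stand-ins for the actual download and tar-writing code:

```ocaml
(* Sketch of the two-phase approach for downloads without a content-length:
   spill the body to a temporary file first, then, once the size is known,
   stream it from disk into the tar behind a correct header. *)
let spill_then_archive ~receive_chunk ~write_tar_header ~write_tar_data ~tmp_path =
  (* phase 1: stream the download to the spill file as chunks arrive *)
  let oc = open_out_bin tmp_path in
  let rec drain () =
    match receive_chunk () with
    | Some chunk -> output_string oc chunk; drain ()
    | None -> ()
  in
  drain ();
  close_out oc;
  (* phase 2: the size is known now, so a proper tar header can be written
     before the payload is streamed back out of the spill file *)
  let size = (Unix.stat tmp_path).Unix.st_size in
  write_tar_header ~size;
  let ic = open_in_bin tmp_path in
  let buf = Bytes.create 65536 in
  let rec copy () =
    let n = input ic buf 0 (Bytes.length buf) in
    if n > 0 then (write_tar_data (Bytes.sub_string buf 0 n); copy ())
  in
  copy ();
  close_in ic;
  Sys.remove tmp_path
```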

Owner

Yes, I have had it in mind that it could be interesting to have temporary dumps to block devices. Since there may be several tasks that want to access this swap-like partition, I think we would need to track block allocations in memory.
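Something like the following could be the in-memory bookkeeping, assuming a fixed block size; all names and sizes here are made up:

```ocaml
(* Sketch: split the swap region into fixed-size blocks and keep the free
   block indices in memory. *)
module Swap_alloc = struct
  type t = {
    block_size : int;
    mutable free : int list;  (* indices of currently free blocks *)
  }

  let create ~block_size ~blocks =
    { block_size; free = List.init blocks Fun.id }

  (* take enough blocks to hold [bytes]; None if the region is too full *)
  let alloc t ~bytes =
    let needed = (bytes + t.block_size - 1) / t.block_size in
    if List.length t.free < needed then None
    else
      let rec take n acc rest =
        if n = 0 then (List.rev acc, rest)
        else match rest with
          | [] -> (List.rev acc, [])
          | x :: tl -> take (n - 1) (x :: acc) tl
      in
      let taken, rest = take needed [] t.free in
      t.free <- rest;
      Some taken

  (* give blocks back once the download has been copied into the tar *)
  let release t blocks = t.free <- blocks @ t.free
end
```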

Author
Owner

> Yes, I have had it in mind that it could be interesting to have temporary dumps to block devices. Since there may be several tasks that want to access this swap-like partition, I think we would need to track block allocations in memory.

...or have one swap space that grows from the beginning of the caches back towards the end of the tar data, and have all other downloads with unknown content-length wait until the current one has completed. I thought we could have multiple "swap spaces", but then we'd need to decide on a size for each, and I'm not sure what a good size would be - given that some artifacts in opam are ~500MB.
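A sketch of that serialization, assuming Lwt: a single mutex guards the one swap space, and downloads with a known content-length bypass it entirely; `spill_then_archive` and `stream_into_tar` are hypothetical:

```ocaml
(* Sketch: serialize downloads without a known size over the single swap
   space, while sized downloads stream straight into the tar. *)
let swap_lock = Lwt_mutex.create ()

let download ~content_length ~spill_then_archive ~stream_into_tar =
  match content_length with
  | Some size ->
    (* size known: reserve the bytes in the tar and stream directly *)
    stream_into_tar ~size
  | None ->
    (* size unknown: wait for exclusive access to the single swap space *)
    Lwt_mutex.with_lock swap_lock spill_then_archive
```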

From my experiments above, retrying a GitHub GET request multiple times can eventually yield a response that includes a content-length. This is strange, but that's how the Internet out there is... This was with curl -v and HTTP/2 being used.

Author
Owner

done in #16

Reference: robur/opam-mirror#10