swapfs error handling: not enough space #19

Open
opened 2024-11-13 12:21:23 +00:00 by hannes · 2 comments
Owner

When this error occurs, we should re-schedule the download with fewer parallel downloads -- or have some other recovery strategy. It happens with an empty data file and 20 parallel downloads on our server (with the default partition sizes).
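
A minimal sketch of the kind of recovery meant here, assuming an Lwt-based downloader; `download_all` and `Out_of_space` are hypothetical placeholders, not the actual opam-mirror API:

```ocaml
(* Hypothetical sketch: on a "not enough space" error, halve the
   parallelism and re-schedule the whole batch. [download_all] and
   [Out_of_space] are placeholders, not the real opam-mirror API. *)
exception Out_of_space

let rec download_with_fallback ~download_all ~parallel urls =
  Lwt.catch
    (fun () -> download_all ~parallel urls)
    (function
      | Out_of_space when parallel > 1 ->
          download_with_fallback ~download_all ~parallel:(parallel / 2) urls
      | e -> Lwt.fail e)
```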

Owner

Hmm! I would expect 1 GB to be enough (assuming defaults). Then again, that's "only" about 50 MiB per download task. Then there's up to 20 MB of unusable storage due to the blocking factor / misalignment. It's also possible there is a leak somewhere.
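
A back-of-the-envelope check of those numbers (the 1 MiB blocking factor is my assumption, inferred from the "up to 20 MB" figure):

```ocaml
(* Rough budget check; the 1 MiB block size is an assumption. *)
let total_bytes = 1024 * 1024 * 1024   (* 1 GiB swapfs (default) *)
let tasks = 20
let per_task = total_bytes / tasks     (* = 53_687_091, ~51 MiB per task *)
let block = 1024 * 1024                (* assumed blocking factor *)
let max_waste = tasks * block          (* up to ~20 MiB lost to misalignment *)
```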

Thinking about it a bit more I remember different versions of the same package are often downloaded "sequentially" (not in the strict sense; "close to at the same time"). And often the source archives are of a similar size between releases. So randomizing the order would, I think, make this less likely to occur.
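
Randomizing the queue is straightforward; a Fisher-Yates shuffle sketch in plain OCaml, with no project-specific types assumed:

```ocaml
(* Shuffle the download queue so similarly-sized archives of the
   same package are less likely to run back-to-back. *)
let shuffle lst =
  let a = Array.of_list lst in
  for i = Array.length a - 1 downto 1 do
    let j = Random.int (i + 1) in
    let tmp = a.(i) in
    a.(i) <- a.(j);
    a.(j) <- tmp
  done;
  Array.to_list a
```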

I agree it would be a good idea to lower the number of concurrent downloads - maybe 4 or 8?

Finally, I agree that a better recovery strategy is needed - one way or another.
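
Capping concurrency could be as simple as funnelling every download through a fixed-size `Lwt_pool`; a sketch, with `fetch` as a hypothetical per-URL download function:

```ocaml
(* Cap parallelism with a fixed-size pool of download "slots";
   [fetch] is a placeholder for the actual per-URL download. *)
let download_bounded ~parallel ~fetch urls =
  let slots = Lwt_pool.create parallel (fun () -> Lwt.return_unit) in
  Lwt_list.iter_p (fun url -> Lwt_pool.use slots (fun () -> fetch url)) urls
```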

Author
Owner

I think we need a rethink about failure behaviour and (recoverable) errors...

There's not only the swapfs failure: we also have mimic errors ("no connection found", for which I guess we should retry establishing the connection), plus HTTP error codes and bad checksums.
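
For the connection-establishment case, a bounded retry with backoff might be enough; a sketch assuming Lwt, with `connect` standing in for the mimic connection attempt:

```ocaml
(* Retry a failing connection attempt a few times with exponential
   backoff; [connect] stands in for the mimic connection setup. *)
open Lwt.Syntax

let rec with_retries ?(attempts = 3) ?(delay = 1.0) connect =
  Lwt.catch connect (fun e ->
      if attempts <= 1 then Lwt.fail e
      else
        let* () = Lwt_unix.sleep delay in
        with_retries ~attempts:(attempts - 1) ~delay:(delay *. 2.) connect)
```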

Now, taking a big download into consideration, we should be careful about retrying it over and over when there's a bad checksum (or another error; swapfs comes to mind).

So, for the failure behaviour, we should really take care to also work well on small pipes and to avoid unnecessarily retrying something that is prone to failure... For e.g. a bad checksum we could always remember the last-modified and etag values and send the right HTTP headers (If-None-Match / If-Modified-Since), as sketched below.
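
A sketch of that conditional request, using cohttp-lwt-unix for illustration (the actual client code in opam-mirror may differ):

```ocaml
(* Conditional GET: send If-None-Match / If-Modified-Since built from
   the values remembered from the previous download, so an unchanged
   archive answers 304 and costs no bandwidth. *)
open Lwt.Syntax

let conditional_get ?etag ?last_modified uri =
  let headers =
    let h = Cohttp.Header.init () in
    let h = match etag with
      | Some e -> Cohttp.Header.add h "if-none-match" e
      | None -> h
    in
    match last_modified with
    | Some d -> Cohttp.Header.add h "if-modified-since" d
    | None -> h
  in
  let* resp, body = Cohttp_lwt_unix.Client.get ~headers uri in
  match Cohttp.Response.status resp with
  | `Not_modified -> Lwt.return None   (* cached copy is still valid *)
  | _ ->
    let* data = Cohttp_lwt.Body.to_string body in
    Lwt.return (Some data)
```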

Reference: robur/opam-mirror#19