swapfs error handling: not enough space #19

Open
opened 2024-11-13 12:21:23 +00:00 by hannes · 2 comments
Owner

When this error occurs, we should re-schedule the download with fewer parallel downloads -- or have some other recovery strategy. It happens with an empty data file and 20 parallel downloads on our server (with the default partition sizes).
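
A minimal sketch of the kind of recovery meant here, assuming an Lwt-based downloader; `download_all` and `Out_of_space` are hypothetical placeholders, not the actual opam-mirror API:

```ocaml
(* Hypothetical sketch: on a "not enough space" error, halve the
   parallelism and re-schedule the whole batch. [download_all] and
   [Out_of_space] are placeholders, not the real opam-mirror API. *)
exception Out_of_space

let rec download_with_fallback ~download_all ~parallel urls =
  Lwt.catch
    (fun () -> download_all ~parallel urls)
    (function
      | Out_of_space when parallel > 1 ->
          download_with_fallback ~download_all ~parallel:(parallel / 2) urls
      | e -> Lwt.fail e)
```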

Owner

Hmm! I would expect 1 GB to be enough (assuming defaults). Then again, that's "only" about 50 MiB per download task. Then there's up to 20 MB of unusable storage due to the blocking factor / misalignment. It's also possible there is a leak somewhere.
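
A back-of-the-envelope check of those numbers (the 1 MiB blocking factor is my assumption, inferred from the "up to 20 MB" figure):

```ocaml
(* Rough budget check; the 1 MiB block size is an assumption. *)
let total_bytes = 1024 * 1024 * 1024   (* 1 GiB swapfs (default) *)
let tasks = 20
let per_task = total_bytes / tasks     (* = 53_687_091, ~51 MiB per task *)
let block = 1024 * 1024                (* assumed blocking factor *)
let max_waste = tasks * block          (* up to ~20 MiB lost to misalignment *)
```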

Thinking about it a bit more I remember different versions of the same package are often downloaded "sequentially" (not in the strict sense; "close to at the same time"). And often the source archives are of a similar size between releases. So randomizing the order would, I think, make this less likely to occur.
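
Randomizing the queue is straightforward; a Fisher-Yates shuffle sketch in plain OCaml, with no project-specific types assumed:

```ocaml
(* Shuffle the download queue so similarly-sized archives of the
   same package are less likely to run back-to-back. *)
let shuffle lst =
  let a = Array.of_list lst in
  for i = Array.length a - 1 downto 1 do
    let j = Random.int (i + 1) in
    let tmp = a.(i) in
    a.(i) <- a.(j);
    a.(j) <- tmp
  done;
  Array.to_list a
```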

I agree it would be a good idea to lower the number of concurrent downloads - maybe 4 or 8?

Finally, I agree that a better recovery strategy is needed - one way or another.
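
Capping concurrency could be as simple as funnelling every download through a fixed-size `Lwt_pool`; a sketch, with `fetch` as a hypothetical per-URL download function:

```ocaml
(* Cap parallelism with a fixed-size pool of download "slots";
   [fetch] is a placeholder for the actual per-URL download. *)
let download_bounded ~parallel ~fetch urls =
  let slots = Lwt_pool.create parallel (fun () -> Lwt.return_unit) in
  Lwt_list.iter_p (fun url -> Lwt_pool.use slots (fun () -> fetch url)) urls
```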

Author
Owner

I think we need a rethink about failure behaviour and (recoverable) errors...

There's not only the swapfs failure: we also have mimic errors ("no connection found", for which I guess we should retry establishing the connection), plus HTTP error codes and bad checksums.
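
For the connection-establishment case, a bounded retry with backoff might be enough; a sketch assuming Lwt, with `connect` standing in for the mimic connection attempt:

```ocaml
(* Retry a failing connection attempt a few times with exponential
   backoff; [connect] stands in for the mimic connection setup. *)
open Lwt.Syntax

let rec with_retries ?(attempts = 3) ?(delay = 1.0) connect =
  Lwt.catch connect (fun e ->
      if attempts <= 1 then Lwt.fail e
      else
        let* () = Lwt_unix.sleep delay in
        with_retries ~attempts:(attempts - 1) ~delay:(delay *. 2.) connect)
```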

Now, taking a big download into consideration, we should be careful about retrying it over and over when there's a bad checksum (or another error; swapfs comes to mind).

So, for the failure behaviour, we should really take care to also work well on small pipes and to avoid unnecessarily retrying something that is prone to failure... For e.g. a bad checksum we could always remember the last-modified and etag values and send the right HTTP headers (If-None-Match / If-Modified-Since), as sketched below.
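
A sketch of that conditional request, using cohttp-lwt-unix for illustration (the actual client code in opam-mirror may differ):

```ocaml
(* Conditional GET: send If-None-Match / If-Modified-Since built from
   the values remembered from the previous download, so an unchanged
   archive answers 304 and costs no bandwidth. *)
open Lwt.Syntax

let conditional_get ?etag ?last_modified uri =
  let headers =
    let h = Cohttp.Header.init () in
    let h = match etag with
      | Some e -> Cohttp.Header.add h "if-none-match" e
      | None -> h
    in
    match last_modified with
    | Some d -> Cohttp.Header.add h "if-modified-since" d
    | None -> h
  in
  let* resp, body = Cohttp_lwt_unix.Client.get ~headers uri in
  match Cohttp.Response.status resp with
  | `Not_modified -> Lwt.return None   (* cached copy is still valid *)
  | _ ->
    let* data = Cohttp_lwt.Body.to_string body in
    Lwt.return (Some data)
```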

Reference: robur/opam-mirror#19