Backup Jobs Fail Intermittently
Activity

Slava Sarzhan January 31, 2022 at 8:03 PM
Hi,
In the next release we will use PXB version 8.0.26-18.
Dustin Falgout December 9, 2021 at 6:51 PM

Sergey Pronin December 9, 2021 at 8:20 AM
Looks like we figured out the issue.
In xbcloud 8.0.26 we introduced an exponential backoff feature:
https://www.percona.com/doc/percona-xtrabackup/8.0/xbcloud/xbcloud_exbackoff.html
In the Operator we are still on version 8.0.25, which has no backoff, so we keep retrying the upload immediately, effectively spamming S3.
The next step here would be to use 8.0.26. We will discuss it internally and see how we can deliver this.
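For anyone unfamiliar with the feature, the sketch below shows the general idea behind exponential backoff with jitter: double the wait after every failed attempt and add a random offset so parallel workers do not all retry at the same instant. It is only an illustration of the technique, not the xbcloud implementation; the function and error strings are made up.

```go
// A minimal sketch of exponential backoff with jitter for retrying a failed
// chunk upload. NOT the xbcloud implementation; names are hypothetical.
package main

import (
	"errors"
	"fmt"
	"math/rand"
	"time"
)

// uploadChunk stands in for a single PUT to object storage.
func uploadChunk(attempt int) error {
	if attempt < 3 {
		return errors.New("503 Slow Down") // simulate a throttled request
	}
	return nil
}

func uploadWithBackoff(maxRetries int) error {
	base := 500 * time.Millisecond
	for attempt := 0; attempt <= maxRetries; attempt++ {
		err := uploadChunk(attempt)
		if err == nil {
			return nil
		}
		if attempt == maxRetries {
			return err
		}
		// Double the delay on every retry and add random jitter so that
		// parallel workers do not retry in lockstep.
		delay := base << attempt
		delay += time.Duration(rand.Int63n(int64(base)))
		fmt.Printf("attempt %d failed (%v), retrying in %v\n", attempt, err, delay)
		time.Sleep(delay)
	}
	return nil
}

func main() {
	if err := uploadWithBackoff(5); err != nil {
		fmt.Println("upload failed:", err)
	}
}
```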
Dustin Falgout December 8, 2021 at 9:06 PM
I had to set parallel to 1 in order to stop jobs from hitting the rate limit and failing. It's working fine now. I don't understand why the backup agent makes multiple PUT requests to the same object at the same time when parallel is set to anything other than 1; that doesn't seem right. In any case, if this issue only affects some storage providers and not all of them, then I think the best solution is to make the parallel argument configurable in the YAML file. Thoughts?
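To illustrate what such a parallel setting would control, here is a minimal sketch (not Operator or xbcloud code; all names are hypothetical) of bounding in-flight PUTs with a semaphore, where parallel=1 serializes the uploads:

```go
// A sketch of limiting concurrent uploads with a semaphore, which is what a
// configurable "parallel" setting would effectively control.
package main

import (
	"fmt"
	"sync"
	"time"
)

func putObject(key string) {
	// Placeholder for a single PUT request to object storage.
	time.Sleep(100 * time.Millisecond)
	fmt.Println("uploaded", key)
}

func uploadAll(keys []string, parallel int) {
	sem := make(chan struct{}, parallel) // at most `parallel` in-flight PUTs
	var wg sync.WaitGroup
	for _, key := range keys {
		wg.Add(1)
		sem <- struct{}{} // block if the limit is reached
		go func(k string) {
			defer wg.Done()
			defer func() { <-sem }()
			putObject(k)
		}(key)
	}
	wg.Wait()
}

func main() {
	keys := []string{"chunk.00000001", "chunk.00000002", "chunk.00000003"}
	uploadAll(keys, 1) // parallel=1 serializes the PUTs
}
```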
Dustin Falgout December 7, 2021 at 1:35 AM
I contacted DigitalOcean about this and here's what they said:
Hi Dustin,
Thanks for reaching out about this. I understand that you have been experiencing throttling on your Space. Looking into it, in one example I see repeated PUT attempts for this file multiple times a second:
/db-cluster-1-2021-12-06-08%3A00%3A00-full/cloud/dc_121_comments.ibd.lz4.00000000000000000001
I apologize that you are encountering those issues. We have a concurrent PUT limit for the same object key (excluding multi-part uploads), and the maximum concurrent limit is 2. That might be the reason your PUT requests are failing intermittently.
For example: when you send one PUT request to /spaces/folder-name, it is routed by the first load balancer, which processes the request. If you simultaneously send another PUT request to /spaces/folder-name, there is a chance that request hits a different load balancer and is accepted (or rejected if it is again routed to the first load balancer). However, a third concurrent request will be rejected until the first request is processed.
So, I would recommend optimizing your requests and making sure only one PUT request is sent at a time.
Let us know if you have any other questions.
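As an aside, one way to respect the per-key limit they describe, while still uploading different objects in parallel, would be to serialize PUTs per object key. The sketch below is purely illustrative (hypothetical helper names, not the backup agent's code):

```go
// A sketch of serializing PUTs per object key, so at most one PUT per key is
// in flight at a time while different keys can still upload in parallel.
package main

import (
	"fmt"
	"sync"
	"time"
)

type keyedUploader struct {
	mu    sync.Mutex
	locks map[string]*sync.Mutex
}

func newKeyedUploader() *keyedUploader {
	return &keyedUploader{locks: make(map[string]*sync.Mutex)}
}

// lockFor returns the mutex guarding a single object key.
func (u *keyedUploader) lockFor(key string) *sync.Mutex {
	u.mu.Lock()
	defer u.mu.Unlock()
	if _, ok := u.locks[key]; !ok {
		u.locks[key] = &sync.Mutex{}
	}
	return u.locks[key]
}

// put performs one PUT for the key; concurrent calls for the same key are
// serialized by the per-key mutex.
func (u *keyedUploader) put(key string) {
	l := u.lockFor(key)
	l.Lock()
	defer l.Unlock()
	time.Sleep(50 * time.Millisecond) // placeholder for the actual PUT
	fmt.Println("uploaded", key)
}

func main() {
	u := newKeyedUploader()
	var wg sync.WaitGroup
	for i := 0; i < 3; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			u.put("dc_121_comments.ibd.lz4.00000000000000000001")
		}()
	}
	wg.Wait()
}
```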
Why would the backup script be making PUT requests to the same object multiple times per second?
I see some rate limit errors in the logs, but it's not clear to me that they are the cause of the failure. Here is the output from the most recent failure: