Backup Jobs Fail Intermittently
Activity

Slava Sarzhan January 31, 2022 at 8:03 PM
Hi,
In the next release we will use PXB version 8.0.26-18.
Dustin Falgout December 9, 2021 at 6:51 PM

Sergey Pronin December 9, 2021 at 8:20 AM
Looks like we figured out the issue.
In xbcloud 8.0.26 we introduced an exponential backoff feature:
https://www.percona.com/doc/percona-xtrabackup/8.0/xbcloud/xbcloud_exbackoff.html
In the Operator we are still on version 8.0.25, which has no backoff, so we keep retrying the upload immediately, effectively spamming S3.
The next step here would be to use 8.0.26. We will discuss it internally and see how we can deliver this.
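For anyone unfamiliar with the feature, the sketch below shows the general idea behind exponential backoff with jitter: double the wait after every failed attempt and add a random offset so parallel workers do not all retry at the same instant. It is only an illustration of the technique, not the xbcloud implementation; the function and error strings are made up.

```go
// A minimal sketch of exponential backoff with jitter for retrying a failed
// chunk upload. NOT the xbcloud implementation; names are hypothetical.
package main

import (
	"errors"
	"fmt"
	"math/rand"
	"time"
)

// uploadChunk stands in for a single PUT to object storage.
func uploadChunk(attempt int) error {
	if attempt < 3 {
		return errors.New("503 Slow Down") // simulate a throttled request
	}
	return nil
}

func uploadWithBackoff(maxRetries int) error {
	base := 500 * time.Millisecond
	for attempt := 0; attempt <= maxRetries; attempt++ {
		err := uploadChunk(attempt)
		if err == nil {
			return nil
		}
		if attempt == maxRetries {
			return err
		}
		// Double the delay on every retry and add random jitter so that
		// parallel workers do not retry in lockstep.
		delay := base << attempt
		delay += time.Duration(rand.Int63n(int64(base)))
		fmt.Printf("attempt %d failed (%v), retrying in %v\n", attempt, err, delay)
		time.Sleep(delay)
	}
	return nil
}

func main() {
	if err := uploadWithBackoff(5); err != nil {
		fmt.Println("upload failed:", err)
	}
}
```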
Dustin Falgout December 8, 2021 at 9:06 PM
I had to set parallel to 1 in order to stop jobs from hitting the rate limit and failing. It's working fine now. I don't understand why the backup agent makes multiple PUT requests to the same object at the same time when parallel is set to anything other than 1; that doesn't seem right. In any case, if this issue only affects some storage providers and not all of them, then I think the best solution is to make the parallel argument configurable in the YAML file. Thoughts?
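To illustrate what such a parallel setting would control, here is a minimal sketch (not Operator or xbcloud code; all names are hypothetical) of bounding in-flight PUTs with a semaphore, where parallel=1 serializes the uploads:

```go
// A sketch of limiting concurrent uploads with a semaphore, which is what a
// configurable "parallel" setting would effectively control.
package main

import (
	"fmt"
	"sync"
	"time"
)

func putObject(key string) {
	// Placeholder for a single PUT request to object storage.
	time.Sleep(100 * time.Millisecond)
	fmt.Println("uploaded", key)
}

func uploadAll(keys []string, parallel int) {
	sem := make(chan struct{}, parallel) // at most `parallel` in-flight PUTs
	var wg sync.WaitGroup
	for _, key := range keys {
		wg.Add(1)
		sem <- struct{}{} // block if the limit is reached
		go func(k string) {
			defer wg.Done()
			defer func() { <-sem }()
			putObject(k)
		}(key)
	}
	wg.Wait()
}

func main() {
	keys := []string{"chunk.00000001", "chunk.00000002", "chunk.00000003"}
	uploadAll(keys, 1) // parallel=1 serializes the PUTs
}
```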
Dustin Falgout December 7, 2021 at 1:35 AM
I contacted DigitalOcean about this and here's what they said:
Hi Dustin,
Thanks for reaching out about this. I understand that you have been experiencing throttling on your Space. Looking into it, in one example I see repeated PUT attempts for this file multiple times a second:
/db-cluster-1-2021-12-06-08%3A00%3A00-full/cloud/dc_121_comments.ibd.lz4.00000000000000000001
I apologize that you are encountering those issues. We have a concurrent PUT limit for the same object key (excluding multi-part uploads), and the maximum concurrent limit is 2. That might be the reason your PUT requests are failing intermittently.
For example: when you send one PUT request to /spaces/folder-name, it is routed by the first load balancer, which processes the request. If you simultaneously send another PUT request to /spaces/folder-name, there is a chance that request hits a different load balancer and is accepted (or rejected if it is again routed to the first load balancer). However, a third concurrent request will be rejected until the first request is processed.
So, I would recommend optimizing your requests and making sure only one PUT request is sent at a time.
Let us know if you have any other questions.
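As an aside, one way to respect the per-key limit they describe, while still uploading different objects in parallel, would be to serialize PUTs per object key. The sketch below is purely illustrative (hypothetical helper names, not the backup agent's code):

```go
// A sketch of serializing PUTs per object key, so at most one PUT per key is
// in flight at a time while different keys can still upload in parallel.
package main

import (
	"fmt"
	"sync"
	"time"
)

type keyedUploader struct {
	mu    sync.Mutex
	locks map[string]*sync.Mutex
}

func newKeyedUploader() *keyedUploader {
	return &keyedUploader{locks: make(map[string]*sync.Mutex)}
}

// lockFor returns the mutex guarding a single object key.
func (u *keyedUploader) lockFor(key string) *sync.Mutex {
	u.mu.Lock()
	defer u.mu.Unlock()
	if _, ok := u.locks[key]; !ok {
		u.locks[key] = &sync.Mutex{}
	}
	return u.locks[key]
}

// put performs one PUT for the key; concurrent calls for the same key are
// serialized by the per-key mutex.
func (u *keyedUploader) put(key string) {
	l := u.lockFor(key)
	l.Lock()
	defer l.Unlock()
	time.Sleep(50 * time.Millisecond) // placeholder for the actual PUT
	fmt.Println("uploaded", key)
}

func main() {
	u := newKeyedUploader()
	var wg sync.WaitGroup
	for i := 0; i < 3; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			u.put("dc_121_comments.ibd.lz4.00000000000000000001")
		}()
	}
	wg.Wait()
}
```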
Why would the backup script be making PUT requests to the same object multiple times per second?
I see some rate limit errors in the logs, but it's not clear to me that they are the cause of the failure. Here is the output from the most recent failure: