Increasing retries does not work as expected

Description

Let’s say my connection to S3 is not super reliable, and I configure something like:
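(A sketch of the shape of that configuration; the bucket details and numbers below are placeholders, only the storage.s3.retryer part matters:)

    storage:
      type: s3
      s3:
        region: us-east-1            # placeholder
        bucket: my-backup-bucket     # placeholder
        retryer:
          numMaxRetries: 20          # ask for many retries, hoping one eventually succeeds
          minRetryDelay: 30          # milliseconds
          maxRetryDelay: 5           # minutes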

Basically, hoping that enough retries will succeed at some point.

It looks like the agentCheckup routine doesn’t use the same retry options. If the connection to S3 fails in the middle of the backup, I get this in the log, with storage.s3.debugLogLevels="RequestRetries":

Upload keeps being retried, but agentCheckup only ever gets to 3 retries.

Since agentCheckup fails, the backup gets marked as “stuck” in pbm status:

Even if the connection to S3 comes back, the backup does not seem to recover and remains “stuck” until the next backup is started.
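(For reference, a sketch of how this state was inspected, assuming the standard pbm CLI commands; the exact output is omitted here:)

    # overall state; the running backup shows up as "stuck" here
    pbm status
    # recent agent log entries (debug severity) for more detail
    pbm logs --tail=100 --severity=D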

Environment

None

AFFECTED CS IDs

CS0045467 CS0048559

Attachments

1

Activity

Sami Ahlroos August 16, 2024 at 11:21 AM

Hi Sandra,

Thanks a lot for pointing out the read “connection reset” issue. That makes perfect sense and explains the latest failure.

Some background: I was trying to come up with a way to reliably reproduce the issue in , which is when I got the “backup stuck” situation. I don’t have the log from back then, and haven’t been able to reproduce it any more.

It is possible that something else happened that caused the backup to be “stuck”, and I missed that.

Since I haven’t been able to reproduce it, I’d say we are good to close this and re-open if something new comes up.

Sandra Romanchenko August 15, 2024 at 3:52 PM
Edited

Hi

I do have some questions in order to reproduce the issue, so your assistance would be very helpful.

  • Regarding the initial issue: the agentCheckup procedure and the number of retries applied there should have nothing to do with a backup being marked as stuck. Usually this message appears when the agent is lost during the backup for some reason (for instance, an agent crash). Is it possible that there were any PBM crashes, or that the process was stopped during the backup?

  • Is this issue happening on the customer’s setup, or are all those logs from your lab setup? In either case, can you please specify exactly which storage is used (AWS, MinIO, etc.) and, if the issue is reproducible on your lab setup, the exact steps to reproduce it?

  • As for the number of retry attempts: the fact that PBM still retries to upload parts for some time after the connection to storage is restored is actually OK; however, your log doesn’t show that those attempts fail. Can you please set debugLogLevels to RequestRetries,RequestErrors for more info and provide a fresh log? First of all, it will clearly show whether the number of attempts set in the config is applied, but also whether the actual attempt was successful. For instance,
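(The specific log example is not reproduced here; one way to apply those log levels with the pbm CLI, sketched for illustration:)

    # enable retry and request-error logging for the S3 layer
    pbm config --set storage.s3.debugLogLevels="RequestRetries,RequestErrors"
    # confirm the effective configuration, including the retryer values actually applied
    pbm config --list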

TIA

Sami Ahlroos August 15, 2024 at 11:52 AM

Full agent configuration:

The units for the retryer are shown when using pbm config --list. In the config file I had:

Huge numbers because of , not related to issue at hand.
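(The original fragment isn’t shown above; a sketch of the shape it takes, with placeholder values in place of the real ones:)

    storage:
      s3:
        retryer:
          numMaxRetries: 20     # placeholder; the report used much larger numbers
          minRetryDelay: 30     # bare number in the file; milliseconds per the docs quoted below
          maxRetryDelay: 5      # bare number in the file; minutes per the docs quoted below
    # pbm config --list echoes these back with explicit units, as noted above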

Interestingly, using version 2.5.0 I was not able to reproduce the “backup stuck” situation that I got with 2.4.1.

Instead, I get a failed backup before reaching the 20 retries that have been configured:

I’ll attach the full log, but the part that shows the problem (I believe) is:

The S3 server became reachable at about 11:30:58, so the agentCheckup retries stopped. However, s3/UploadPart keeps retrying and failing; it never gets to 20 retries but fails at “attempt 5” according to the log.

radoslaw.szulgo August 13, 2024 at 12:58 PM

→ Please provide the precise configuration fragment for the retryer. Does it really contain “units”?

→ Can you provide more detailed logs, or the whole log file?

radoslaw.szulgo August 13, 2024 at 12:33 PM

 

docs:

retryer.minRetryDelay

Type: time.Duration
Required: NO
Default: 30

The minimum time (in ms) to wait till the next retry. Available in Percona Backup for MongoDB as of 1.7.0.

retryer.maxRetryDelay

Type: time.Duration
Required: NO
Default: 5

The maximum time (in minutes) to wait till the next retry. Available in Percona Backup for MongoDB as of 1.7.0.

Cannot Reproduce

Details


Needs QA

Yes


Created May 21, 2024 at 12:55 PM
Updated March 24, 2025 at 12:02 PM
Resolved August 16, 2024 at 11:28 AM