backup error - starting deadline exceeded

Description

When a backup is started, the deadline for it to begin can be exceeded, especially in a sharded environment with many nodes.
We have set a deadline of 40 seconds; if the backup has not started by then, it is marked as failed (although it can actually succeed).
This deadline is set here: https://github.com/percona/percona-server-mongodb-operator/blob/main/pkg/controller/perconaservermongodbbackup/backup.go#L15

The error can look like this:

apiVersion: psmdb.percona.com/v1
kind: PerconaServerMongoDBBackup
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"psmdb.percona.com/v1","kind":"PerconaServerMongoDBBackup","metadata":{"annotations":{},"name":"backup1","namespace":"mongo"},"spec":{"psmdbCluster":"my-cluster-name","storageName":"s3-eu-west-2"}}
  creationTimestamp: "2022-03-21T09:23:12Z"
  generation: 1
  name: backup1
  namespace: mongo
  resourceVersion: "2189522"
  uid: 89007408-3439-4b52-a406-af71643f7b40
spec:
  psmdbCluster: my-cluster-name
  storageName: s3-eu-west-2
status:
  azure:
    credentialsSecret: ""
  destination: "2022-03-21T09:23:33Z"
  error: starting deadline exceeded
  lastTransition: "2022-03-21T09:23:34Z"
  pbmName: "2022-03-21T09:23:33Z"
  s3:
    bucket: <bucket>
    credentialsSecret: my-cluster-name-backup-s3
    region: eu-west-2
  start: "2022-03-21T09:23:34Z"
  state: error
  storageName: s3-eu-west-2

Environment

None

AFFECTED CS IDs

CS0028208

Activity

Tomislav Plavcic August 19, 2022 at 8:20 AM

I tried to reproduce it, but without success. I think the timeout fix is sufficient for now, and we will see how it goes in the future.
Checked with operator version:

2022-08-19T08:11:21.174Z INFO setup Git commit: 986b4579634babe559a9a2ef8a8524a30645140b Git branch: main

ege.gunes August 1, 2022 at 11:07 AM

PBM initializes the backup with status "starting", but if the nodes can't connect to the storage, no node actually starts the backup and it stays stuck in "starting". The PBM CLI has a check that marks a backup as failed if it doesn't start after some time (33 seconds). There is also PBM-673 to add the same logic to PBM itself, but no one has touched that ticket for a long time.

Unfortunately, I couldn't reproduce the error with many nodes (e.g. 5 shards with 5 pods in each), only with wrong storage credentials. If you manage to reproduce it, we can also try with the PBM CLI itself, since it should fail too. That may help us prioritize https://perconadev.atlassian.net/browse/PBM-673#icft=PBM-673.

I believe we can merge the PR that increases this timeout to 120 seconds, since the main goal of the check was to inform the user that something is wrong with their backup if it doesn't start for some time.

 

Former user July 14, 2022 at 9:30 AM

Would it make sense to increase the deadline to 120 seconds? I've sent a pull request for this: "increase pbmStartingDeadline from 40s to 120s" by wmvfw (Pull Request #976, percona/percona-server-mongodb-operator, github.com).

Done

Details

Needs QA

Yes

Created March 21, 2022 at 11:45 AM
Updated March 5, 2024 at 4:42 PM
Resolved August 19, 2022 at 8:20 AM