backup error - starting deadline exceeded
Description
When starting a backup, the deadline to start it can be exceeded, especially in a sharded environment with many nodes.
We set a deadline of 40 seconds, and if the backup has not started by then it is marked as failed (although it can actually still succeed).
This deadline is set here: https://github.com/percona/percona-server-mongodb-operator/blob/main/pkg/controller/perconaservermongodbbackup/backup.go#L15
The error can look like this:
apiVersion: psmdb.percona.com/v1
kind: PerconaServerMongoDBBackup
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"psmdb.percona.com/v1","kind":"PerconaServerMongoDBBackup","metadata":{"annotations":{},"name":"backup1","namespace":"mongo"},"spec":{"psmdbCluster":"my-cluster-name","storageName":"s3-eu-west-2"}}
  creationTimestamp: "2022-03-21T09:23:12Z"
  generation: 1
  name: backup1
  namespace: mongo
  resourceVersion: "2189522"
  uid: 89007408-3439-4b52-a406-af71643f7b40
spec:
  psmdbCluster: my-cluster-name
  storageName: s3-eu-west-2
status:
  azure:
    credentialsSecret: ""
  destination: "2022-03-21T09:23:33Z"
  error: starting deadline exceeded
  lastTransition: "2022-03-21T09:23:34Z"
  pbmName: "2022-03-21T09:23:33Z"
  s3:
    bucket: <bucket>
    credentialsSecret: my-cluster-name-backup-s3
    region: eu-west-2
  start: "2022-03-21T09:23:34Z"
  state: error
  storageName: s3-eu-west-2
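For illustration, a minimal sketch of the kind of check the description refers to. Only the pbmStartingDeadline name comes from the operator; the metadata struct and function are hypothetical simplifications, not the operator's actual code:

package backup

import "time"

// pbmStartingDeadline mirrors the deadline the operator applies before it
// gives up on a backup that never left the "starting" state (40s at the
// time this ticket was filed). The name matches the operator's constant;
// everything else here is simplified.
const pbmStartingDeadline = 40 * time.Second

// backupMeta is a simplified, hypothetical view of the backup metadata;
// the real structs live in percona-backup-mongodb.
type backupMeta struct {
	Status    string    // e.g. "starting", "running", "done", "error"
	StartedAt time.Time // when PBM created the backup entry
}

// startingDeadlineExceeded reports whether a backup that is still in the
// "starting" state has been stuck there longer than the deadline and
// should therefore be marked as failed.
func startingDeadlineExceeded(meta backupMeta, now time.Time) bool {
	if meta.Status != "starting" {
		return false
	}
	return now.Sub(meta.StartedAt) > pbmStartingDeadline
}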
Environment
AFFECTED CS IDs
Smart Checklist
Activity
Tomislav Plavcic August 19, 2022 at 8:20 AM
I have tried to reproduce it, but without success. I think for now this is OK with the fix for the timeout, and we will see how it goes in the future.
Checked with operator version:
2022-08-19T08:11:21.174Z INFO setup Git commit: 986b4579634babe559a9a2ef8a8524a30645140b Git branch: main
ege.gunes August 1, 2022 at 11:07 AM
PBM initializes the backup with status "starting", but if the nodes can't connect to the storage, no node actually starts the backup and it stays stuck in "starting". The PBM CLI has a check that marks the backup as failed if it doesn't start within some time (33 seconds); see the sketch after this comment. There is also PBM-673 to add the same logic to PBM itself, but no one has touched that ticket for a long time.
Unfortunately, I couldn't reproduce the error with many nodes (i.e. 5 shards with 5 pods each), only with wrong storage credentials. @Tomislav Plavcic if you manage to reproduce it, we can also try with the PBM CLI itself, since it should fail too. It may help us to prioritize https://perconadev.atlassian.net/browse/PBM-673.
I believe we can merge @Former user's PR to increase this timeout to 120 seconds, since the main goal of the check was to inform the user that there is something wrong with their backup if it doesn't start for some time.
@Slava Sarzhan
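For context, a rough sketch (not PBM's actual code) of the CLI-style wait described in the comment above: poll the backup status and give up if it never leaves "starting" within the deadline. fetchStatus is a hypothetical stand-in for whatever reads PBM's backup metadata:

package backup

import (
	"context"
	"errors"
	"time"
)

// waitForBackupToStart polls the backup status until the backup leaves the
// "starting" state or the deadline expires, roughly mirroring the check
// the PBM CLI performs (about 33s there; 40s in the operator at the time).
func waitForBackupToStart(ctx context.Context, fetchStatus func() (string, error), deadline time.Duration) error {
	ctx, cancel := context.WithTimeout(ctx, deadline)
	defer cancel()

	ticker := time.NewTicker(time.Second)
	defer ticker.Stop()

	for {
		select {
		case <-ctx.Done():
			// No node picked the backup up in time.
			return errors.New("starting deadline exceeded")
		case <-ticker.C:
			status, err := fetchStatus()
			if err != nil {
				return err
			}
			if status != "starting" {
				return nil // some node started (or finished/failed) the backup
			}
		}
	}
}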
Former user July 14, 2022 at 9:30 AM
Would it make sense to increase the deadline to 120 seconds? I've sent a merge request regarding this: increase pbmStartingDeadline from 40s to 120s by wmvfw · Pull Request #976 · percona/percona-server-mongodb-operator (github.com)
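Judging by the PR title, the change amounts to bumping the constant in pkg/controller/perconaservermongodbbackup/backup.go, along these lines (simplified, not the verbatim diff):

package backup

import "time"

// The gist of PR #976: raise the starting deadline so large sharded
// clusters have more time to pick a backup up before the operator marks
// it as failed.
const pbmStartingDeadline = 120 * time.Second // was 40 * time.Second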
Details
Assignee: ege.gunes
Reporter: Tomislav Plavcic
Labels: None
Needs QA: Yes
Fix versions: None
Affects versions: None
Priority: Medium
