backup error - starting deadline exceeded
Description
When starting a backup, the deadline to start it can be exceeded, especially in a sharded environment with many nodes.
We set a deadline of 40 seconds, and if the backup has not started by then it is marked as failed (although it can actually still succeed).
This deadline is set here: https://github.com/percona/percona-server-mongodb-operator/blob/main/pkg/controller/perconaservermongodbbackup/backup.go#L15
The error can look like this:
apiVersion: psmdb.percona.com/v1
kind: PerconaServerMongoDBBackup
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"psmdb.percona.com/v1","kind":"PerconaServerMongoDBBackup","metadata":{"annotations":{},"name":"backup1","namespace":"mongo"},"spec":{"psmdbCluster":"my-cluster-name","storageName":"s3-eu-west-2"}}
  creationTimestamp: "2022-03-21T09:23:12Z"
  generation: 1
  name: backup1
  namespace: mongo
  resourceVersion: "2189522"
  uid: 89007408-3439-4b52-a406-af71643f7b40
spec:
  psmdbCluster: my-cluster-name
  storageName: s3-eu-west-2
status:
  azure:
    credentialsSecret: ""
  destination: "2022-03-21T09:23:33Z"
  error: starting deadline exceeded
  lastTransition: "2022-03-21T09:23:34Z"
  pbmName: "2022-03-21T09:23:33Z"
  s3:
    bucket: <bucket>
    credentialsSecret: my-cluster-name-backup-s3
    region: eu-west-2
  start: "2022-03-21T09:23:34Z"
  state: error
  storageName: s3-eu-west-2
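For illustration, a minimal sketch of the kind of check the description refers to. Only the pbmStartingDeadline name comes from the operator; the metadata struct and function are hypothetical simplifications, not the operator's actual code:

package backup

import "time"

// pbmStartingDeadline mirrors the deadline the operator applies before it
// gives up on a backup that never left the "starting" state (40s at the
// time this ticket was filed). The name matches the operator's constant;
// everything else here is simplified.
const pbmStartingDeadline = 40 * time.Second

// backupMeta is a simplified, hypothetical view of the backup metadata;
// the real structs live in percona-backup-mongodb.
type backupMeta struct {
	Status    string    // e.g. "starting", "running", "done", "error"
	StartedAt time.Time // when PBM created the backup entry
}

// startingDeadlineExceeded reports whether a backup that is still in the
// "starting" state has been stuck there longer than the deadline and
// should therefore be marked as failed.
func startingDeadlineExceeded(meta backupMeta, now time.Time) bool {
	if meta.Status != "starting" {
		return false
	}
	return now.Sub(meta.StartedAt) > pbmStartingDeadline
}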
Environment
AFFECTED CS IDs
Smart Checklist
Activity
Tomislav Plavcic August 19, 2022 at 8:20 AM
I have tried to reproduce it, but without success. I think for now this is OK with the fix for the timeout, and we will see how it goes in the future.
Checked with operator version:
2022-08-19T08:11:21.174Z INFO setup Git commit: 986b4579634babe559a9a2ef8a8524a30645140b Git branch: main
ege.gunes August 1, 2022 at 11:07 AM
PBM initializes the backup with status "starting", but if the nodes can't connect to the storage, no node actually starts the backup and it stays stuck in "starting". The PBM CLI has a check that marks the backup as failed if it doesn't start within some time (33 seconds); see the sketch after this comment. There is also PBM-673 to add the same logic to PBM itself, but no one has touched that ticket for a long time.
Unfortunately, I couldn't reproduce the error with many nodes (i.e. 5 shards with 5 pods each), only with wrong storage credentials. @Tomislav Plavcic if you manage to reproduce it, we can also try with the PBM CLI itself, since it should fail too. It may help us to prioritize https://perconadev.atlassian.net/browse/PBM-673.
I believe we can merge @Former user's PR to increase this timeout to 120 seconds, since the main goal of the check was to inform the user that there is something wrong with their backup if it doesn't start for some time.
@Slava Sarzhan
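For context, a rough sketch (not PBM's actual code) of the CLI-style wait described in the comment above: poll the backup status and give up if it never leaves "starting" within the deadline. fetchStatus is a hypothetical stand-in for whatever reads PBM's backup metadata:

package backup

import (
	"context"
	"errors"
	"time"
)

// waitForBackupToStart polls the backup status until the backup leaves the
// "starting" state or the deadline expires, roughly mirroring the check
// the PBM CLI performs (about 33s there; 40s in the operator at the time).
func waitForBackupToStart(ctx context.Context, fetchStatus func() (string, error), deadline time.Duration) error {
	ctx, cancel := context.WithTimeout(ctx, deadline)
	defer cancel()

	ticker := time.NewTicker(time.Second)
	defer ticker.Stop()

	for {
		select {
		case <-ctx.Done():
			// No node picked the backup up in time.
			return errors.New("starting deadline exceeded")
		case <-ticker.C:
			status, err := fetchStatus()
			if err != nil {
				return err
			}
			if status != "starting" {
				return nil // some node started (or finished/failed) the backup
			}
		}
	}
}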
Former user July 14, 2022 at 9:30 AM
Would it make sense to increase the deadline to 120 seconds? I've sent a merge request regarding this: increase pbmStartingDeadline from 40s to 120s by wmvfw · Pull Request #976 · percona/percona-server-mongodb-operator (github.com)
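Judging by the PR title, the change amounts to bumping the constant in pkg/controller/perconaservermongodbbackup/backup.go, along these lines (simplified, not the verbatim diff):

package backup

import "time"

// The gist of PR #976: raise the starting deadline so large sharded
// clusters have more time to pick a backup up before the operator marks
// it as failed.
const pbmStartingDeadline = 120 * time.Second // was 40 * time.Second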
Details
Assignee: ege.gunes
Reporter: Tomislav Plavcic
Labels: None
Needs QA: Yes
Fix versions: None
Affects versions: None
Priority: Medium
