perconapgbackup could stall forever in Starting
Description
Environment
None
AFFECTED CS IDs
CS0050549
Activity
Nickolay Ihalainen November 8, 2024 at 5:19 PM
@Slava Sarzhan yes, I was able to fully reproduce the case, but I want to keep this issue just for the log message spam: https://perconadev.atlassian.net/browse/K8SPG-680
$ kubectl -n pgo logs percona-postgresql-operator-6547b6774c-j9zfb |grep -c 'Waiting for backup to start'
226
$ kubectl -n pgo logs percona-postgresql-operator-6547b6774c-j9zfb |grep -m 1 'Waiting for backup to start'
2024-11-08T17:00:04.034Z INFO Waiting for backup to start {"controller": "perconapgbackup", "controllerGroup": "pgv2.percona.com", "controllerKind": "PerconaPGBackup", "PerconaPGBackup": {"name":"cluster1-repo1-full-qvv9r","namespace":"pgo"}, "namespace": "pgo", "name": "cluster1-repo1-full-qvv9r", "reconcileID": "44fcb781-fd0f-4bc1-afbb-76c3a231e65d", "request": {"name":"cluster1-repo1-full-qvv9r","namespace":"pgo"}}
$ kubectl -n pgo logs percona-postgresql-operator-6547b6774c-j9zfb |tail -n 1
2024-11-08T17:19:14.266Z INFO Waiting for backup to start {"controller": "perconapgbackup", "controllerGroup": "pgv2.percona.com", "controllerKind": "PerconaPGBackup", "PerconaPGBackup": {"name":"cluster1-repo1-full-qvv9r","namespace":"pgo"}, "namespace": "pgo", "name": "cluster1-repo1-full-qvv9r", "reconcileID": "489b5d65-5ec6-4c09-94bc-2fc1d858193e", "request": {"name":"cluster1-repo1-full-qvv9r","namespace":"pgo"}}
Slava Sarzhan November 4, 2024 at 2:07 PM
@Nickolay Ihalainen can you reproduce it with PGO 2.5.0?
Details
Assignee
Unassigned
Reporter
Nickolay Ihalainen (Deactivated)
Needs QA
Yes
Affects versions
Priority
Medium
Created November 1, 2024 at 2:42 PM
Updated November 8, 2024 at 5:19 PM
It’s unclear why, on a production system, a perconapgbackup stays in the Starting state for 4d23h:
spec:
  pgCluster: cluster1
  repoName: repo1
status:
  crVersion: 2.2.0
  image: postgres-operator/percona-postgresql-operator:2.2.0-ppg15-pgbackrest
  repo:
    name: repo1
    schedules:
      full: 0 13 * * 0
      incremental: 0 7 * * 2,4
    volume:
      volumeClaimSpec:
        accessModes:
        - ReadWriteOnce
        resources:
          requests:
            storage: 400Gi
        storageClassName: standard-ssd
  state: Starting
  storageType: filesystem
I can’t reproduce why the Job wasn’t created, but I could simulate the behavior with:
for i in {1..1000} ; do sed -e 's/1/'$i/ deploy/backup.yaml ; done
for bkp in $(kubectl -n pgo get perconapgbackup|grep ^backup|awk '{print $1}') ; do kubectl -n pgo patch perconapgbackup $bkp --type='merge' -p '{"spec": {"pgCluster": "cluster1"}}' ; done
for bkp in $(kubectl -n pgo get perconapgbackup|grep ^backup|awk '{print $1}') ; do kubectl -n pgo patch --subresource=status perconapgbackup $bkp --type='merge' -p '{"status": {"state": "Starting"}}' ; done
It leads to an infinite stream of messages:
time="2024-11-01T14:21:25Z" level=info msg="Waiting for backup to start" PerconaPGBackup=pgo/backup332 controller=perconapgbackup controllerGroup=pgv2.percona.com controllerKind=PerconaPGBackup name=backup332 namespace=pgo reconcileID=2079ca64-eedb-4322-b8c1-4ea47ca4e49c request=pgo/backup332 version= time="2024-11-01T14:21:25Z" level=info msg="Waiting for backup to start" PerconaPGBackup=pgo/backup220 controller=perconapgbackup controllerGroup=pgv2.percona.com controllerKind=PerconaPGBackup name=backup220 namespace=pgo reconcileID=f1785dab-f4b1-4e99-947e-ab019d56c751 request=pgo/backup220 version= time="2024-11-01T14:21:25Z" level=info msg="Waiting for backup to start" PerconaPGBackup=pgo/backup250 controller=perconapgbackup controllerGroup=pgv2.percona.com controllerKind=PerconaPGBackup name=backup250 namespace=pgo reconcileID=44c055f7-b474-4505-84d1-c79db4b4aaad request=pgo/backup250 version=
The expected behavior: the operator should have a “timeout” for creating the Job and retry Job creation once it expires, instead of logging “Waiting for backup to start” indefinitely.
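A minimal sketch of such a deadline check, assuming a hypothetical StartedAt timestamp in the backup status and illustrative type/function names (this is not the operator's actual code), could look like:

package main

import (
	"context"
	"fmt"
	"time"
)

type BackupState string

const StateStarting BackupState = "Starting"

// PGBackupStatus mirrors only the fields relevant to this sketch.
type PGBackupStatus struct {
	State     BackupState
	StartedAt time.Time // hypothetical: when the backup entered Starting
}

// startingDeadline is an assumed limit; it could be made configurable.
const startingDeadline = 10 * time.Minute

// reconcileStarting returns how long to wait before the next check, or
// re-creates the backup Job once the deadline is exceeded.
func reconcileStarting(ctx context.Context, st PGBackupStatus, recreateJob func(context.Context) error) (time.Duration, error) {
	if st.State != StateStarting {
		return 0, nil // nothing to do
	}
	if time.Since(st.StartedAt) > startingDeadline {
		// Deadline exceeded: retry Job creation instead of logging
		// "Waiting for backup to start" forever.
		return 0, recreateJob(ctx)
	}
	// Still within the deadline: requeue and check again shortly.
	return 5 * time.Second, nil
}

func main() {
	// Simulate a backup stuck in Starting for almost five days.
	st := PGBackupStatus{State: StateStarting, StartedAt: time.Now().Add(-119 * time.Hour)}
	_, err := reconcileStarting(context.Background(), st, func(ctx context.Context) error {
		fmt.Println("re-creating backup Job")
		return nil
	})
	fmt.Println("err:", err)
}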