perconapgbackup could stall forever in Starting

Description

It’s unclear why, but on a production system a perconapgbackup has been sitting in the Starting state for 4d23h:

spec:
  pgCluster: cluster1
  repoName: repo1
status:
  crVersion: 2.2.0
  image: postgres-operator/percona-postgresql-operator:2.2.0-ppg15-pgbackrest
  repo:
    name: repo1
    schedules:
      full: 0 13 * * 0
      incremental: 0 7 * * 2,4
    volume:
      volumeClaimSpec:
        accessModes:
          - ReadWriteOnce
        resources:
          requests:
            storage: 400Gi
        storageClassName: standard-ssd
  state: Starting
  storageType: filesystem
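
For reference, a quick way to list all backups together with their state and creation time, to spot ones stuck like this (a minimal sketch; it only relies on the .status.state field visible above):

kubectl -n pgo get perconapgbackup \
  -o custom-columns='NAME:.metadata.name,STATE:.status.state,CREATED:.metadata.creationTimestamp'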

 

I can’t reproduce why the Job wasn’t created, but I could simulate the behavior with:

# create ~1000 PerconaPGBackup objects from the sample manifest
# (piping the sed output to kubectl apply is assumed here)
for i in {1..1000} ; do
  sed -e 's/1/'$i'/' deploy/backup.yaml | kubectl -n pgo apply -f -
done

# point every backup at the same cluster
for bkp in $(kubectl -n pgo get perconapgbackup | grep ^backup | awk '{print $1}') ; do
  kubectl -n pgo patch perconapgbackup $bkp --type='merge' -p '{"spec": {"pgCluster": "cluster1"}}'
done

# force every backup into the Starting state
for bkp in $(kubectl -n pgo get perconapgbackup | grep ^backup | awk '{print $1}') ; do
  kubectl -n pgo patch --subresource=status perconapgbackup $bkp --type='merge' -p '{"status": {"state": "Starting"}}'
done
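
A quick way to confirm the simulated objects really are sitting in Starting (a sketch; the count simply depends on how many manifests were applied):

kubectl -n pgo get perconapgbackup -o custom-columns='STATE:.status.state' --no-headers | grep -c Starting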

It leads to an infinite stream of messages:

time="2024-11-01T14:21:25Z" level=info msg="Waiting for backup to start" PerconaPGBackup=pgo/backup332 controller=perconapgbackup controllerGroup=pgv2.percona.com controllerKind=PerconaPGBackup name=backup332 namespace=pgo reconcileID=2079ca64-eedb-4322-b8c1-4ea47ca4e49c request=pgo/backup332 version=
time="2024-11-01T14:21:25Z" level=info msg="Waiting for backup to start" PerconaPGBackup=pgo/backup220 controller=perconapgbackup controllerGroup=pgv2.percona.com controllerKind=PerconaPGBackup name=backup220 namespace=pgo reconcileID=f1785dab-f4b1-4e99-947e-ab019d56c751 request=pgo/backup220 version=
time="2024-11-01T14:21:25Z" level=info msg="Waiting for backup to start" PerconaPGBackup=pgo/backup250 controller=perconapgbackup controllerGroup=pgv2.percona.com controllerKind=PerconaPGBackup name=backup250 namespace=pgo reconcileID=44c055f7-b474-4505-84d1-c79db4b4aaad request=pgo/backup250 version=

 

Expected behavior: the operator should have a “timeout” for Job creation and, once it expires, retry creating the Job instead of logging “Waiting for backup to start” messages forever.
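
Roughly the same check can be sketched from outside the operator. The snippet below is an illustration only, not the proposed fix itself: it uses jq to flag backups that have been in Starting for more than an hour, measuring from creationTimestamp for simplicity, since the status shown above doesn’t expose when the state changed.

# illustration only: flag perconapgbackup objects stuck in Starting for > 1 hour,
# approximating the kind of timeout the operator could apply before recreating the Job
kubectl -n pgo get perconapgbackup -o json | jq -r '
  .items[]
  | select(.status.state == "Starting")
  | select((now - (.metadata.creationTimestamp | fromdateiso8601)) > 3600)
  | .metadata.name'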

Environment

None

AFFECTED CS IDs

CS0050549

Activity

Nickolay Ihalainen November 8, 2024 at 5:19 PM

Yes, I was able to fully reproduce the case, but I want to keep this issue just for the log message spam: https://perconadev.atlassian.net/browse/K8SPG-680

 

$ kubectl -n pgo logs percona-postgresql-operator-6547b6774c-j9zfb | grep -c 'Waiting for backup to start'
226

$ kubectl -n pgo logs percona-postgresql-operator-6547b6774c-j9zfb | grep -m 1 'Waiting for backup to start'
2024-11-08T17:00:04.034Z INFO Waiting for backup to start {"controller": "perconapgbackup", "controllerGroup": "pgv2.percona.com", "controllerKind": "PerconaPGBackup", "PerconaPGBackup": {"name":"cluster1-repo1-full-qvv9r","namespace":"pgo"}, "namespace": "pgo", "name": "cluster1-repo1-full-qvv9r", "reconcileID": "44fcb781-fd0f-4bc1-afbb-76c3a231e65d", "request": {"name":"cluster1-repo1-full-qvv9r","namespace":"pgo"}}

$ kubectl -n pgo logs percona-postgresql-operator-6547b6774c-j9zfb | tail -n 1
2024-11-08T17:19:14.266Z INFO Waiting for backup to start {"controller": "perconapgbackup", "controllerGroup": "pgv2.percona.com", "controllerKind": "PerconaPGBackup", "PerconaPGBackup": {"name":"cluster1-repo1-full-qvv9r","namespace":"pgo"}, "namespace": "pgo", "name": "cluster1-repo1-full-qvv9r", "reconcileID": "489b5d65-5ec6-4c09-94bc-2fc1d858193e", "request": {"name":"cluster1-repo1-full-qvv9r","namespace":"pgo"}}

Slava Sarzhan November 4, 2024 at 2:07 PM

Can you reproduce it with PGO 2.5.0?

Details

Needs QA

Yes

Created November 1, 2024 at 2:42 PM
Updated November 8, 2024 at 5:19 PM
