Reconciler fails with "job.batch not found" when using ttlSecondsAfterFinished
General
Escalation
Description
When using ttlSecondsAfterFinished, there seems to be a race condition where the Job is deleted before the operator has had enough time to reconcile the perconapgbackups object.
It happens not only with unreasonably short ttlSecondsAfterFinished values, but even with 1m, 5m, or 30m timeouts.
2025-01-16T14:03:55.705Z ERROR Reconciler error {"controller": "perconapgbackup", "controllerGroup": "pgv2.percona.com", "controllerKind": "PerconaPGBackup", "PerconaPGBackup": {"name":"cluster1-repo1-incr-bd4dm","namespace":"pgottl"}, "namespace": "pgottl", "name": "cluster1-repo1-incr-bd4dm", "reconcileID": "2e801810-d779-495f-b6d2-adbc55fc2397", "error": "get backup job: Job.batch \"cluster1-backup-2nff\" not found", "errorVerbose": "Job.batch \"cluster1-backup-2nff\" not found\nget backup job\ngithub.com/percona/percona-postgresql-operator/percona/controller/pgbackup.(*PGBackupReconciler).Reconcile\n\t/go/src/github.com/percona/percona-postgresql-operator/percona/controller/pgbackup/controller.go:189\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).Reconcile\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.19.0/pkg/internal/controller/controller.go:116\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.19.0/pkg/internal/controller/controller.go:303\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.19.0/pkg/internal/controller/controller.go:263\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.19.0/pkg/internal/controller/controller.go:224\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1695"}
The perconapgbackups object stays Running forever, blocking all subsequent backups:
NAME                        CLUSTER    REPO    STATUS      TYPE   COMPLETED   AGE
cluster1-repo1-incr-bm59k   cluster1   repo1   Succeeded   incr   17h         17h
cluster1-repo1-incr-bd4dm   cluster1   repo1   Running     incr               17h
cluster1-repo1-incr-btdrw   cluster1   repo1   Succeeded   incr   17h         17h
cluster1-repo1-incr-7szgg   cluster1   repo1   Succeeded   incr   17h         17h
cluster1-repo1-incr-cxx7s   cluster1   repo1   Succeeded   incr   17h         17h
cluster1-repo1-incr-cl8zm   cluster1   repo1   Succeeded   incr   17h         17h
cluster1-repo1-incr-w6789   cluster1   repo1   Succeeded   incr   17h         17h
cluster1-repo1-incr-qtp77   cluster1   repo1   Succeeded   incr   17h         17h
cluster1-repo1-incr-79wrx   cluster1   repo1   Succeeded   incr   17h         17h
cluster1-repo1-incr-b8s59   cluster1   repo1   Succeeded   incr   17h         17h
cluster1-repo1-incr-nphv5   cluster1   repo1   Succeeded   incr   17h         17h
cluster1-repo1-incr-rqzqv   cluster1   repo1   Succeeded   incr   16h         17h
cluster1-repo1-incr-rp9lx   cluster1   repo1   Succeeded   incr   16h         16h
cluster1-repo1-incr-88wld   cluster1   repo1   Succeeded   incr   16h         16h
cluster1-repo1-incr-tbq84   cluster1   repo1   Succeeded   incr   16h         16h
cluster1-repo1-incr-p8x7j   cluster1   repo1   Succeeded   incr   16h         16h
cluster1-repo1-incr-dbbhk   cluster1   repo1   Succeeded   incr   16h         16h
cluster1-repo1-incr-k4th2   cluster1   repo1   Succeeded   incr   16h         16h
cluster1-repo1-incr-hljhd   cluster1   repo1   Succeeded   incr   16h         16h
cluster1-repo1-incr-w6xgw   cluster1   repo1   Succeeded   incr   16h         16h
cluster1-repo1-incr-vjgqn   cluster1   repo1   Succeeded   incr   16h         16h
cluster1-repo1-incr-rlht9   cluster1   repo1   Succeeded   incr   16h         16h
cluster1-repo1-incr-vrzls   cluster1   repo1   Succeeded   incr   16h         16h
cluster1-repo1-incr-84l77   cluster1   repo1   Succeeded   incr   16h         16h
cluster1-repo1-incr-876dg   cluster1   repo1   Succeeded   incr   16h         16h
cluster1-repo1-incr-4vl9d   cluster1   repo1   Succeeded   incr   16h         16h
cluster1-repo1-incr-vm582   cluster1   repo1   Succeeded   incr   16h         16h
cluster1-repo1-incr-xz5dz   cluster1   repo1   Succeeded   incr   16h         16h
cluster1-repo1-incr-j292b   cluster1   repo1   Succeeded   incr   16h         16h
cluster1-repo1-incr-6p5sx   cluster1   repo1   Running     incr               16h
cluster1-repo1-incr-qd28s   cluster1   repo1   Succeeded   incr   16h         16h
cluster1-repo1-incr-w9pwb   cluster1   repo1   Succeeded   incr   16h         16h
cluster1-repo1-incr-mbpcg   cluster1   repo1   Succeeded   incr   16h         16h
cluster1-repo1-incr-4k78q   cluster1   repo1   Succeeded   incr   16h         16h
cluster1-repo1-incr-zwtgc   cluster1   repo1   Succeeded   incr   16h         16h
cluster1-repo1-incr-22zlh   cluster1   repo1   Succeeded   incr   16h         16h
cluster1-repo1-incr-kgdhb   cluster1   repo1   Succeeded   incr   16h         16h
cluster1-repo1-incr-fgp2s   cluster1   repo1   Succeeded   incr   16h         16h
cluster1-repo1-incr-44qdr   cluster1   repo1   Succeeded   incr   16h         16h
cluster1-repo1-incr-hvghk   cluster1   repo1   Succeeded   incr   16h         16h
cluster1-repo1-incr-rdw5n   cluster1   repo1   Succeeded   incr   16h         16h
cluster1-repo1-incr-b69sz   cluster1   repo1   Succeeded   incr   15h         16h
cluster1-repo1-incr-jfsmv   cluster1   repo1   Succeeded   incr   15h         15h
cluster1-repo1-incr-n7tvb   cluster1   repo1   Succeeded   incr   15h         15h
cluster1-repo1-incr-8qp7k   cluster1   repo1   Succeeded   incr   15h         15h
cluster1-repo1-incr-c2cct   cluster1   repo1   Succeeded   incr   15h         15h
cluster1-repo1-incr-vthqg   cluster1   repo1   Succeeded   incr   15h         15h
cluster1-repo1-incr-4bvtr   cluster1   repo1   Succeeded   incr   15h         15h
cluster1-repo1-incr-l5rn9   cluster1   repo1   Succeeded   incr   15h         15h
cluster1-repo1-incr-fq762   cluster1   repo1   Succeeded   incr   15h         15h
cluster1-repo1-incr-c4f6x   cluster1   repo1   Succeeded   incr   15h         15h
cluster1-repo1-incr-b6nmr   cluster1   repo1   Succeeded   incr   15h         15h
cluster1-repo1-incr-ckjvd   cluster1   repo1   Succeeded   incr   15h         15h
cluster1-repo1-incr-mn5ff   cluster1   repo1   Succeeded   incr   15h         15h
cluster1-repo1-incr-4dqvr   cluster1   repo1   Succeeded   incr   15h         15h
cluster1-repo1-incr-ql6td   cluster1   repo1   Succeeded   incr   15h         15h
cluster1-repo1-incr-9bnq9   cluster1   repo1   Succeeded   incr   15h         15h
cluster1-repo1-incr-7r87c   cluster1   repo1   Running     incr               15h
cluster1-repo1-incr-8hf6c   cluster1   repo1                                  15h
cluster1-repo1-incr-84xcb   cluster1   repo1                                  15h
cluster1-repo1-incr-hpxqd   cluster1   repo1                                  15h
cluster1-repo1-incr-thzp2   cluster1   repo1                                  15h
cluster1-repo1-incr-7gqc9   cluster1   repo1                                  15h
cluster1-repo1-incr-m59zs   cluster1   repo1                                  15h
cluster1-repo1-incr-5cvkq   cluster1   repo1                                  15h
cluster1-repo1-incr-cwrjn   cluster1   repo1                                  15h
cluster1-repo1-incr-knknf   cluster1   repo1                                  15h
cluster1-repo1-incr-64hpr   cluster1   repo1                                  15h
cluster1-repo1-incr-z9v6q   cluster1   repo1                                  15h
cluster1-repo1-incr-nktrj   cluster1   repo1                                  15h
cluster1-repo1-incr-6tvfl   cluster1   repo1                                  15h
cluster1-repo1-incr-8kxlb   cluster1   repo1                                  15h
cluster1-repo1-incr-nzqqx   cluster1   repo1                                  14h
cluster1-repo1-incr-2pmm6   cluster1   repo1                                  14h
cluster1-repo1-incr-skvhb   cluster1   repo1                                  14h
cluster1-repo1-incr-swqsz   cluster1   repo1                                  14h
cluster1-repo1-incr-4hh6p   cluster1   repo1                                  14h
cluster1-repo1-incr-8xxbq   cluster1   repo1                                  14h
cluster1-repo1-incr-kkzs4   cluster1   repo1                                  14h
cluster1-repo1-incr-p445z   cluster1   repo1                                  14h
cluster1-repo1-incr-69c2l   cluster1   repo1                                  14h
cluster1-repo1-incr-d9bzx   cluster1   repo1                                  14h
cluster1-repo1-incr-j69cp   cluster1   repo1                                  14h
cluster1-repo1-incr-6rtlg   cluster1   repo1                                  14h
cluster1-repo1-incr-54wg6   cluster1   repo1                                  14h
cluster1-repo1-incr-fkqk5   cluster1   repo1                                  14h
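For what it's worth, stuck objects like these can be spotted by filtering on the STATUS column. A small sketch: against a live cluster you would pipe `kubectl get pg-backup -n pgottl --no-headers` into the awk filter below; here it runs over a captured sample so it can be tried offline.

```shell
# Filter pg-backup listings for objects stuck in Running (STATUS is field 4).
# Live cluster equivalent:
#   kubectl get pg-backup -n pgottl --no-headers | awk '$4 == "Running"'
sample='cluster1-repo1-incr-bm59k cluster1 repo1 Succeeded incr 17h 17h
cluster1-repo1-incr-bd4dm cluster1 repo1 Running incr 17h'
printf '%s\n' "$sample" | awk '$4 == "Running"'
```

Pending objects with a blank STATUS shift their fields left, so they never match the filter either way.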
(Not sure how 3 of them ended up Running; some logs have rotated, unfortunately.)
When the reconciler gets
"error": "get backup job: Job.batch \"cluster1-backup-2nff\" not found"
it ends up stuck right at the beginning of the loop:
case v2.BackupRunning:
	job := &batchv1.Job{}
	err := r.Client.Get(ctx, types.NamespacedName{Name: pgBackup.Status.JobName, Namespace: pgBackup.Namespace}, job)
	if err != nil {
		return reconcile.Result{}, errors.Wrap(err, "get backup job") // <- fails here
	}

	status := checkBackupJob(job)
	switch status {
	case v2.BackupFailed:
		log.Info("Backup failed")
	case v2.BackupSucceeded:
		log.Info("Backup succeeded")
	default:
		log.Info("Waiting for backup to complete")
		return reconcile.Result{RequeueAfter: time.Second * 5}, nil
	}

	rr, err := finishBackup(ctx, r.Client, pgBackup, job)
	if err != nil {
		return reconcile.Result{}, err
	}
I would argue there should be a mechanism to automatically drop this kind of stale perconapgbackups object.
I would expect it to be marked failed and the backup retried (or the next scheduled one allowed to run, if there already is one), but as it stands this prevents any further backups from running.
It is also unclear to me why the operator can miss Jobs even with a 30m TTL.
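To illustrate the idea, here is a rough sketch of one possible handling, not actual operator code: in the v2.BackupRunning case, a NotFound error could be treated as a terminal state instead of being returned. I'm assuming apierrors is k8s.io/apimachinery/pkg/api/errors and that the status can be updated roughly like this; the exact field names and update mechanics may differ.

case v2.BackupRunning:
	job := &batchv1.Job{}
	err := r.Client.Get(ctx, types.NamespacedName{Name: pgBackup.Status.JobName, Namespace: pgBackup.Namespace}, job)
	if err != nil {
		if apierrors.IsNotFound(err) {
			// The Job was garbage-collected (e.g. by ttlSecondsAfterFinished)
			// before its final state was observed. Mark the backup failed so
			// it no longer blocks subsequent backups.
			log.Info("Backup job not found, assuming it was TTL-deleted")
			pgBackup.Status.State = v2.BackupFailed
			if err := r.Client.Status().Update(ctx, pgBackup); err != nil {
				return reconcile.Result{}, errors.Wrap(err, "update status")
			}
			return reconcile.Result{}, nil
		}
		return reconcile.Result{}, errors.Wrap(err, "get backup job")
	}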
How to reproduce
anydbver deploy k3d k8s-pg:2.5.0 k8s-pg:2.5.0,namespace=pgottl
pgbackrest:
  image: percona/percona-postgresql-operator:2.5.0-ppg16.4-pgbackrest2.53-1
  jobs:
    resources: {}
    ttlSecondsAfterFinished: 60
  ...
  repos:
  - name: repo1
    schedules:
      full: 0 0 * * 6
      incremental: '*/2 * * * *'
After some time, I end up reproducing this issue.