Reconciler fails with "job.batch not found" when using ttlSecondsAfterFinished

Description

When using ttlSecondsAfterFinished, there appears to be a race condition in which the job is deleted before the operator has had enough time to reconcile the perconapgbackups object.

It does not happen only with unreasonably short ttlSecondsAfterFinished values, but also with TTLs of 1m, 5m or even 30m.

The perconapgbackups object stays Running forever, blocking subsequent backups.

(Not sure how 3 of them ended up Running; some logs have rotated, unfortunately.)

When getting

it ends up stuck right at the beginning of the loop.


I would argue there should be a mechanism to automatically drop this kind of stale perconapgbackups object.

I would expect it to be marked as failed and the backup retried (or the next one allowed to run, if one is already scheduled), but here it prevents backups from running.

 

It is also unclear to me why the operator can miss some jobs even with a 30m TTL.


How to reproduce

After some time, I eventually reproduce this issue.

Environment

None

AFFECTED CS IDs

CS0052230

Activity

yoann.lacancellera January 20, 2025 at 1:47 PM

I reproduced the same symptom after many days even without using a TTL, in a different namespace.

I do have a schedule forcing an incremental backup every 2 minutes though, on my laptop, so the CPU load has been high at times (part of my test).

However, it seems to be a different bug, or maybe it is just a limitation? (I'm really running a crazy schedule with tons of incremental backups on top of 1 full backup.)

Details

Needs QA

Yes

Created January 16, 2025 at 3:37 PM
Updated March 19, 2025 at 12:54 PM