This issue occurred in a production environment using 1.6.0. I have not managed to reproduce the initial race condition in 1.8.1, but I have confirmed that the resulting inability to delete jobs in a bad state still applies, which is what I consider to be the bug.
Steps to reproduce:
PBM is configured to take PITR backups every 10 minutes
A full backup is triggered
The full backup fails because it started at the exact moment a PITR backup was being taken.
There is no way to delete the failed job record, except by manual deletion from Mongo or forcing a resync. There may be circumstances where forcing a resync doesn't work, as job results have been written to disk.
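For illustration, the manual deletion from Mongo mentioned in the last step could look roughly like the sketch below. It is not PBM's own method. The field names `name` and `status` and the status value `"error"` are assumptions about the pbmBackups record layout, and the helper names are hypothetical. The `collection` argument is anything exposing pymongo's `count_documents` and `delete_one` methods.

```python
# Sketch only. The "name" and "status" fields and the "error" status value
# are assumptions about the pbmBackups schema; inspect a real record first.


def failed_backup_filter(name):
    """Build a filter that matches one backup record only if it is in an error state."""
    return {"name": name, "status": "error"}


def delete_failed_backup(collection, name, dry_run=True):
    """Remove the metadata record of a single failed backup.

    `collection` must offer pymongo's Collection interface
    (count_documents and delete_one). In dry-run mode nothing is
    deleted and the number of matching records is returned; otherwise
    the number of deleted records is returned.
    """
    query = failed_backup_filter(name)
    if dry_run:
        return collection.count_documents(query)
    return collection.delete_one(query).deleted_count
```

With pymongo this would be called on the pbmBackups collection, assuming it lives in the `admin` database, first with the default `dry_run=True` to confirm that exactly one record matches. Filtering on the status as well as the name is a deliberate guard against removing a healthy backup by mistake. This only removes the record in Mongo; any files the job already wrote to storage are left untouched.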
You will note from the timestamps this occurred a while ago. The impact of the failure has only recently come to light, as cleanup of old backups failed.
Logs from the initial backup failure:
The above failure created the following record in pbmBackups:
Which pbm status 1.6.0 reported as:
pbm status 1.8.1 reports the same status as:
From my perspective, the bug isn't that the backup failed; failures can happen for lots of reasons. The bug is that PBM provides no way to delete the failed job without manual intervention:
It should be possible to force PBM to clean up any traces of an incomplete backup, possibly with an extra flag?
Environment
None
Activity
Aaditya Dubey December 10, 2023 at 8:36 AM
Hi,
Closing the report, as there has been no activity for a long time.
Aaditya Dubey January 27, 2023 at 2:19 PM
Hi,
Thank you for the report. Please let me know if the issue still persists.