Backup hangs after pbm-agent is restarted
Description
Activity
Boris Ilijic August 14, 2024 at 11:48 AM (edited)
Hi,
This issue cannot be reproduced using the PBM CLI.
The only way to reproduce it is to send the backup command directly to the pbmCmd collection, which is the approach the Operator currently uses:
https://github.com/percona/percona-server-mongodb-operator/blob/main/pkg/controller/perconaservermongodbbackup/backup.go#L80
To simulate the same usage pattern as the Operator, I used this command:
Explanation of the bug:
When the Operator triggers a backup using SendCmd, PBM's lock validation (guard) is skipped. PBM performs that check on the CLI side, and the Operator bypasses the CLI by writing to pbmCmd directly:
[pbm cli] -> [pbmCmd communication 'channel'] -> [pbmAgent]
When a PBM Agent is restarted (e.g. because it was OOMKilled), the backup is typically still running on the other nodes. The Operator then sends a new backup command and, because the lock check is skipped, PBM Agents start running things in parallel (the old backup on the non-restarted nodes, the new backup on the restarted node), which ends in the results described in the ticket.
The fix prevents this error situation by not allowing a new backup to start until the old backup is completely done (on all shards).
If the Operator starts a new backup while the previous one is still in progress, the following error message is logged:
In the future, we can add a RunBackup command to the SDK, or something similar, to make things easier for everybody.
Jan Mynar June 27, 2024 at 9:39 AM
According to the team's findings, the agent was restarted.
We also have a proposal for how to fix it: get rid of the logs, mark the backup as failed, and make it possible to run a backup again.
Slava Sarzhan June 20, 2024 at 1:45 PM
Hi, please check the attachment.
If you need any additional information from me, just ping me.
Jan Mynar May 30, 2024 at 9:33 AM
Could you please reproduce this issue and provide the PBM agent container logs from all agents? Thanks.
Slava Sarzhan May 7, 2024 at 4:54 PM
Hi,
In my case, I can’t start a new backup.
From pbm:
As you can see, all new backups are just stuck.
Hi,
Due to the bug https://jira.percona.com/browse/PBM-1126, we have a problem with backups. If the memory resource limit is small (500M in our case), the pbm-agent can be OOM-killed during the backup process if it uses more than 500M.
From agent log:
2023/11/14 20:08:58 [entrypoint] `pbm-agent` exited with code -1
2023/11/14 20:08:58 [entrypoint] restart in 5 sec
2023/11/14 20:09:03 [entrypoint] starting `pbm-agent`
2023-11-14T20:09:03.000+0000 I pbm-agent:
Version: 2.3.0
Platform: linux/amd64
GitCommit: 3b1c2e263901cf041c6b83547f6f28ac2879911f
GitBranch: release-2.3.0
BuildTime: 2023-09-20_14:42_UTC
GoVersion: go1.19
2023-11-14T20:09:03.000+0000 I starting PITR routine
2023-11-14T20:09:03.000+0000 I node: rs0/mongodb-1a1-rs0-2.mongodb-1a1-rs0.percona-everest.svc.cluster.local:27017
2023-11-14T20:09:03.000+0000 I listening for the commands
As a result, the pbm-agent can't continue, restart, or fail the backup after the restart, and the backup hangs.
This is a critical issue for us because backups stop working entirely after the pbm-agent restart. Please fix this issue.