Backup hangs after pbm-agent is restarted

Description

Hi,

Because of the bug described in https://jira.percona.com/browse/PBM-1126, we have a problem with backups. If the resource memory limit is small (500M in our case), pbm-agent can be OOM-killed during a backup as soon as it uses more than that limit.

From agent log:
2023/11/14 20:08:58 [entrypoint] `pbm-agent` exited with code -1
2023/11/14 20:08:58 [entrypoint] restart in 5 sec
2023/11/14 20:09:03 [entrypoint] starting `pbm-agent`
2023-11-14T20:09:03.000+0000 I pbm-agent:
Version: 2.3.0
Platform: linux/amd64
GitCommit: 3b1c2e263901cf041c6b83547f6f28ac2879911f
GitBranch: release-2.3.0
BuildTime: 2023-09-20_14:42_UTC
GoVersion: go1.19
2023-11-14T20:09:03.000+0000 I starting PITR routine
2023-11-14T20:09:03.000+0000 I node: rs0/mongodb-1a1-rs0-2.mongodb-1a1-rs0.percona-everest.svc.cluster.local:27017
2023-11-14T20:09:03.000+0000 I listening for the commands

As a result, after the restart pbm-agent cannot continue, restart, or fail the backup, and the backup hangs.

 

This is a critical issue for us because backups stop working entirely after the pbm-agent restart. Please fix this issue.

 

Environment

None

Attachments

1

Activity

Boris Ilijic August 14, 2024 at 11:48 AM
Edited

Hi,

This issue cannot be reproduced using the PBM CLI.
The only way to reproduce it is to send the backup command directly to the pbmCmd collection, which is the approach the Operator currently uses:
https://github.com/percona/percona-server-mongodb-operator/blob/main/pkg/controller/perconaservermongodbbackup/backup.go#L80

To simulate the same usage pattern as the Operator, I used this command:

 

Explanation of the bug:
When the Operator triggers a backup using SendCmd, PBM's lock validation (guard) is skipped: PBM performs that check on the CLI side, and the Operator writes to pbmCmd directly, bypassing the CLI:

[pbm cli] -> [pbmCmd communication 'channel'] -> [pbmAgent]

When a PBM Agent is restarted (e.g. because it was OOMKilled), the backup is typically still running on the other nodes. The Operator then sends a new backup command and, because the lock check is skipped, the PBM Agents start running things in parallel (the old backup on the non-restarted nodes, the new backup on the restarted node), which ends in the results described in the ticket.

This fix prevents that error situation by not allowing a new backup to start until the old backup has completely finished (on all shards).
If the Operator starts a new backup while the previous one is still in progress, the following error message will be logged:

In the future, we can add a RunBackup command to the SDK, or something similar, to make things easier for everybody.

Jan Mynar June 27, 2024 at 9:39 AM

According to the team's findings, the agent was restarted.
We also have a proposal for how to fix it: get rid of the logs, mark the backup as failed, and make it possible to run the backup again.

Slava Sarzhan June 20, 2024 at 1:45 PM

Hi, please check the attachment.

If you need any additional information from me, just ping me.

Jan Mynar May 30, 2024 at 9:33 AM

Are you able to reproduce this issue and provide the PBM agent container logs from all agents? Thanks.

Slava Sarzhan May 7, 2024 at 4:54 PM

Hi,

In my case, I can’t start a new backup


from pbm:

As you can see all new backups just “stuck”.

Done

Details

Needs QA

Yes

Created November 16, 2023 at 9:18 AM
Updated September 2, 2024 at 8:57 AM
Resolved August 28, 2024 at 12:05 PM