Backup hangs after pbm-agent is restarted

General

Escalation

Description

Hi,

Due to the bug https://jira.percona.com/browse/PBM-1126, we have a problem with backups. If the memory resource limit is small (500M in our case), pbm-agent can be OOM-killed during a backup once it uses more than 500M.

From the agent log:
2023/11/14 20:08:58 [entrypoint] `pbm-agent` exited with code -1
2023/11/14 20:08:58 [entrypoint] restart in 5 sec
2023/11/14 20:09:03 [entrypoint] starting `pbm-agent`
2023-11-14T20:09:03.000+0000 I pbm-agent:
Version: 2.3.0
Platform: linux/amd64
GitCommit: 3b1c2e263901cf041c6b83547f6f28ac2879911f
GitBranch: release-2.3.0
BuildTime: 2023-09-20_14:42_UTC
GoVersion: go1.19
2023-11-14T20:09:03.000+0000 I starting PITR routine
2023-11-14T20:09:03.000+0000 I node: rs0/mongodb-1a1-rs0-2.mongodb-1a1-rs0.percona-everest.svc.cluster.local:27017
2023-11-14T20:09:03.000+0000 I listening for the commands

As a result, pbm-agent cannot continue, restart, or fail the backup after the restart, and the backup hangs.

This is a critical issue for us because backups stop working entirely after a pbm-agent restart. Please fix it.

Environment

None

Attachments

Linked issues

relates to

PBM-673

The leader should mark backup as failed if none of the nodes started backup

Activity

Boris Ilijic August 14, 2024 at 11:48 AM
Edited

Hi,

This issue cannot be reproduced using the PBM CLI.
The only way to reproduce it is to send the backup command directly to the pbmCmd collection, which is the approach the Operator currently uses:
https://github.com/percona/percona-server-mongodb-operator/blob/main/pkg/controller/perconaservermongodbbackup/backup.go#L80

To simulate the same usage pattern as the Operator, I used this command:

Explanation of the bug:
When the Operator triggers a backup using SendCmd, PBM's lock validation (guard) is skipped. PBM performs that check on the CLI side, and the Operator bypasses the CLI by writing to pbmCmd directly:

[pbm cli] -> [pbmCmd communication 'channel'] -> [pbmAgent]

When a PBM Agent is restarted (e.g. because it was OOMKilled), the backup is typically still running on the other nodes. The Operator then sends a new backup command, and because the lock check is skipped, PBM Agents start running things in parallel (the old backup on the non-restarted nodes, the new backup on the restarted node). The end result is what is described in the ticket.
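The difference between the two paths can be illustrated with a toy model. This is only a sketch under stated assumptions, not PBM's actual code: `Cluster`, `CLIBackup`, `ErrBackupRunning`, and the in-memory `commands` slice are hypothetical stand-ins, and the `SendCmd` method here only mimics writing a command document into pbmCmd without any validation.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrBackupRunning is a hypothetical error used only in this sketch.
var ErrBackupRunning = errors.New("another backup is still running")

// Cluster is a toy, in-memory stand-in for PBM's control collections.
type Cluster struct {
	activeLock bool     // true while some node still holds a backup lock
	commands   []string // stand-in for documents in the pbmCmd collection
}

// SendCmd mimics writing a command straight into pbmCmd: no checks at all.
func (c *Cluster) SendCmd(cmd string) {
	c.commands = append(c.commands, cmd)
}

// CLIBackup mimics the CLI path: validate the lock first, then send.
func (c *Cluster) CLIBackup(name string) error {
	if c.activeLock {
		return ErrBackupRunning
	}
	c.SendCmd("backup:" + name)
	return nil
}

func main() {
	// An old backup is still running on the non-restarted nodes.
	c := &Cluster{activeLock: true}

	// CLI path: the guard rejects the new backup.
	fmt.Println("cli path:", c.CLIBackup("b1"))

	// Direct path: the command is queued even though a lock is held.
	c.SendCmd("backup:b2")
	fmt.Println("queued commands:", c.commands)
}
```

In this model the guard lives only in `CLIBackup`, so any caller that uses `SendCmd` directly bypasses it, which matches the situation described for the Operator.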

This fix prevents that error situation by not allowing a new backup to start until the old backup is completely done (on all shards).
When the Operator starts a new backup while the previous one is still in progress, the following error message will be logged:

In the future, we can add a RunBackup command to the SDK, or something similar, to make things easier for everybody.
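The rule behind the fix (no new backup until the previous one is done on all shards) could look roughly like the check below. This is a minimal sketch, not the change that shipped in PBM: `ShardStatus`, its values, and `CanStartBackup` are hypothetical names, and the error text is illustrative only, since the actual log message is not shown in this ticket.

```go
package main

import "fmt"

// ShardStatus is a hypothetical per-shard state of the previous backup.
type ShardStatus string

const (
	StatusRunning ShardStatus = "running"
	StatusDone    ShardStatus = "done"
	StatusError   ShardStatus = "error"
)

// CanStartBackup returns nil only when no shard is still running the
// previous backup. The error text is illustrative, not PBM's real message.
func CanStartBackup(prev map[string]ShardStatus) error {
	for shard, st := range prev {
		if st == StatusRunning {
			return fmt.Errorf("previous backup is still running on shard %q", shard)
		}
	}
	return nil
}

func main() {
	// The agent on rs0 was restarted, but rs1 is still working on the old backup.
	prev := map[string]ShardStatus{"rs0": StatusError, "rs1": StatusRunning}
	fmt.Println("start allowed:", CanStartBackup(prev) == nil)

	// Once every shard has finished (successfully or not), a new backup may start.
	prev["rs1"] = StatusDone
	fmt.Println("start allowed:", CanStartBackup(prev) == nil)
}
```

Performing the check where the command is consumed, rather than where it is issued, means it applies regardless of whether the command came from the CLI or was written directly to the command collection.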

Jan Mynar June 27, 2024 at 9:39 AM

According to the team's findings, the agent was restarted.
We have a proposal for how to fix it: get rid of the logs, mark the backup as failed, and make it possible to run a backup again.

Slava Sarzhan June 20, 2024 at 1:45 PM

Hi, please check the attachment.

If you need any additional information from me, just ping me.

Jan Mynar May 30, 2024 at 9:33 AM

Could you please reproduce this issue and provide the PBM agent container logs from all agents? Thanks.

Slava Sarzhan May 7, 2024 at 4:54 PM

Hi,

In my case, I can't start a new backup.

From pbm:

As you can see, all new backups are just "stuck".

Done

Details
Assignee
Boris Ilijic
Reporter
Slava Sarzhan
Labels
dbaasreviewed
Needs QA
Yes
Sprint
None
Fix versions
2.6.0
Affects versions
2.3.0
Priority
Critical

Created November 16, 2023 at 9:18 AM

Updated September 2, 2024 at 8:57 AM

Resolved August 28, 2024 at 12:05 PM
