PBM failed to finish external backup

Description

Hello,

We use Sanoid to perform ZFS snapshots with the help of pre- and post-snapshot scripts. The pre-snapshot script looks like this (/opt/sanoid-scripts/start-backup.sh):

#!/bin/bash
set -e

. /etc/pbm.env

if [ "$(pbm status -o json | jq -r '.running')" != "{}" ]; then
    echo "Backup already in progress"
    exit 1
fi

pbm backup -t external

Then Sanoid creates a snapshot.
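For context, the Sanoid side is wired up roughly like this in /etc/sanoid/sanoid.conf (a sketch; the template name and retention values are assumptions — only the script paths and dataset name come from the logs above):

```ini
[tank/cluster1]
        use_template = production
        # hooks that drive PBM around each snapshot
        pre_snapshot_script = /opt/sanoid-scripts/start-backup.sh
        post_snapshot_script = /opt/sanoid-scripts/stop-backup.sh

[template_production]
        daily = 30
        autosnap = yes
        autoprune = yes
```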

The post-snapshot script is (/opt/sanoid-scripts/stop-backup.sh):

#!/bin/bash
set -e

. /etc/pbm.env

if [ "$(pbm status -o json | jq -r '.running')" == "{}" ]; then
    echo "Backup not in progress"
    exit 1
fi

BACKUP_NAME=$(pbm status -o json | jq -r ".running.name")
if [ -z "${BACKUP_NAME}" ]; then
    echo "Could not find running backup name"
    exit 1
fi

pbm backup-finish "${BACKUP_NAME}"

The snapshot has been created:

Jan 25 00:00:01 pm-1-backup-1.fqdn sanoid[1570056]: INFO: taking snapshots...
Jan 25 00:00:01 pm-1-backup-1.fqdn sanoid[1570056]: executing pre_snapshot_script '/opt/sanoid-scripts/start-backup.sh' on dataset 'tank/cluster1'
Jan 25 00:00:08 pm-1-backup-1.fqdn sanoid[1570083]: Starting backup '2024-01-25T00:00:01Z'......Ready to copy data from:
Jan 25 00:00:08 pm-1-backup-1.fqdn sanoid[1570083]: - pm-1-backup-1.fqdn:27017
Jan 25 00:00:08 pm-1-backup-1.fqdn sanoid[1570083]: After the copy is done, run: pbm backup-finish 2024-01-25T00:00:01Z
Jan 25 00:00:08 pm-1-backup-1.fqdn sanoid[1570056]: taking snapshot tank/cluster1@autosnap_2024-01-25_00:00:01_daily
Jan 25 00:00:08 pm-1-backup-1.fqdn sanoid[1570056]: executing post_snapshot_script '/opt/sanoid-scripts/stop-backup.sh' on dataset 'tank/cluster1'
Jan 25 00:00:08 pm-1-backup-1.fqdn sanoid[1570318]: Command sent. Check `pbm describe-backup 2024-01-25T00:00:01Z` for the result.

But the backup itself is marked as still running:

Currently running:
==================
Snapshot backup "2024-01-25T00:00:01Z", started at 2024-01-25T00:00:02Z. Status: copyDone. [op id: 65b1a481df6287206ac93e55]

It is now impossible to create a new backup:

pm-1-backup-1.fqdn ~ # pbm backup -t external
Error: another operation in progress, Snapshot backup/65b1a481df6287206ac93e55 [cluster1/pm-1-backup-1.fqdn:27017]

Nor delete the current one:

pm-1-backup-1.fqdn ~ # pbm delete-backup -f 2024-01-25T00:00:01Z
Waiting for delete to be done ..Error: deleting: unable to delete backup in copyDone state

Here is the backup description:

name: "2024-01-25T00:00:01Z"
opid: 65b1a481df6287206ac93e55
type: external
last_write_time: "2024-01-25T00:00:03Z"
last_transition_time: "2024-01-25T00:00:08Z"
mongodb_version: 7.0.2-1
fcv: "7.0"
pbm_version: 2.3.1
status: copyDone
size_h: 0 B
replsets:
- name: cluster1
  status: copyReady
  node: pm-1-backup-1.fqdn:27017
  last_write_time: "2024-01-25T00:00:03Z"
  last_transition_time: "2024-01-25T00:00:06Z"
security: {}

PBM version:

Version:   2.3.1
Platform:  linux/amd64
GitCommit: 8c4265cfb2d9a7581b782a829246d8fcb6c7d655
GitBranch: release-2.3.1
BuildTime: 2023-11-29_13:34_UTC
GoVersion: go1.19

Installed via Percona Debian repositories.

The error can be reproduced by creating a script that runs both the pre and post scripts sequentially:

#!/bin/bash
set -e

/opt/sanoid-scripts/start-backup.sh
/opt/sanoid-scripts/stop-backup.sh

The fix is to follow the optional step (3) described in the documentation: run a describe-backup, check the statuses, and wait until they are "copyReady" (at both the PBM and replica-set levels) before the pre-snapshot script (/opt/sanoid-scripts/start-backup.sh) exits:

#!/bin/bash
set -e

. /etc/pbm.env

if [ "$(pbm status -o json | jq -r '.running')" != "{}" ]; then
    echo "Backup already in progress"
    exit 1
fi

pbm backup -t external

# Wait for the backup to be ready
BACKUP_NAME=$(pbm status -o json | jq -r ".running.name")
if [ -z "${BACKUP_NAME}" ]; then
    echo "Could not find running backup name"
    exit 1
fi

for i in $(seq 1 12); do
    BACKUP_STATUS=$(pbm describe-backup "${BACKUP_NAME}" -o json | jq -r ".status" | head -n 1)
    BACKUP_REPLICA_SET_STATUS=$(pbm describe-backup "${BACKUP_NAME}" -o json | jq -r ".replsets[].status" | head -n 1)
    if [ "${BACKUP_STATUS}" == "copyReady" ] && [ "${BACKUP_REPLICA_SET_STATUS}" == "copyReady" ]; then
        exit 0
    fi
    echo "Backup not ready, waiting 5 seconds..."
    sleep 5
done

echo "Backup not ready"
exit 1
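The fixed 12×5-second loop could also be factored into a generic polling helper (an illustrative sketch; `wait_for` is a hypothetical name, not a PBM or Sanoid facility — the flag-file example stands in for the real `pbm describe-backup` check):

```shell
#!/bin/bash
set -e

# wait_for CONDITION TIMEOUT [INTERVAL]: re-evaluate CONDITION every
# INTERVAL seconds (default 5) until it succeeds or TIMEOUT seconds elapse.
wait_for() {
    local condition="$1" timeout="$2" interval="${3:-5}" elapsed=0
    until eval "$condition"; do
        if [ "$elapsed" -ge "$timeout" ]; then
            return 1
        fi
        sleep "$interval"
        elapsed=$((elapsed + interval))
    done
}

# Self-contained demonstration: wait for a flag file instead of a PBM status.
touch /tmp/copy-ready.flag
if wait_for '[ -f /tmp/copy-ready.flag ]' 2 1; then
    echo "ready"
else
    echo "timed out"
fi
```

In the pre-snapshot script, the explicit for loop would then reduce to a single call with the status check as the condition and a 60-second timeout.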

Is the describe operation really optional? What could explain this broken backup state when all the operations are performed very quickly?

Kind regards,

Environment

None

Activity


Julien Riou February 7, 2024 at 3:39 PM

Hi,

No worries. We have added a 60-second delay after the backup creation and there is no error anymore, but this is a quick-and-dirty workaround.

Aaditya Dubey February 7, 2024 at 2:57 PM

Hi

Thank you for the report.
We are checking the issue and will keep you posted.
Sorry for the delayed response.

Julien Riou January 29, 2024 at 9:57 AM

The “workaround” has been applied for 3 days and we still have the issue:

pm-1-backup-1.fqdn ~ # pbm status
Cluster:
========
cluster1:
- cluster1/pm-1-arbiter-1.fqdn:27017 [!Arbiter]: arbiter node is not supported
- cluster1/pm-1-node-1.fqdn:27017 [P]: pbm-agent v2.3.1 OK
- cluster1/pm-1-node-2.fqdn:27017 [S]: pbm-agent v2.3.1 OK
- cluster1/pm-1-backup-1.fqdn:27017 [S]: pbm-agent v2.3.1 OK

PITR incremental backup:
========================
Status [ON]

Currently running:
==================
Snapshot backup "2024-01-27T00:00:01Z", started at 2024-01-27T00:00:01Z. Status: copyDone. [op id: 65b4478180cbf7c32750c0c7]

Backups:
========
S3 gra s3://https://s3.gra.io.cloud.ovh.net/fr-internal-lab-cluster-1
  Snapshots:
    2024-01-27T00:00:01Z 0.00B <external> [running: copyDone / 2024-01-27T00:00:08Z]
    2024-01-26T00:00:01Z 0.00B <external> [restore_to_time: 2024-01-26T00:00:03Z]
    2024-01-25T14:18:26Z 0.00B <external> [restore_to_time: 2024-01-25T14:18:28Z]
  PITR chunks [12.06MB]:
    2024-01-26T16:43:59Z - 2024-01-27T00:00:03Z (no base snapshot)

Details


Created January 25, 2024 at 2:31 PM
Updated October 9, 2024 at 10:16 AM
Resolved September 24, 2024 at 12:24 PM