We use Sanoid to perform ZFS snapshot with the help of pre and post snapshot scripts. The pre snapshot script looks like this (/opt/sanoid-scripts/start-backup.sh):
#!/bin/bash
set -e
. /etc/pbm.env
if [ "$(pbm status -o json | jq -r '.running')" != "{}" ] ; then
echo "Backup already in progress"
exit 1
fi
pbm backup -t external
Then Sanoid creates a snapshot.
The post-snapshot script is (/opt/sanoid-scripts/stop-backup.sh):
#!/bin/bash
set -e
. /etc/pbm.env
if [ "$(pbm status -o json | jq -r '.running')" == "{}" ] ; then
echo "Backup not in progress"
exit 1
fi
BACKUP_NAME=$(pbm status -o json | jq -r ".running.name")
if [ -z "${BACKUP_NAME}" ] ; then
echo "Could not find running backup name"
exit 1
fi
pbm backup-finish ${BACKUP_NAME}
The snapshot has been created:
Jan 25 00:00:01 pm-1-backup-1.fqdn sanoid[1570056]: INFO: taking snapshots...
Jan 25 00:00:01 pm-1-backup-1.fqdn sanoid[1570056]: executing pre_snapshot_script '/opt/sanoid-scripts/start-backup.sh' on dataset 'tank/cluster1'
Jan 25 00:00:08 pm-1-backup-1.fqdn sanoid[1570083]: Starting backup '2024-01-25T00:00:01Z'......Ready to copy data from:
Jan 25 00:00:08 pm-1-backup-1.fqdn sanoid[1570083]: - pm-1-backup-1.fqdn:27017
Jan 25 00:00:08 pm-1-backup-1.fqdn sanoid[1570083]: After the copy is done, run: pbm backup-finish 2024-01-25T00:00:01Z
Jan 25 00:00:08 pm-1-backup-1.fqdn sanoid[1570056]: taking snapshot tank/cluster1@autosnap_2024-01-25_00:00:01_daily
Jan 25 00:00:08 pm-1-backup-1.fqdn sanoid[1570056]: executing post_snapshot_script '/opt/sanoid-scripts/stop-backup.sh' on dataset 'tank/cluster1'
Jan 25 00:00:08 pm-1-backup-1.fqdn sanoid[1570318]: Command sent. Check `pbm describe-backup 2024-01-25T00:00:01Z` for the result.
But the backup itself is marked as still running:
Currently running:
==================
Snapshot backup "2024-01-25T00:00:01Z", started at 2024-01-25T00:00:02Z. Status: copyDone. [op id: 65b1a481df6287206ac93e55]
It is now impossible to create a new backup:
pm-1-backup-1.fqdn ~ # pbm backup -t external
Error: another operation in progress, Snapshot backup/65b1a481df6287206ac93e55 [cluster1/pm-1-backup-1.fqdn:27017]
Nor delete the current one:
pm-1-backup-1.fqdn ~ # pbm delete-backup -f 2024-01-25T00:00:01Z
Waiting for delete to be done ..Error: deleting: unable to delete backup in copyDone state
The error can be reproduced by creating a script that runs both pre and post script sequentially:
#!/bin/bash
set -e
/opt/sanoid-scripts/start-backup.sh
/opt/sanoid-scripts/stop-backup.sh
The fix is to follow the optional step (3) described in the documentation and perform a describe backup, then search for statuses and wait if they are not “copyReady” (at PBM and replica set levels) before ending the pre snapshot script (/opt/sanoid-scripts/start-backup.sh):
#!/bin/bash
set -e
. /etc/pbm.env
if [ "$(pbm status -o json | jq -r '.running')" != "{}" ] ; then
echo "Backup already in progress"
exit 1
fi
pbm backup -t external
# Wait for backup to be ready
BACKUP_NAME=$(pbm status -o json | jq -r ".running.name")
if [ -z "${BACKUP_NAME}" ] ; then
echo "Could not find running backup name"
exit 1
fi
for i in $(seq 1 12) ; do
BACKUP_STATUS=$(pbm describe-backup ${BACKUP_NAME} -o json | jq -r ".status" | head -n 1)
BACKUP_REPLICA_SET_STATUS=$(pbm describe-backup ${BACKUP_NAME} -o json | jq -r ".replsets[].status" | head -n 1)
if [ "${BACKUP_STATUS}" == "copyReady" ] && [ "${BACKUP_REPLICA_SET_STATUS}" == "copyReady" ] ; then
exit 0
fi
"Backup not ready, waiting 5 seconds..."
sleep 5
done
echo "Backup not ready"
exit 1
Is the describe operation really optional? What could explain this broken backup state when all the operations are performed very quickly?
Kind regards,
Environment
None
Activity
Show:
Julien Riou February 7, 2024 at 3:39 PM
Hi,
No worries. We have added a 60 seconds delay after the backup creation and there is no error anymore. But this is a quick-and-dirty workaround.
Aaditya Dubey February 7, 2024 at 2:57 PM
Hi @Julien Riou
Thank you for the report. We are checking the issue, will keep you posted. Sorry for delay in response.
Julien Riou January 29, 2024 at 9:57 AM
The “workaround” has been applied for 3 days and we still have the issue:
pm-1-backup-1.fqdn ~ # pbm status
Cluster:
========
cluster1:
- cluster1/pm-1-arbiter-1.fqdn:27017 [!Arbiter]: arbiter node is not supported
- cluster1/pm-1-node-1.fqdn:27017 [P]: pbm-agent v2.3.1 OK
- cluster1/pm-1-node-2.fqdn:27017 [S]: pbm-agent v2.3.1 OK
- cluster1/pm-1-backup-1.fqdn:27017 [S]: pbm-agent v2.3.1 OK
PITR incremental backup:
========================
Status [ON]
Currently running:
==================
Snapshot backup "2024-01-27T00:00:01Z", started at 2024-01-27T00:00:01Z. Status: copyDone. [op id: 65b4478180cbf7c32750c0c7]
Backups:
========
S3 gra s3://https://s3.gra.io.cloud.ovh.net/fr-internal-lab-cluster-1
Snapshots:
2024-01-27T00:00:01Z 0.00B <external> [running: copyDone / 2024-01-27T00:00:08Z]
2024-01-26T00:00:01Z 0.00B <external> [restore_to_time: 2024-01-26T00:00:03Z]
2024-01-25T14:18:26Z 0.00B <external> [restore_to_time: 2024-01-25T14:18:28Z]
PITR chunks [12.06MB]:
2024-01-26T16:43:59Z - 2024-01-27T00:00:03Z (no base snapshot)
Hello,
We use Sanoid to perform ZFS snapshot with the help of pre and post snapshot scripts. The pre snapshot script looks like this (
/opt/sanoid-scripts/start-backup.sh
):#!/bin/bash set -e . /etc/pbm.env if [ "$(pbm status -o json | jq -r '.running')" != "{}" ] ; then echo "Backup already in progress" exit 1 fi pbm backup -t external
Then Sanoid creates a snapshot.
The post-snapshot script is (
/opt/sanoid-scripts/stop-backup.sh
):#!/bin/bash set -e . /etc/pbm.env if [ "$(pbm status -o json | jq -r '.running')" == "{}" ] ; then echo "Backup not in progress" exit 1 fi BACKUP_NAME=$(pbm status -o json | jq -r ".running.name") if [ -z "${BACKUP_NAME}" ] ; then echo "Could not find running backup name" exit 1 fi pbm backup-finish ${BACKUP_NAME}
The snapshot has been created:
Jan 25 00:00:01 pm-1-backup-1.fqdn sanoid[1570056]: INFO: taking snapshots... Jan 25 00:00:01 pm-1-backup-1.fqdn sanoid[1570056]: executing pre_snapshot_script '/opt/sanoid-scripts/start-backup.sh' on dataset 'tank/cluster1' Jan 25 00:00:08 pm-1-backup-1.fqdn sanoid[1570083]: Starting backup '2024-01-25T00:00:01Z'......Ready to copy data from: Jan 25 00:00:08 pm-1-backup-1.fqdn sanoid[1570083]: - pm-1-backup-1.fqdn:27017 Jan 25 00:00:08 pm-1-backup-1.fqdn sanoid[1570083]: After the copy is done, run: pbm backup-finish 2024-01-25T00:00:01Z Jan 25 00:00:08 pm-1-backup-1.fqdn sanoid[1570056]: taking snapshot tank/cluster1@autosnap_2024-01-25_00:00:01_daily Jan 25 00:00:08 pm-1-backup-1.fqdn sanoid[1570056]: executing post_snapshot_script '/opt/sanoid-scripts/stop-backup.sh' on dataset 'tank/cluster1' Jan 25 00:00:08 pm-1-backup-1.fqdn sanoid[1570318]: Command sent. Check `pbm describe-backup 2024-01-25T00:00:01Z` for the result.
But the backup itself is marked as still running:
Currently running: ================== Snapshot backup "2024-01-25T00:00:01Z", started at 2024-01-25T00:00:02Z. Status: copyDone. [op id: 65b1a481df6287206ac93e55]
It is now impossible to create a new backup:
pm-1-backup-1.fqdn ~ # pbm backup -t external Error: another operation in progress, Snapshot backup/65b1a481df6287206ac93e55 [cluster1/pm-1-backup-1.fqdn:27017]
Nor delete the current one:
pm-1-backup-1.fqdn ~ # pbm delete-backup -f 2024-01-25T00:00:01Z Waiting for delete to be done ..Error: deleting: unable to delete backup in copyDone state
Here is the backup description:
name: "2024-01-25T00:00:01Z" opid: 65b1a481df6287206ac93e55 type: external last_write_time: "2024-01-25T00:00:03Z" last_transition_time: "2024-01-25T00:00:08Z" mongodb_version: 7.0.2-1 fcv: "7.0" pbm_version: 2.3.1 status: copyDone size_h: 0 B replsets: - name: cluster1 status: copyReady node: pm-1-backup-1.fqdn:27017 last_write_time: "2024-01-25T00:00:03Z" last_transition_time: "2024-01-25T00:00:06Z" security: {}
PBM version:
Version: 2.3.1 Platform: linux/amd64 GitCommit: 8c4265cfb2d9a7581b782a829246d8fcb6c7d655 GitBranch: release-2.3.1 BuildTime: 2023-11-29_13:34_UTC GoVersion: go1.19
Installed via Percona Debian repositories.
The error can be reproduced by creating a script that runs both pre and post script sequentially:
#!/bin/bash set -e /opt/sanoid-scripts/start-backup.sh /opt/sanoid-scripts/stop-backup.sh
The fix is to follow the optional step (3) described in the documentation and perform a describe backup, then search for statuses and wait if they are not “copyReady” (at PBM and replica set levels) before ending the pre snapshot script (
/opt/sanoid-scripts/start-backup.sh
):#!/bin/bash set -e . /etc/pbm.env if [ "$(pbm status -o json | jq -r '.running')" != "{}" ] ; then echo "Backup already in progress" exit 1 fi pbm backup -t external # Wait for backup to be ready BACKUP_NAME=$(pbm status -o json | jq -r ".running.name") if [ -z "${BACKUP_NAME}" ] ; then echo "Could not find running backup name" exit 1 fi for i in $(seq 1 12) ; do BACKUP_STATUS=$(pbm describe-backup ${BACKUP_NAME} -o json | jq -r ".status" | head -n 1) BACKUP_REPLICA_SET_STATUS=$(pbm describe-backup ${BACKUP_NAME} -o json | jq -r ".replsets[].status" | head -n 1) if [ "${BACKUP_STATUS}" == "copyReady" ] && [ "${BACKUP_REPLICA_SET_STATUS}" == "copyReady" ] ; then exit 0 fi "Backup not ready, waiting 5 seconds..." sleep 5 done echo "Backup not ready" exit 1
Is the describe operation really optional? What could explain this broken backup state when all the operations are performed very quickly?
Kind regards,