Backup stalls with S3 upload network issue
Description
Environment
AFFECTED CS IDs
Attachments
Activity
Pavel Tankov February 16, 2024 at 10:08 AM
So, I tried with MinIO instead of Amazon S3, as Slava suggested. Now everything works as expected. Again, as Slava suggested, I probably wasn't testing it properly with Amazon S3, because I was blocking the entire network for the backup pod instead of blocking just the connection between the backup pod and the S3 bucket.
Now, with MinIO, I blocked the network on the MinIO pod so that the backup pod couldn't connect to it but otherwise kept its network. The backup pod failed after some retries, and eventually the backup completed:
kubectl -n pxco get pxc-backup
NAME CLUSTER STORAGE DESTINATION STATUS COMPLETED AGE
backup1 cluster1 minio s3://operator-testing/cluster1-2024-02-16-09:56:46-full Succeeded 10m 10m
backup2 cluster1 minio s3://operator-testing/cluster1-2024-02-16-10:00:16-full Succeeded 5m41s 7m2s
We can consider the bug as fixed.
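For reference, blocking the MinIO pod's network can be done with a Chaos Mesh NetworkChaos object along these lines. This is only a sketch; the exact object used in this test is not attached, and the MinIO pod name below is illustrative:
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: chaos-minio-network-loss
  namespace: pxco
spec:
  action: loss
  mode: all
  selector:
    pods:
      pxco:
        - minio-pod-name   # illustrative; put the real MinIO pod name here
  loss:
    loss: "100"
    correlation: "100"
  duration: "600s"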
Pavel Tankov February 14, 2024 at 6:37 PM
Please see the attached logs. Also:
kubectl -n pxco get pxc-backup
NAME CLUSTER STORAGE DESTINATION STATUS COMPLETED AGE
backup1 cluster1 s3-us-west s3://ptankov-copied-permissions-from-operator-testing/cluster1-2024-02-14-17:47:03-full Succeeded 48m 48m
backup2 cluster1 s3-us-west s3://ptankov-copied-permissions-from-operator-testing/cluster1-2024-02-14-18:16:51-full Running 18m
and
kubectl -n pxco get pxc-backup backup2 -oyaml
apiVersion: pxc.percona.com/v1
kind: PerconaXtraDBClusterBackup
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"pxc.percona.com/v1","kind":"PerconaXtraDBClusterBackup","metadata":{"annotations":{},"name":"backup2","namespace":"pxco"},"spec":{"pxcCluster":"cluster1","storageName":"s3-us-west"}}
  creationTimestamp: "2024-02-14T18:16:51Z"
  generation: 1
  name: backup2
  namespace: pxco
  resourceVersion: "31035"
  uid: e1968012-74f6-455e-ad9a-08853db4a961
spec:
  pxcCluster: cluster1
  storageName: s3-us-west
status:
  destination: s3://ptankov-copied-permissions-from-operator-testing/cluster1-2024-02-14-18:16:51-full
  image: perconalab/percona-xtradb-cluster-operator:main-pxc8.0-backup
  s3:
    bucket: ptankov-copied-permissions-from-operator-testing
    credentialsSecret: my-cluster-name-backup-s3
    region: us-east-1
  sslInternalSecretName: cluster1-ssl-internal
  sslSecretName: cluster1-ssl
  state: Running
  storage_type: s3
  storageName: s3-us-west
  vaultSecretName: cluster1-vault
  verifyTLS: true
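For reference, the manifest that created backup2 (reconstructed from the last-applied-configuration annotation above) boils down to:
apiVersion: pxc.percona.com/v1
kind: PerconaXtraDBClusterBackup
metadata:
  name: backup2
  namespace: pxco
spec:
  pxcCluster: cluster1
  storageName: s3-us-west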
Pavel Tankov February 12, 2024 at 12:09 PM
There is something that bothers me. Here is the scenario:
1. Deploy the PXC operator and a PXC server.
2. Have an Amazon S3 bucket with proper permissions for a backup.
3. Deploy Chaos Mesh in the same namespace as the PXC operator. We will need it in order to simulate a network outage for the backup pod. I used the following command:
helm install chaos-mesh chaos-mesh/chaos-mesh --namespace=pxco --set chaosDaemon.runtime=containerd --set chaosDaemon.socketPath=/run/containerd/containerd.sock --set dashboard.create=false --version 2.5.1
4. This one is important: make one manual (on-demand) backup and let it complete successfully. Don't interrupt it.
5. Optional: insert some data into the MySQL server (whatever, just insert something, or not; however you prefer).
6. Do another manual backup, but quickly, while the backup pod is running, execute the following command, where backup-pod-name is the name of the running backup pod (you can use the attached chaos-network-loss.yml file; a sketch of its likely shape follows this list):
yq eval '.metadata.name = "chaos-pod-network-loss" | del(.spec.selector.pods.test-namespace) | .spec.selector.pods.pxco[0] = "backup-pod-name"' chaos-network-loss.yml | kubectl apply -f -
7. Now the pod backup-pod-name has its network interrupted. You can verify this by getting a shell inside the running pod and trying curl google.com; you should not be able to succeed.
8. Wait for 10 minutes; this is the duration set in the file chaos-network-loss.yml (600 seconds).
9. Get a shell again in the pod backup-pod-name and verify that its network is now working just fine.
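The attached chaos-network-loss.yml is not reproduced in this ticket; judging by the yq expression in step 6 and the 600-second duration from step 8, it presumably looks roughly like this (the namespace selector and pod name are placeholders that the yq command rewrites):
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: chaos-network-loss
  namespace: pxco
spec:
  action: loss
  mode: all
  selector:
    pods:
      test-namespace:
        - some-pod-name
  loss:
    loss: "100"
    correlation: "100"
  duration: "600s"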
Problem: the pod backup-pod-name remains in the "Running" status forever. Also:
k -n pxco get pxc-backup
NAME CLUSTER STORAGE DESTINATION STATUS COMPLETED AGE
backup1 cluster1 s3-us-west s3://ptankov-copied-permissions-from-operator-testing/cluster1-2024-02-12-11:14:37-full Succeeded 53m 54m
backup2 cluster1 s3-us-west s3://ptankov-copied-permissions-from-operator-testing/cluster1-2024-02-12-11:28:47-full Running 39m
Here ^^ the first backup is the one we let succeed, and the second one is the one that never succeeds, nor does it fail.
Nickolay Ihalainen October 16, 2023 at 9:21 AM
@Slava Sarzhan --max-retries and --max-backoff work only with specific errors:
https://docs.percona.com/percona-xtrabackup/8.0/xbcloud-exbackoff.html#retriable-errors
https://perconadev.atlassian.net/browse/PXB-3073 introduces a --timeout option with a default of 120 seconds (which could potentially cause problems for the operator if the S3 server was previously slower than 120 seconds). It should produce a CURLE_OPERATION_TIMEDOUT error, which can already be retried. Thus it should be enough to just upgrade to PXB 8.0.34 without adding --timeout=something to the operator code.
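For illustration only, this is roughly how those xbcloud knobs combine on the command line (assuming PXB 8.0.34 or newer; the bucket and backup names are made up, and this is not the exact command the operator runs):
xtrabackup --backup --stream=xbstream --target-dir=/tmp | \
  xbcloud put --storage=s3 \
    --s3-bucket=operator-testing \
    --parallel=4 \
    --max-retries=10 \
    --max-backoff=300000 \
    --timeout=120 \
    cluster1-manual-full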
Przemyslaw Malkowski October 13, 2023 at 2:08 PM
@Slava Sarzhan Indeed, that should solve the issue. I can see this was just implemented in the latest xbcloud version:
https://jira.percona.com/browse/PXB-3073
So, to solve that issue for S3 upload, make sure to use that option in the next operator version, based on PXB 8.0.34.
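A quick way to check whether a given backup image already ships PXB 8.0.34 is to query the binary inside a running backup pod (the pod name below is a placeholder):
kubectl -n pxco exec backup-pod-name -- xtrabackup --version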
During the backup upload to S3 from the garbd node, a network issue broke the S3 connection, but the backup script did not detect the stall, and the backup pod was left hanging.
The last messages from the backup pod were (real names obfuscated):
(...)
2023-07-12 02:02:41.215 INFO: [SST script] 230712 02:02:41 xbcloud: successfully uploaded chunk: ***-new-2023-07-12-02:00:31-full/dbname/table1.ibd.lz4.00000000000000000368, size: 10485817
2023-07-12 02:02:41.215 INFO: [SST script] 230712 02:02:41 xbcloud: successfully uploaded chunk: ***-new-2023-07-12-02:00:31-full/dbname/table2.ibd.lz4.00000000000000000345, size: 10485829
2023-07-12 06:54:38.555 WARN: Handshake failed: wrong version number
2023-07-12 06:54:38.559 WARN: Handshake failed: peer did not return a certificate
As seen above, there was an almost five-hour gap between the last chunk upload and the warning, but even the warning did not stop the backup pod. Only deleting the backup pod manually made it possible to re-run the backup.
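In practice, the manual cleanup amounts to something like this (the namespace and pod name are illustrative, not taken from the affected environment):
kubectl -n pxco delete pod backup-pod-name   # let the operator re-run the backup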
Suggested fix: monitor the progress of chunk uploads, and if there is a long enough stall, retry the S3 connection. If the upload cannot be performed, interrupt the backup with an error.
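To make the idea concrete, here is a minimal watchdog sketch of the kind of stall detection being suggested, written as a standalone script rather than as the actual operator change; the namespace, pod name, and threshold are assumptions:
#!/bin/bash
# Hypothetical stall watchdog: fail the backup if no new chunk has been
# uploaded for STALL_TIMEOUT seconds. All names below are illustrative.
NAMESPACE=pxco
BACKUP_POD=backup-pod-name
STALL_TIMEOUT=900          # 15 minutes without progress counts as a stall

last_progress=$(date +%s)
prev_chunks=-1
while kubectl -n "$NAMESPACE" get pod "$BACKUP_POD" >/dev/null 2>&1; do
  # Count the "successfully uploaded chunk" lines seen so far in the pod log.
  chunks=$(kubectl -n "$NAMESPACE" logs "$BACKUP_POD" 2>/dev/null | grep -c 'successfully uploaded chunk')
  if [ "$chunks" -ne "$prev_chunks" ]; then
    prev_chunks=$chunks
    last_progress=$(date +%s)
  fi
  if [ $(( $(date +%s) - last_progress )) -gt "$STALL_TIMEOUT" ]; then
    echo "No upload progress for ${STALL_TIMEOUT}s, failing the backup" >&2
    kubectl -n "$NAMESPACE" delete pod "$BACKUP_POD"
    exit 1
  fi
  sleep 30
done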