Backup stalls with S3 upload network issue

Description

During the backup upload to S3 from the garbd node, a network issue broke the S3 connection, but the backup script did not detect the stall, and the backup pod hung indefinitely.

The last messages from the backup pod were (real names obfuscated):

 

(...)
2023-07-12 02:02:41.215  INFO: [SST script] 230712 02:02:41 xbcloud: successfully uploaded chunk: ***-new-2023-07-12-02:00:31-full/dbname/table1.ibd.lz4.00000000000000000368, size: 10485817
2023-07-12 02:02:41.215  INFO: [SST script] 230712 02:02:41 xbcloud: successfully uploaded chunk: ***-new-2023-07-12-02:00:31-full/dbname/table2.ibd.lz4.00000000000000000345, size: 10485829
2023-07-12 06:54:38.555  WARN: Handshake failed: wrong version number
2023-07-12 06:54:38.559  WARN: Handshake failed: peer did not return a certificate

As seen above, there was an almost five-hour gap between the last chunk upload and the warning. But even the warning did not stop the backup pod. Only deleting the backup pod manually made it possible to re-run the backup.

 

Suggested fix: monitor the progress of chunk uploads and, if there is a long enough stall, retry the S3 connection. If the upload cannot be resumed, interrupt the backup with an error.
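For illustration only, a minimal sketch of such a stall watchdog, assuming a bash wrapper around xbcloud; S3_BUCKET, BACKUP_NAME and the 600-second threshold are placeholders, not values from the operator:

# Minimal sketch only, not the operator's actual backup script.
# It aborts the upload if the xbcloud log shows no progress for STALL_TIMEOUT seconds.
STALL_TIMEOUT=600        # assumed stall threshold, in seconds
LOG=/tmp/xbcloud.log

xtrabackup --backup --stream=xbstream --target-dir=/tmp \
  | xbcloud put --storage=s3 --s3-bucket="$S3_BUCKET" "$BACKUP_NAME" >"$LOG" 2>&1 &
UPLOAD_PID=$!

while kill -0 "$UPLOAD_PID" 2>/dev/null; do
  sleep 30
  idle=$(( $(date +%s) - $(stat -c %Y "$LOG") ))
  if [ "$idle" -gt "$STALL_TIMEOUT" ]; then
    echo "xbcloud made no progress for ${STALL_TIMEOUT}s, aborting backup" >&2
    kill -TERM "$UPLOAD_PID"
    exit 1
  fi
done
wait "$UPLOAD_PID"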

 

 

Environment

None

AFFECTED CS IDs

CS0037954

Attachments

3
  • 14 Feb 2024, 06:37 PM
  • 14 Feb 2024, 06:37 PM
  • 12 Feb 2024, 12:09 PM

Activity

Pavel Tankov February 16, 2024 at 10:08 AM

So I tried with minio instead of Amazon S3, as Slava suggested. Now everything works as expected. Again, as Slava suggested, I probably wasn't testing it properly with Amazon S3, because I was blocking the entire network for the backup pod instead of blocking just the connection between the backup pod and the S3 bucket.
Now, with minio, I blocked the network on the minio pod so that the backup pod couldn't connect to it but was otherwise reachable. The backup pod failed after some retries, and eventually the backup completed:

kubectl -n pxco get pxc-backup
NAME      CLUSTER    STORAGE   DESTINATION                                                STATUS      COMPLETED   AGE
backup1   cluster1   minio     s3://operator-testing/cluster1-2024-02-16-09:56:46-full   Succeeded   10m         10m
backup2   cluster1   minio     s3://operator-testing/cluster1-2024-02-16-10:00:16-full   Succeeded   5m41s       7m2s
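For reference, the network block on the minio pod can be reproduced with the same attached chaos-network-loss.yml by pointing the selector at the minio pod instead of the backup pod. This is only an assumed sketch, not the exact command used; minio-pod-name is a placeholder and minio is assumed to run in the pxco namespace:

yq eval '.metadata.name = "chaos-minio-network-loss" |
  del(.spec.selector.pods.test-namespace) |
  .spec.selector.pods.pxco[0] = "minio-pod-name"' chaos-network-loss.yml \
  | kubectl apply -f -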

We can consider the bug as fixed.

Pavel Tankov February 14, 2024 at 6:37 PM

Please see the attached logs. Also:

kubectl -n pxco get pxc-backup
NAME      CLUSTER    STORAGE      DESTINATION                                                                                STATUS      COMPLETED   AGE
backup1   cluster1   s3-us-west   s3://ptankov-copied-permissions-from-operator-testing/cluster1-2024-02-14-17:47:03-full   Succeeded   48m         48m
backup2   cluster1   s3-us-west   s3://ptankov-copied-permissions-from-operator-testing/cluster1-2024-02-14-18:16:51-full   Running                 18m

and

kubectl -n pxco get pxc-backup backup2 -oyaml
apiVersion: pxc.percona.com/v1
kind: PerconaXtraDBClusterBackup
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"pxc.percona.com/v1","kind":"PerconaXtraDBClusterBackup","metadata":{"annotations":{},"name":"backup2","namespace":"pxco"},"spec":{"pxcCluster":"cluster1","storageName":"s3-us-west"}}
  creationTimestamp: "2024-02-14T18:16:51Z"
  generation: 1
  name: backup2
  namespace: pxco
  resourceVersion: "31035"
  uid: e1968012-74f6-455e-ad9a-08853db4a961
spec:
  pxcCluster: cluster1
  storageName: s3-us-west
status:
  destination: s3://ptankov-copied-permissions-from-operator-testing/cluster1-2024-02-14-18:16:51-full
  image: perconalab/percona-xtradb-cluster-operator:main-pxc8.0-backup
  s3:
    bucket: ptankov-copied-permissions-from-operator-testing
    credentialsSecret: my-cluster-name-backup-s3
    region: us-east-1
  sslInternalSecretName: cluster1-ssl-internal
  sslSecretName: cluster1-ssl
  state: Running
  storage_type: s3
  storageName: s3-us-west
  vaultSecretName: cluster1-vault
  verifyTLS: true

 

Pavel Tankov February 12, 2024 at 12:09 PM

There is something which bothers me. Here is a scenario:

  1. Deploy pxc operator and pxc server.

  2. Have an Amazon S3 bucket with proper permissions for a backup.

  3. Also deploy Chaos Mesh in the same namespace as the pxc operator. We will need it to simulate a network outage for the backup pod. I used the following command:

    helm install chaos-mesh chaos-mesh/chaos-mesh --namespace=pxco --set chaosDaemon.runtime=containerd --set chaosDaemon.socketPath=/run/containerd/containerd.sock --set dashboard.create=false --version 2.5.1
  4. This one is important: Make one manual (on demand) backup and let it complete successfully. Don’t interrupt it.

  5. Optional: Insert some data in the MySQL server (whatever, just insert something, or not - however you prefer).

  6. Do another manual backup, but quickly, while the backup pod is still running, execute the following command, where backup-pod-name is the name of the running backup pod (you can use the attached yaml file chaos-network-loss.yml; a rough sketch of such a manifest is shown after this list):

    yq eval '.metadata.name = "chaos-pod-network-loss" | del(.spec.selector.pods.test-namespace) | .spec.selector.pods.pxco[0] = "backup-pod-name"' chaos-network-loss.yml | kubectl apply -f -

  7. Now the pod backup-pod-name has its network interrupted. You can verify this by getting a shell inside the running pod and running curl google.com - it should fail.

  8. Wait for 10 minutes - this is the duration set in the file chaos-network-loss.yml (600 seconds).

  9. Get a shell in the pod backup-pod-name again and verify that its network is now working fine.
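For context, here is a hedged approximation of what a Chaos Mesh NetworkChaos manifest like the attached chaos-network-loss.yml could look like. The selector structure and 600-second duration follow the yq command in step 6 and the wait in step 8; the action details and mode are assumptions, and the actual attached file may differ:

apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: chaos-pod-network-loss
  namespace: pxco
spec:
  action: loss             # drop traffic to/from the selected pod
  mode: all
  selector:
    pods:
      pxco:
        - backup-pod-name  # placeholder, replaced by the yq command in step 6
  loss:
    loss: "100"            # assumed: drop 100% of packets
  duration: "600s"         # matches the 10-minute wait in step 8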

Problem: the pod backup-pod-name remains in the "Running" status forever. Also:

k -n pxco get pxc-backup
NAME      CLUSTER    STORAGE      DESTINATION                                                                                STATUS      COMPLETED   AGE
backup1   cluster1   s3-us-west   s3://ptankov-copied-permissions-from-operator-testing/cluster1-2024-02-12-11:14:37-full   Succeeded   53m         54m
backup2   cluster1   s3-us-west   s3://ptankov-copied-permissions-from-operator-testing/cluster1-2024-02-12-11:28:47-full   Running                 39m

Here ^^ the first backup is the one we let succeed, and the second one is the one that never succeeds, nor does it fail.

Nickolay Ihalainen October 16, 2023 at 9:21 AM

--max-retries and --max-backoff work only with specific errors:
https://docs.percona.com/percona-xtrabackup/8.0/xbcloud-exbackoff.html#retriable-errors

https://perconadev.atlassian.net/browse/PXB-3073 introduces a --timeout option with a default of 120 seconds (potentially it could cause problems for the operator if the S3 server was previously slower than 120 seconds). It should produce a CURLE_OPERATION_TIMEDOUT error (which can already be retried). Thus it should be enough to just upgrade to PXB 8.0.34 without adding --timeout=something to the operator code.
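For illustration, a hedged sketch of an xbcloud invocation combining the existing backoff flags with the new --timeout option from PXB 8.0.34. This is not the operator's actual command line; the bucket and backup names are placeholders, and the flag values shown are the documented defaults:

# --max-backoff is in milliseconds; --timeout is in seconds (default 120 per PXB-3073)
xtrabackup --backup --stream=xbstream --target-dir=/tmp 2>xtrabackup.log \
  | xbcloud put --storage=s3 \
      --s3-bucket=operator-testing \
      --max-retries=10 \
      --max-backoff=300000 \
      --timeout=120 \
      cluster1-manual-full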

Przemyslaw Malkowski October 13, 2023 at 2:08 PM

Indeed, that should solve the issue. I can see this was just implemented in the latest xbcloud version:

https://jira.percona.com/browse/PXB-3073

So, to solve that issue for S3 upload, make sure to use that option in the next operator version, based on PXB 8.0.34.

Done

Details

Assignee

Reporter

Needs QA

Yes

Fix versions

Affects versions

Priority


Created July 12, 2023 at 10:29 AM
Updated March 5, 2024 at 5:24 PM
Resolved February 16, 2024 at 10:08 AM
