The operator should use a supported/documented backup method

Description

Currently, we perform backup using the following steps:

1. start up garbd
2. after waiting 20 seconds, kill garbd (we assume that within 20 seconds it was able to connect to the donor and send an SST request; we also assume that the donor won't cancel the SST request after garbd disappears)
3. start listening on port 4444 with socat, and hope that the sst script is forgiving and is still trying to connect and send the data (this relies on the same assumptions as step 2)
4. based on xtrabackup sst script internals, receive first the sst info, and then the actual backup data
5. exit, backup is done

From this list, step 4 is undocumented, and we can easily break it by accident, but at least we know that it works because we developed both scripts. We can depend on it if we document this behavior somewhere.

However, the fact that the first 3 steps work is pure luck: it depends on the system configuration and state (e.g. timeout values, load, ...).
For example, if the donor is too quick (and no longer tries to connect after 20 seconds), the backup will fail.
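
The timing dependency can be made concrete with a toy model. This is only a sketch: the function name and the donor timings are hypothetical, and the only number taken from the steps above is the 20 second wait. It does not describe how the sst script actually retries.

```python
def listener_in_time(listen_start_s: float,
                     donor_first_attempt_s: float,
                     donor_retry_window_s: float) -> bool:
    """Return True if socat starts listening before the donor stops trying.

    All three values are seconds since garbd was started. The donor values
    are assumptions for illustration, not measured or documented figures.
    """
    donor_gives_up_s = donor_first_attempt_s + donor_retry_window_s
    return listen_start_s <= donor_gives_up_s


GARBD_WAIT_S = 20.0  # fixed wait from step 2; socat starts listening after it

# Hypothetical donors: one keeps retrying for a minute, one gives up quickly.
for retry_window_s in (60.0, 5.0):
    ok = listener_in_time(GARBD_WAIT_S,
                          donor_first_attempt_s=2.0,
                          donor_retry_window_s=retry_window_s)
    outcome = "works" if ok else "fails"
    print(f"donor retries for {retry_window_s:.0f}s -> backup {outcome}")
```

Nothing in the current procedure controls the donor side of this comparison, which is why the result depends on configuration and load.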

Also, this is unsupported for multiple reasons:

  • According to Codership, garbd can be used for taking backups, but the documentation doesn't mention killing garbd before the process completes (or even really starts). In this thread, Alexey Yuchenko states that you have to set up the receiver node before starting garbd: https://groups.google.com/forum/#!topic/codership-team/icustzu0jqA ("Backup host must be prepared manually before running garbd in this case.")

  • The Codership example also uses a custom SST script for backing up the data, not the normal SST script used for node setup. This way, the backup process knows exactly what is in the stream.

Possible solutions:

  • we can either use xtrabackup directly for backup (based on the Slack discussion, we also want to support taking backups of other MySQL setups, and this would also work with MariaDB, so we wouldn't have to develop multiple backup scripts)

  • or use a custom SST script as documented in the Galera documentation. Based on the docs, we can even skip sending the data over the network, and save everything to S3 directly on the donor node. This process is documented here: https://galeracluster.com/library/documentation/scriptable-sst.html

Note: based on the galera documentation, and the galera mailing lists, we have to execute and manage multiple processes for taking backups, no matter the method we use.

  • When using xtrabackup, we have to manually execute xtrabackup on the pxc node

  • When saving to S3 on a backup node, the backup node has to run garbd and socat at the same time. What the backup script currently does is unsupported by galera.

  • When saving directly to S3 on the pxc node, managed by garbd on the backup node, that still ends up as 3 processes on 2 pods.
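
As a rough sketch of what managing these processes could look like on the backup node: start the receiver first, then the requester, and report success only if both exit cleanly. The function name `run_backup` and its parameters are hypothetical, and the real garbd and socat command lines are deliberately left out; the demo at the bottom uses placeholder commands.

```python
import subprocess
import sys
from typing import Sequence


def run_backup(receiver_cmd: Sequence[str],
               requester_cmd: Sequence[str],
               timeout_s: float = 3600.0) -> bool:
    """Run a receiver and a requester side by side.

    receiver_cmd stands in for the process that receives the stream,
    requester_cmd for the process that asks the donor for an SST.
    Returns True only if both processes exit with status 0.
    """
    # The receiver is started first so that it is ready before the request.
    receiver = subprocess.Popen(receiver_cmd)
    try:
        requester = subprocess.Popen(requester_cmd)
    except OSError:
        receiver.kill()
        receiver.wait()
        raise
    try:
        if requester.wait(timeout=timeout_s) != 0:
            return False  # the request failed, so no data will arrive
        return receiver.wait(timeout=timeout_s) == 0
    except subprocess.TimeoutExpired:
        return False
    finally:
        # Never leave a half-finished backup process behind.
        for proc in (requester, receiver):
            if proc.poll() is None:
                proc.kill()
            proc.wait()


if __name__ == "__main__":
    # Placeholder commands that exit immediately, for demonstration only.
    noop = [sys.executable, "-c", "pass"]
    print(run_backup(noop, noop))
```

The point of the sketch is only that there is no fixed sleep: the outcome comes from the exit status of both processes rather than from timing.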

Environment

None

Activity

Mykola Marzhan December 2, 2020 at 10:31 AM

I believe we can use the current implementation until we need some advanced features in the backup, like "incremental backup support".
As soon as the Cloud Team needs to modify the wsrep_sst_xtrabackup-v2.sh script, we will start using our own "custom sst script", as suggested by Zsolt in the ticket description.
If the Cloud Team doesn't need changes to wsrep_sst_xtrabackup-v2.sh, we can use the default wsrep_sst_xtrabackup-v2.sh inside PXC.

Sergey Pronin December 1, 2020 at 11:40 AM

As long was released - is there anything to do here?

Mykola Marzhan March 23, 2020 at 10:03 AM


I believe it is a brilliant idea to run the joiner script from garbd and wait until the backup ends successfully.
Details: https://jira.percona.com/browse/PXC-2604?focusedCommentId=253009&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-253009

Zsolt Parragi March 23, 2020 at 8:57 AM

As discussed on Slack: I was completely misled by the "SST transfer completed successfully" message in the garbd logs. It looks like garbd always prints this, before the SST transfer actually starts at all. I simply assumed that the log file didn't lie, and that garbd really waits until the SST transfer completes. It turns out that's not the case, sorry about that.

On the bright side: it looks like implementing this feature properly, with a separate command line switch, is possible: garbd would only exit after the SST process has actually completed (either successfully or with an error).

But this would also mean that garbd would have to remain running until the SST process completes. This would also solve the issue you linked.

Would this be useful for the operator?

Mykola Marzhan March 23, 2020 at 7:25 AM

Re: custom script: we already have a well-defined protocol that does everything needed. You just need to add me to the reviewers list when we change the sst script.

Re: first solution: we are not going to support group replication in Kubernetes Operator for Percona XtraDB Cluster.

Won't Do

Created March 23, 2020 at 6:16 AM
Updated March 5, 2024 at 6:18 PM
Resolved March 11, 2021 at 2:04 PM