PXB stopping SQL thread while backing up replica

General

Escalation

General

Escalation

Description

I'm executing an xtrabackup and it is stopping the SQL thread on my replica right at the start of the backup. The SQL thread remains stopped until I restart it.

My guess is that it is the --safe-slave-backup option causing the issue, but I haven't tested to confirm.

Running the following xtrabackup command: xtrabackup --defaults-extra-file=/data/etc/mysql/bkup.cnf --backup --slave-info --safe-slave-backup --stream=xbstream --parallel=4 --databases-exclude=_pending_drops | pigz --fast -p 1 -c > /data/backups/20230413122358/innobackup_vam-mysql-jaytest-002-02.iad3b.square-20230413122358.xbstream.gz
2023-04-13T12:24:00.358669-00:00 0 [Note] [MY-011825] [Xtrabackup] recognized server arguments: --datadir=/data/mysql --tmpdir=/data/tmp --server-id=170033278 --log_bin=mysql-bin --innodb_buffer_pool_size=50G --innodb_flush_method=O_DIRECT
2023-04-13T12:24:00.359652-00:00 0 [Note] [MY-011825] [Xtrabackup] recognized client arguments: --port=3306 --socket=/var/lib/mysql/mysql.sock --user=bkup --password=* --user=bkup --password=* --backup=1 --slave-info=1 --safe-slave-backup=1 --stream=xbstream --parallel=4 --databases-exclude=_pending_drops
xtrabackup version 8.0.31-24 based on MySQL server 8.0.31 Linux (x86_64) (revision id: f0754edb)
230413 12:24:00 version_check Connecting to MySQL server with DSN 'dbi:mysql:;mysql_read_default_group=xtrabackup;port=3306;mysql_socket=/var/lib/mysql/mysql.sock' as 'bkup' (using password: YES).
230413 12:24:00 version_check Connected to MySQL server
230413 12:24:00 version_check Executing a version check against the server...
230413 12:24:00 version_check Done.
2023-04-13T12:24:00.494339-00:00 0 [Note] [MY-011825] [Xtrabackup] Connecting to MySQL server host: localhost, user: bkup, password: set, port: 3306, socket: /var/lib/mysql/mysql.sock
2023-04-13T12:24:00.520095-00:00 0 [Note] [MY-011825] [Xtrabackup] Using server version 8.0.31-23
2023-04-13T12:24:00.552525-00:00 0 [Note] [MY-011825] [Xtrabackup] Slave open temp tables: 0
2023-04-13T12:24:00.554678-00:00 0 [Note] [MY-011825] [Xtrabackup] Slave is safe to backup.
2023-04-13T12:24:00.554789-00:00 0 [Note] [MY-011825] [Xtrabackup] Executing LOCK TABLES FOR BACKUP ...
2023-04-13T12:24:00.564691-00:00 0 [Note] [MY-011825] [Xtrabackup] uses posix_fadvise().
2023-04-13T12:24:00.564749-00:00 0 [Note] [MY-011825] [Xtrabackup] cd to /data/mysql

Environment

xtrabackup version 8.0.31-24 based on MySQL server 8.0.31 Linux (x86_64) (revision id: f0754edb)
Server version: 8.0.31-23 Percona Server (GPL), Release 23.sq, Revision 771bf6f2ea43b51760313b09218ab30bb6b4156a

Linked work items

relates to

PXB-2867

Return the behavior of safe-slave-backup before the lock-ddl implementation

PXB-3033

Execute STOP SLAVE conditionally depending on --lock-ddl

Activity

Aaditya Dubey

January 17, 2024 at 3:17 PM

Hi @Jay Janssen

Please find the answers regarding why SQL thread is stopped during entire backup process:

The `--safe-slave-backup` option¶

In order to assure a consistent replication state, this option stops the replication SQL thread and waits to start backing up until Slave_open_temp_tables in SHOW STATUS is zero. If there are no open temporary tables, the backup will take place, otherwise the SQL thread will be started and stopped until there are no open temporary tables. The backup will fail if Slave_open_temp_tables does not become zero after --safe-slave-backup-timeout seconds (defaults to 300 seconds). The replication SQL thread will be restarted when the backup finishes.

Reference: https://docs.percona.com/percona-xtrabackup/8.0/make-backup-in-replication-env.html

Bug is there indeed if XtraBackup is somehow failed due to any reason then “Replica_SQL_Running: No“ won't be started again automatically. Please note this is the case of safe backup option:

When using PS 8.0.31 & PXB 8.0.31

Sysbench started:

$ sysbench /usr/share/sysbench/oltp_read_write.lua --mysql_storage_engine=innodb --table-size=1000 --tables=10 --mysql-db=test --mysql-user=msandbox --mysql-password=msandbox --mysql-socket=/tmp/mysql_sandbox22133.sock --threads=15 --time=0 --report-interval=1 --events=0 --db-driver=mysql run
WARNING: Both event and time limits are disabled, running an endless test
sysbench 1.0.20 (using bundled LuaJIT 2.1.0-beta2)

Running the test with following options:
Number of threads: 15
Report intermediate results every 1 second(s)
Initializing random number generator from current time


Initializing worker threads...

Threads started!

...

BackUp Started :

$ ./xtrabackup --user=msandbox --password=msandbox --socket=/tmp/mysql_sandbox22133.sock --backup --slave-info --safe-slave-backup --stream=xbstream --parallel=4 --datadir=$HOME/sandboxes/rsandbox_ps8_0_31/node1/data/ --target-dir=$HOME/backup_full/node1/data/

...

Killed in-Between

slave1 [localhost:22133] {msandbox} ((none)) > show replica status\G
*************************** 1. row ***************************
             Replica_IO_State: Waiting for source to send event
                  Source_Host: 127.0.0.1
                  Source_User: rsandbox
                  Source_Port: 22132
                Connect_Retry: 60
              Source_Log_File: mysql-bin.000001
          Read_Source_Log_Pos: 8330
               Relay_Log_File: mysql-relay.000002
                Relay_Log_Pos: 8546
        Relay_Source_Log_File: mysql-bin.000001
           Replica_IO_Running: Yes
          Replica_SQL_Running: No
              Replicate_Do_DB: 
          Replicate_Ignore_DB: 
           Replicate_Do_Table: 
       Replicate_Ignore_Table: 
      Replicate_Wild_Do_Table: 
  Replicate_Wild_Ignore_Table: 
                   Last_Errno: 0
                   Last_Error: 
                 Skip_Counter: 0
          Exec_Source_Log_Pos: 8330
              Relay_Log_Space: 8752
              Until_Condition: None
               Until_Log_File: 
                Until_Log_Pos: 0
           Source_SSL_Allowed: No
           Source_SSL_CA_File: 
           Source_SSL_CA_Path: 
              Source_SSL_Cert: 
            Source_SSL_Cipher: 
               Source_SSL_Key: 
        Seconds_Behind_Source: NULL
Source_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 0
                Last_IO_Error: 
               Last_SQL_Errno: 0
               Last_SQL_Error: 
  Replicate_Ignore_Server_Ids: 
             Source_Server_Id: 100
                  Source_UUID: 00022132-1111-1111-1111-111111111111
             Source_Info_File: mysql.slave_master_info
                    SQL_Delay: 0
          SQL_Remaining_Delay: NULL
    Replica_SQL_Running_State: 
           Source_Retry_Count: 86400
                  Source_Bind: 
      Last_IO_Error_Timestamp: 
     Last_SQL_Error_Timestamp: 
               Source_SSL_Crl: 
           Source_SSL_Crlpath: 
           Retrieved_Gtid_Set: 
            Executed_Gtid_Set: 
                Auto_Position: 0
         Replicate_Rewrite_DB: 
                 Channel_Name: 
           Source_TLS_Version: 
       Source_public_key_path: 
        Get_Source_public_key: 1
            Network_Namespace: 
1 row in set (0.00 sec)

It is stopped & not starting anymore! Need manual startUp

Same behaviour exists with PXB 8.0.35 as well 

sending the concern to engineering for further review and updates.

Aaditya Dubey

May 18, 2023 at 8:59 AM

Hi @Jay Janssen,

Thank you for the updates, Will take further look and see if this can be repeated.

Jay Janssen

May 17, 2023 at 2:34 PM

Hi, I believe the scenario requires xtrabackup to exit 1 somehow. If you look at my log above, xtrabackup got this error when I observed the behavior.

2023-04-12T15:56:10.668127-00:00 1 [ERROR] [MY-012894] [InnoDB] Unable to open './#innodb_redo/#ib_redo8729' (error: 1504).

Can you trigger an error exit by sending XB a signal perhaps?

Aaditya Dubey

May 17, 2023 at 1:18 PM

Hi @Jay Janssen,

Thank you for the report.
Unfortunately this issue is not repeating from my end, Please find the test case below:

$ xtrabackup --version
2023-05-17T13:13:09.229434-00:00 0 [Note] [MY-011825] [Xtrabackup] recognized server arguments: --datadir=/var/lib/mysql 
xtrabackup version 8.0.31-24 based on MySQL server 8.0.31 Linux (x86_64) (revision id: f0754edb)

slave1 [localhost:22133] {msandbox} ((none)) > select version();
+-----------+
| version() |
+-----------+
| 8.0.31-23 |
+-----------+
1 row in set (0.00 sec)

Executed R/W load on master server using below sysbench command:

sysbench /usr/share/sysbench/oltp_read_write.lua --mysql_storage_engine=innodb --table-size=1000 --tables=10 --mysql-db=test --mysql-user=xxxx --mysql-password=xxxx --mysql-socket=/tmp/mysql_sandbox22132.sock --threads=15 --time=0 --report-interval=1 --events=0 --db-driver=mysql run

Backup Command:

$ xtrabackup --user=xxxx --password=xxxx --socket=/tmp/mysql_sandbox22133.sock --backup --slave-info --safe-slave-backup --stream=xbstream --parallel=4 --datadir=$HOME/sandboxes/rsandbox_ps8_0_31/node1/data/ --target-dir=$HOME/backup_full/node1/data/

...
XBSTCK01Extrabackup_infoXBSTCK01Pxtrabackup_tablespaces''?4{"version":3,"external_tablespaces":[]}XBSTCK01Extrabackup_tablespaces2023-05-17T13:07:14.414684-00:00 0 [Note] [MY-011825] [Xtrabackup] Transaction log of lsn (30844188) to (46045387) was copied.
2023-05-17T13:07:14.704853-00:00 0 [Note] [MY-011825] [Xtrabackup] completed OK!

Replica_SQL_Running: No -> Seen during backup but restarted after backup is completed.

slave1 [localhost:22133] {msandbox} ((none)) > show replica status\G
*************************** 1. row ***************************
             Replica_IO_State: Waiting for source to send event
                  Source_Host: 127.0.0.1
                  Source_User: rsandbox
                  Source_Port: 22132
                Connect_Retry: 60
              Source_Log_File: mysql-bin.000001
          Read_Source_Log_Pos: 48655530
               Relay_Log_File: mysql-relay.000002
                Relay_Log_Pos: 23568323
        Relay_Source_Log_File: mysql-bin.000001
           Replica_IO_Running: Yes
          Replica_SQL_Running: No
              Replicate_Do_DB: 
          Replicate_Ignore_DB: 
           Replicate_Do_Table: 
       Replicate_Ignore_Table: 
      Replicate_Wild_Do_Table: 
  Replicate_Wild_Ignore_Table: 
                   Last_Errno: 0
                   Last_Error: 
                 Skip_Counter: 0
          Exec_Source_Log_Pos: 23568107
              Relay_Log_Space: 48655952
              Until_Condition: None
               Until_Log_File: 
                Until_Log_Pos: 0
           Source_SSL_Allowed: No
           Source_SSL_CA_File: 
           Source_SSL_CA_Path: 
              Source_SSL_Cert: 
            Source_SSL_Cipher: 
               Source_SSL_Key: 
        Seconds_Behind_Source: NULL
Source_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 0
                Last_IO_Error: 
               Last_SQL_Errno: 0
               Last_SQL_Error: 
  Replicate_Ignore_Server_Ids: 
             Source_Server_Id: 100
                  Source_UUID: 00022132-1111-1111-1111-111111111111
             Source_Info_File: mysql.slave_master_info
                    SQL_Delay: 0
          SQL_Remaining_Delay: NULL
    Replica_SQL_Running_State: 
           Source_Retry_Count: 86400
                  Source_Bind: 
      Last_IO_Error_Timestamp: 
     Last_SQL_Error_Timestamp: 
               Source_SSL_Crl: 
           Source_SSL_Crlpath: 
           Retrieved_Gtid_Set: 
            Executed_Gtid_Set: 
                Auto_Position: 0
         Replicate_Rewrite_DB: 
                 Channel_Name: 
           Source_TLS_Version: 
       Source_public_key_path: 
        Get_Source_public_key: 1
            Network_Namespace: 
1 row in set (0.00 sec)

Please let us know if full reproducible test case is available.

Jay Janssen

April 17, 2023 at 2:44 PM

Got it, and that's a good point.

My main issue is more that the SQL thread is getting stopped for the entire backup now (though it not getting restarted after errors is an issue too). I presume this is because explicit temp tables are now Innodb-based.

The only way around replication being down this long I can see is to just not use `--safe-slave-backup` and trust that my users are not switching binlog formats in session as well as using explicit temp tables to load into permanent tables.

Resize issue view side panel

Details

Assignee

Unassigned

Reporter

Jay Janssen

Needs QA

Yes

Affects versions

8.0.31-24 (Q4 2022)

8.0.35-30

Priority

Medium

Created April 13, 2023 at 12:39 PM

Updated July 22, 2024 at 1:18 PM

PXB stopping SQL thread while backing up replica

Description

Environment

Linked work items

relates to

Activity

The --safe-slave-backup option¶

Details

Assignee

Reporter

Needs QA

Affects versions

Priority

The `--safe-slave-backup` option¶