Redo-log optimized DDL operation causing SST to fail

Description

A redo-optimized DDL operation (such as CREATE INDEX with a sorted index build) that does not flush the redo log immediately can cause SST to fail.

 

To reproduce:

  • Start a single node cluster

  • Create an initial data set to run the DDL against

    sysbench --threads=1 --rate=0 --report-interval=1 --percentile=99 --events=0 --time=0 --mysql-ignore-errors=all \
      --mysql-user=root --mysql-socket=/tmp/n1.sock /home/krunal.bauskar/tools/sysbench/install/share/sysbench/oltp_insert.lua \
      --mysql-db=test1 --tables=1 --table_size=2000000 prepare

    sysbench --threads=1 --rate=0 --report-interval=1 --percentile=99 --events=0 --time=0 --mysql-ignore-errors=all \
      --mysql-user=root --mysql-socket=/tmp/n1.sock /home/krunal.bauskar/tools/sysbench/install/share/sysbench/oltp_insert.lua \
      --mysql-db=test2 --tables=1 --table_size=2000000 prepare

  • Run the below script to trigger the workload on node-1

    #!/bin/bash
    echo "drop table if exists test1.sb1" | ./bin/mysql --user root -S /tmp/n1.sock -D test
    echo "create table test1.sb1 as select id,c from test1.sbtest1 where id < 150000;" | ./bin/mysql --user root -S /tmp/n1.sock -D test
    echo "create unique index ix on test1.sb1 (id)" | ./bin/mysql --user root -S /tmp/n1.sock -D test
    echo "show tables" | ./bin/mysql --user root -S /tmp/n1.sock -D test1
    sleep 1
    echo "drop table if exists test2.sb1" | ./bin/mysql --user root -S /tmp/n1.sock -D test
    echo "create table test2.sb1 as select id,c from test2.sbtest1 where id < 150000;" | ./bin/mysql --user root -S /tmp/n1.sock -D test
    echo "create unique index ix on test2.sb1 (id)" | ./bin/mysql --user root -S /tmp/n1.sock -D test
    echo "show tables" | ./bin/mysql --user root -S /tmp/n1.sock -D test2

    $ while true; do bash test.sh; done

  • Start node-2 with a clean data directory so that it joins using SST

  • The backup action on the donor (during SST) fails with the following error:

    InnoDB: An optimized (without redo logging) DDLoperation has been performed. All modified pages may not have been flushed to the disk yet. PXB will not be able take a consistent backup. Retry the backup operation
  • SST on JOINER is aborted

    2018-02-07T07:50:43.311902Z WSREP_SST: [INFO] Proceeding with SST.........
    2018-02-07T07:50:43.351317Z WSREP_SST: [INFO] ............Waiting for SST streaming to complete!
    2018-02-07T07:50:45.094016Z 0 [Note] WSREP: (9850103a, 'tcp://10.30.7.164:5030') turning message relay requesting off
    2018-02-07T07:51:00.019524Z WSREP_SST: [ERROR] ******************* FATAL ERROR **********************
    2018-02-07T07:51:00.020471Z WSREP_SST: [ERROR] xtrabackup_checkpoints missing. xtrabackup/SST failed on DONOR. Check DONOR log
    2018-02-07T07:51:00.021387Z WSREP_SST: [ERROR] ******************************************************
    2018-02-07T07:51:00.022508Z WSREP_SST: [ERROR] Cleanup after exit with status:2
    2018-02-07T07:51:00.028387Z 0 [Warning] WSREP: 1.0 (n1): State transfer to 0.0 (n2) failed: -22 (Invalid argument)
    2018-02-07T07:51:00.028428Z 0 [ERROR] WSREP: gcs/src/gcs_group.cpp:gcs_group_handle_join_msg():765: Will never receive state. Need to abort.
    2018-02-07T07:51:00.028463Z 0 [Note] WSREP: gcomm: terminating thread
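
When rerunning the reproduction in a loop, it can help to confirm that a failed SST is this failure and not an unrelated one. The following is a minimal sketch, assuming the donor's backup output and the joiner's error log are available as plain files. The function names and any log paths passed to them are hypothetical; the search strings are copied from the messages quoted in this report.

```shell
#!/bin/bash
# Sketch only: classify an SST failure from log files.
# Function names are hypothetical helpers, not part of PXC or PXB.
# The search strings are taken verbatim from the logs in this report.

# Returns 0 if the donor-side backup log contains the optimized-DDL abort.
donor_hit_optimized_ddl() {
    local donor_log="$1"
    # -F: fixed string, so the parentheses are matched literally.
    grep -qF "optimized (without redo logging) DDL" "$donor_log"
}

# Returns 0 if the joiner log shows the SST abort reported for a donor failure.
joiner_sst_aborted() {
    local joiner_log="$1"
    grep -qF "xtrabackup_checkpoints missing" "$joiner_log"
}
```

A call such as `donor_hit_optimized_ddl /path/to/donor-backup.log && echo "hit optimized-DDL abort"` would then flag this case; the path is a placeholder for wherever the donor writes its backup output.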

 

Environment

None

Activity

ethaniel 1 November 26, 2020 at 9:49 AM

I was able to fix this problem on my setup with:

sed -e 's/--lock-ddl /--lock-ddl-per-table /g' /bin/wsrep_sst_xtrabackup-v2

Although this is not the recommended way (the devs suggest cloning wsrep_sst_xtrabackup-v2 into a separate script), updating the main file didn't cause me any harm.
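
Note that, as written, the sed command above has no `-i` option, so it prints the modified script to stdout rather than changing the file. A minimal sketch of the clone-based variant the devs suggested might look like the following. It assumes the same substitution is what is wanted; `patch_sst_copy` and the destination path are hypothetical, and how the server is pointed at the copy is outside this sketch.

```shell
#!/bin/bash
# Sketch only: write a patched copy of the SST script and leave the
# original untouched. patch_sst_copy is a hypothetical helper name.
patch_sst_copy() {
    local src="$1"   # e.g. /bin/wsrep_sst_xtrabackup-v2
    local dst="$2"   # where to write the patched copy (hypothetical path)

    # Copy first so the destination inherits the mode of the original.
    cp -p "$src" "$dst" || return 1

    # Same expression as in the comment above. The trailing space in the
    # pattern means an existing "--lock-ddl-per-table " is not matched,
    # so running this twice does not change the result.
    sed -e 's/--lock-ddl /--lock-ddl-per-table /g' "$src" > "$dst"
}
```

Running `diff` on the original and the copy afterwards shows exactly which invocations were changed before the copy is put to use.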

ethaniel 1 November 25, 2020 at 8:21 PM

Looks like this bug (DDL during SST causes a block) is still active in 5.7.31:

https://jira.percona.com/browse/PXC-3489

(sorry for double comment)

ethaniel 1 November 25, 2020 at 8:21 PM

Looks like this bug (DDL during SST causes a block) is still active in 5.7.31.

Krunal Bauskar February 13, 2018 at 9:18 AM

Existing behavior:

  • If BACKUP LOCKS are active, subsequent DDL statements are blocked.

  • If a DDL statement is blocked, subsequent DML statements are blocked.

The fix for this bug retains the same behavior, except that XB now acquires BACKUP LOCKS even for InnoDB tables.

This means that if a user has DDL active on a node acting as DONOR, that DDL and the DML that follows it will be blocked too. This shouldn't be confused with BACKUP LOCKS blocking DML: the DML is blocked because of the DDL (due to the existing PXC dependency on blocked DDL).

Ramesh Sivaraman February 7, 2018 at 10:19 AM

I was able to reproduce the issue.

Testcase

1) started node1
2) initiated sysbench
3) started test.sh in the loop
4) started node2 with xtrabackup-v2

    2018-02-07T10:21:05.618828Z WSREP_SST: [INFO] Proceeding with SST.........
    2018-02-07T10:21:05.644466Z WSREP_SST: [INFO] ............Waiting for SST streaming to complete!
    2018-02-07T10:21:07.231497Z 0 [Note] WSREP: (99ec8c7d, 'tcp://127.0.0.1:13208') turning message relay requesting off
    2018-02-07T10:21:41.766929Z 0 [Warning] WSREP: 0.0 (qaserver-06): State transfer to 1.0 (qaserver-06) failed: -22 (Invalid argument)
    2018-02-07T10:21:41.766948Z 0 [ERROR] WSREP: gcs/src/gcs_group.cpp:gcs_group_handle_join_msg():765: Will never receive state. Need to abort.
    2018-02-07T10:21:41.767016Z 0 [Note] WSREP: gcomm: terminating thread
    2018-02-07T10:21:41.767029Z 0 [Note] WSREP: gcomm: joining thread
    2018-02-07T10:21:41.767082Z 0 [Note] WSREP: gcomm: closing backend
    2018-02-07T10:21:41.828995Z WSREP_SST: [ERROR] ******************* FATAL ERROR **********************
    2018-02-07T10:21:41.829642Z WSREP_SST: [ERROR] xtrabackup_checkpoints missing. xtrabackup/SST failed on DONOR. Check DONOR log
    2018-02-07T10:21:41.830297Z WSREP_SST: [ERROR] ******************************************************
    2018-02-07T10:21:41.831052Z WSREP_SST: [ERROR] Cleanup after exit with status:2
    2018-02-07T10:21:41.836445Z 0 [ERROR] WSREP: Process completed with error: wsrep_sst_xtrabackup-v2 --role 'joiner' --address '127.0.0.1' --datadir '/home/ramesh/workdir/pxc57/node2/' --defaults-file '' --defaults-group-suffix '.1' --parent '28342' '' : 2 (No such file or directory)
    2018-02-07T10:21:41.836464Z 0 [ERROR] WSREP: Failed to read uuid:seqno from joiner script.
    2018-02-07T10:21:41.836470Z 0 [ERROR] WSREP: SST script aborted with error 2 (No such file or directory)
    2018-02-07T10:21:41.836532Z 0 [ERROR] WSREP: SST failed: 2 (No such file or directory)
    2018-02-07T10:21:41.836547Z 0 [ERROR] Aborting

Done

Details

Time tracking

3h logged

Created February 7, 2018 at 7:57 AM
Updated November 26, 2020 at 9:49 AM
Resolved February 14, 2018 at 4:10 AM