Issues

Select view

Select search mode

 

Flow control flapping hangs the cluster

Done

Description

On PXC 8.0.36, a flapping flow control scenario may hang the cluster in a multi-writer environment. It also affects 5.7.44 and 5.7.25.

mysql> select @@version ; +-------------+ | @@version | +-------------+ | 8.0.36-28.1 | +-------------+

InnoDB status from the affected node shows threads in replicating state:

---TRANSACTION 8661588, ACTIVE 664 sec mysql tables in use 1, locked 1 2 lock struct(s), heap size 1128, 1 row lock(s), undo log entries 1 MySQL thread id 376522, OS thread handle 140572300867328, query id 2035388 localhost root wsrep: replicating and certifying write set UPDATE t2 SET c5 = now(), c6 = now() where c3 = 1 and c2 = 'test' ---TRANSACTION 8661517, ACTIVE 665 sec mysql tables in use 1, locked 1 1 lock struct(s), heap size 1128, 0 row lock(s), undo log entries 1 MySQL thread id 376445, OS thread handle 140571949283072, query id 2035084 localhost root wsrep: replicating and certifying write set insert into t1 (c2) values (1)

The receive queue does not show write-sets:

mysql> show global status like '%queue%'; +----------------------------+-----------+ | Variable_name | Value | +----------------------------+-----------+ | wsrep_local_send_queue | 2 | | wsrep_local_send_queue_max | 4 | | wsrep_local_send_queue_min | 0 | | wsrep_local_send_queue_avg | 0.0116917 | | wsrep_local_recv_queue | 0 | | wsrep_local_recv_queue_max | 10456 | | wsrep_local_recv_queue_min | 0 | | wsrep_local_recv_queue_avg | 1110.19 | +----------------------------+-----------+

And flow control is still active:

mysql> show global status like '%flow_control%'; +----------------------------------+------------------+ | Variable_name | Value | +----------------------------------+------------------+ | wsrep_flow_control_paused_ns | 2665586288434 | | wsrep_flow_control_paused | 0.492262 | | wsrep_flow_control_sent | 14251 | | wsrep_flow_control_recv | 14253 | | wsrep_flow_control_active | true | | wsrep_flow_control_requested | true | | wsrep_flow_control_interval | [ 10000, 10000 ] | | wsrep_flow_control_interval_low | 10000 | | wsrep_flow_control_interval_high | 10000 | | wsrep_flow_control_status | ON | +----------------------------------+------------------+

Node 2 and 3 also shows flow control as active:

mysql> show global status like '%flow_control%'; +----------------------------------+---------------+ | Variable_name | Value | +----------------------------------+---------------+ | wsrep_flow_control_paused_ns | 1690007980149 | | wsrep_flow_control_paused | 0.58526 | | wsrep_flow_control_sent | 0 | | wsrep_flow_control_recv | 3790 | | wsrep_flow_control_active | true | | wsrep_flow_control_requested | false | | wsrep_flow_control_interval | [ 500, 500 ] | | wsrep_flow_control_interval_low | 500 | | wsrep_flow_control_interval_high | 500 | | wsrep_flow_control_status | OFF | +----------------------------------+---------------+

Killing the threads doesn't fix the issue, the node needs to be restarted to fix the cluster:

*************************** 14. row *************************** Id: 376445 User: root Host: localhost db: test Command: Killed Time: 1188 State: wsrep: replicating and certifying write set Info: insert into t1 (c2) values (1) Time_ms: 1188294 Rows_sent: 0 Rows_examined: 0 *************************** 15. row *************************** Id: 376522 User: root Host: localhost db: test Command: Killed Time: 1187 State: wsrep: replicating and certifying write set Info: UPDATE t2 SET c5 = now(), c6 = now() where c3 = 1 and c2 = 'test' Time_ms: 1187716 Rows_sent: 0 Rows_examined: 1

How to repeat:

  1. Use the attached my.cnf to create a 3 nodes PXC 8.0.36 cluster.

  2. Create the following tables:

use test; CREATE TABLE t1 ( c1 int NOT NULL AUTO_INCREMENT, c2 int DEFAULT NULL, PRIMARY KEY (c1) ) ENGINE=InnoDB AUTO_INCREMENT=1 DEFAULT CHARSET=utf8mb3; CREATE TABLE t2 ( c1 int unsigned NOT NULL AUTO_INCREMENT, c2 varchar(255) NOT NULL, c3 int unsigned NOT NULL, c4 timestamp NOT NULL DEFAULT '2000-01-01 00:00:00', c5 timestamp NOT NULL DEFAULT '2000-01-01 00:00:00', c6 timestamp NOT NULL DEFAULT '2000-01-01 00:00:00', c7 timestamp NOT NULL DEFAULT '2000-01-01 00:00:00', c8 timestamp NOT NULL DEFAULT '2000-01-01 00:00:00', c9 timestamp NOT NULL DEFAULT '2000-01-01 00:00:00', PRIMARY KEY (c1) ) ENGINE=InnoDB AUTO_INCREMENT=2 DEFAULT CHARSET=utf8mb3; insert into t2 values (1,'test',1,now(),now(),now(),now(),now(),now());
  1. On node 1, configure a 8M redo and strict durability settings:

innodb_log_file_size=8M sync_binlog=1 innodb_flush_log_at_trx_commit=1
  1. On node 1, run the following command to produce a flow control flapping behavior:

while true; do { mysql -e "set global wsrep_provider_options = 'gcs.fc_limit=1'"; sleep 0.1; mysql -e "set global wsrep_provider_options = 'gcs.fc_limit=10000'"; } done;

And run the following workload:

while true; do { mysql -B -N test -e "UPDATE t2 SET c5 = now(), c6 = now() where c3 = 1 and c2 = 'test'; select now(),c6 from t2;"; } done; while true; do { mysql -B -N test -e "UPDATE t2 SET c5 = now(), c6 = now() where c3 = 1 and c2 = 'test'; select now(),c6 from t2;"; } done; while true; do { mysql test -e "insert into t1 (c2) values (1)"; } done;
  1. On node 2, run the following commands:

while true; do { mysql test -e "UPDATE t2 SET c5 = now(), c6 = now() where c3 = 1 and c2 = 'test';"; } done; while true; do { mysql test -e "insert into t1 (c2) values (1)"; } done; while true; do { mysql test -e "insert into t1 (c2) values (1)"; } done;

Monitor the flow control on node 1, you may need adding more inserts in case the flapping happens between several seconds.

Since it’s a race condition, it may take seconds to minutes to trigger the bug.

Environment

None

AFFECTED CS IDs

CS0046107

Attachments

4
  • 18 Jul 2024, 09:01 AM
  • 18 Jul 2024, 09:00 AM
  • 18 Jul 2024, 09:00 AM
  • 13 Jul 2024, 02:24 AM

blocks

Details

Assignee

Reporter

Needs QA

Yes

In progress time

6.25

Time tracking

No time logged1w 1d 2h remaining

Sprint

Affects versions

Priority

Smart Checklist

Created July 13, 2024 at 2:24 AM
Updated December 23, 2024 at 11:40 AM
Resolved September 27, 2024 at 7:45 AM

Activity

Show:

Kamil Holubicki September 27, 2024 at 7:44 AM

, yes

Scott Hooper September 26, 2024 at 5:28 PM

Did this make it in the 8.0.37-29 released code?

Aaditya Dubey July 19, 2024 at 6:43 AM

Hi

Please find the steps below:

step1: Clone anydbver from https://github.com/ihanick/anydbver

step2: Navigate to following path and add the my.cnf options:

vim anydbver/configs/mysql/pxc8-repl-gtid.cnf

step3: Add following options to pxc8-repl-gtid.cnf and save-exit

userstat = ON gtid_mode = ON sync_binlog = 0 wait_timeout = 3600 innodb_flush_method = O_DIRECT_NO_FSYNC interactive_timeout = 3600 max_heap_table_size = 64M slave_parallel_type = LOGICAL_CLOCK wsrep_slave_threads = 10 innodb_log_file_size = 2G innodb_purge_threads = 8 innodb_lru_scan_depth = 2048 skip-external-locking transaction-isolation = READ-COMMITTED innodb_read_io_threads = 32 innodb_sync_array_size = 16 master_info_repository = TABLE slave_parallel_workers = 16 enforce-gtid-consistency innodb_log_files_in_group = 4 table_open_cache_instances = 16 innodb_buffer_pool_instances = 8 performance-schema-instrument = 'wait/io/socket/sql/%=ON' performance-schema-instrument = 'wait/lock/metadata/sql/mdl=ON' innodb_flush_log_at_trx_commit = 0 performance_schema_digests_size = 50000 performance-schema-consumer-statements_digest = ON performance-schema-consumer-events_waits_current = ON performance-schema-consumer-events_waits_history = ON performance-schema-consumer-global_instrumentation = ON performance-schema-consumer-thread_instrumentation = ON performance-schema-consumer-events_statements_current = ON performance-schema-consumer-events_statements_history = ON wsrep_provider_options = "gcache.size=1G;gcs.fc_limit=500;gmcast.segment=2;gcs.fc_master_slave=YES" plugin_load = "semisync_source.so;semisync_replica.so"

Step4: Deploy PXC 8.0.36 using anydbver by the following command:

./anydbver deploy pxc:8.0.36 node1 pxc:8.0.36 galera-master:default node2 pxc:8.0.36 galera-master:default

step5: connect to node1 and just type mysql in the node1 terminal and you will be in

cd anydbver ./anydbver ssh node1 Run following SQL: create database if not exists test; use test; CREATE TABLE t1 ( c1 int NOT NULL AUTO_INCREMENT, c2 int DEFAULT NULL, PRIMARY KEY (c1) ) ENGINE=InnoDB AUTO_INCREMENT=1 DEFAULT CHARSET=utf8mb3; CREATE TABLE t2 ( c1 int unsigned NOT NULL AUTO_INCREMENT, c2 varchar(255) NOT NULL, c3 int unsigned NOT NULL, c4 timestamp NOT NULL DEFAULT '2000-01-01 00:00:00', c5 timestamp NOT NULL DEFAULT '2000-01-01 00:00:00', c6 timestamp NOT NULL DEFAULT '2000-01-01 00:00:00', c7 timestamp NOT NULL DEFAULT '2000-01-01 00:00:00', c8 timestamp NOT NULL DEFAULT '2000-01-01 00:00:00', c9 timestamp NOT NULL DEFAULT '2000-01-01 00:00:00', PRIMARY KEY (c1) ) ENGINE=InnoDB AUTO_INCREMENT=2 DEFAULT CHARSET=utf8mb3; insert into t2 values (1,'test',1,now(),now(),now(),now(),now(),now());

step6: Now navigate to my.cnf file in node1 and add following parameters and save-exit

vim /etc/my.cnf innodb_log_file_size=8M sync_binlog=1 innodb_flush_log_at_trx_commit=1

step7: restart node1:

sudo systemctl restart mysql

step8: Run following commends in node1 terminal in background:

while true; do { mysql -e "set global wsrep_provider_options = 'gcs.fc_limit=1'"; sleep 0.1; mysql -e "set global wsrep_provider_options = 'gcs.fc_limit=10000'"; } done & while true; do { mysql -B -N test -e "UPDATE t2 SET c5 = now(), c6 = now() where c3 = 1 and c2 = 'test'; select now(),c6 from t2;"; } done & while true; do { mysql -B -N test -e "UPDATE t2 SET c5 = now(), c6 = now() where c3 = 1 and c2 = 'test'; select now(),c6 from t2;"; } done & while true; do { mysql test -e "insert into t1 (c2) values (1)"; } done &

step9:similar way run following set of command in node2:

./anydbver ssh node2 while true; do { mysql test -e "UPDATE t2 SET c5 = now(), c6 = now() where c3 = 1 and c2 = 'test';"; } done & while true; do { mysql test -e "insert into t1 (c2) values (1)"; } done & while true; do { mysql test -e "insert into t1 (c2) values (1)"; } done &

step10: let it run for a few seconds to minutes and connect to node1 mysql client and observe flows by using following set of commands accordingly:

show global status like '%flow_control%'; show global status like '%queue%'; show engine innodb status\G show full processlists\G

step11: Once you start seeing | wsrep_local_recv_queue     | 0          | try killing those queries and also check INNODB status where you will stuck transactions:

---TRANSACTION 128168, ACTIVE 272 sec mysql tables in use 1, locked 1 1 lock struct(s), heap size 1128, 0 row lock(s), undo log entries 1 MySQL thread id 69578, OS thread handle 140640859895552, query id 296093 localhost root wsrep: replicating and certifying write set insert into t1 (c2) values (1) ---TRANSACTION 128157, ACTIVE 763 sec mysql tables in use 1, locked 1 2 lock struct(s), heap size 1128, 1 row lock(s), undo log entries 1 MySQL thread id 62223, OS thread handle 140638903322368, query id 273955 localhost root wsrep: replicating and certifying write set UPDATE t2 SET c5 = now(), c6 = now() where c3 = 1 and c2 = 'test' ---TRANSACTION 128149, ACTIVE 763 sec mysql tables in use 1, locked 1 1 lock struct(s), heap size 1128, 0 row lock(s), undo log entries 1 MySQL thread id 62214, OS thread handle 140640857782016, query id 273922 localhost root wsrep: replicating and certifying write set insert into t1 (c2) values (1) *************************** 14. row *************************** Id: 62214 User: root Host: localhost db: test Command: Killed Time: 804 State: wsrep: replicating and certifying write set Info: insert into t1 (c2) values (1) Time_ms: 803418 Rows_sent: 0 Rows_examined: 0 *************************** 15. row *************************** Id: 62223 User: root Host: localhost db: test Command: Killed Time: 803 State: wsrep: replicating and certifying write set Info: UPDATE t2 SET c5 = now(), c6 = now() where c3 = 1 and c2 = 'test' Time_ms: 803141 Rows_sent: 0 Rows_examined: 1 *************************** 16. row ***************************

Aaditya Dubey July 18, 2024 at 5:37 PM

Hi

I’m able to repeat the behaviour as described.

---TRANSACTION 128168, ACTIVE 272 sec mysql tables in use 1, locked 1 1 lock struct(s), heap size 1128, 0 row lock(s), undo log entries 1 MySQL thread id 69578, OS thread handle 140640859895552, query id 296093 localhost root wsrep: replicating and certifying write set insert into t1 (c2) values (1) ---TRANSACTION 128157, ACTIVE 763 sec mysql tables in use 1, locked 1 2 lock struct(s), heap size 1128, 1 row lock(s), undo log entries 1 MySQL thread id 62223, OS thread handle 140638903322368, query id 273955 localhost root wsrep: replicating and certifying write set UPDATE t2 SET c5 = now(), c6 = now() where c3 = 1 and c2 = 'test' ---TRANSACTION 128149, ACTIVE 763 sec mysql tables in use 1, locked 1 1 lock struct(s), heap size 1128, 0 row lock(s), undo log entries 1 MySQL thread id 62214, OS thread handle 140640857782016, query id 273922 localhost root wsrep: replicating and certifying write set insert into t1 (c2) values (1) *************************** 14. row *************************** Id: 62214 User: root Host: localhost db: test Command: Killed Time: 804 State: wsrep: replicating and certifying write set Info: insert into t1 (c2) values (1) Time_ms: 803418 Rows_sent: 0 Rows_examined: 0 *************************** 15. row *************************** Id: 62223 User: root Host: localhost db: test Command: Killed Time: 803 State: wsrep: replicating and certifying write set Info: UPDATE t2 SET c5 = now(), c6 = now() where c3 = 1 and c2 = 'test' Time_ms: 803141 Rows_sent: 0 Rows_examined: 1 *************************** 16. row ***************************

Kamil Holubicki July 18, 2024 at 8:59 AM

Hi , Unfortunately, I’m not able to reproduce. I tried for several hours and nothing

Here is my setup:

  1. PXC 8.0.36

  2. Use the config file attached n1.cnf (modify to node2 and node3 according to the comments around line 46

  3. Start the cluster of 3 nodes

  4. Start node-1-run.sh

  5. Wait until db is set up and the workload starts

  6. Start node-2-run.sh

  7. Wait

 

I tried with a different number of insert workloads as suggested, but unfortunately, I’m not able to reproduce the issue. Maybe I’m doing something wrong?