Issues
- Flow control flapping hangs the clusterPXC-4453Resolved issue: PXC-4453Kamil Holubicki
- Telemetry Ph1 - adjustments needed for PXCPXC-4436Resolved issue: PXC-4436Kamil Holubicki
- Garbd terminates SST script's children after SST is completePXC-4428Resolved issue: PXC-4428Kamil Holubicki
- PXC 8.4.0PXC-4408Resolved issue: PXC-4408Kamil Holubicki
- Innodb semaphore wait timeout failure after upgrade from 8.0.34 to 8.0.35PXC-4367Resolved issue: PXC-4367Kamil Holubicki
- PXC node evicted when create function by user don't have super privilege and binary loggin is enabledPXC-4362Resolved issue: PXC-4362Kamil Holubicki
- Cluster state interruption with MDL BF-BF conflict and exec-mode:toiPXC-4348Resolved issue: PXC-4348Kamil Holubicki
- Statements executed in RSU mode generate local GTID eventsPXC-4313Resolved issue: PXC-4313parveez.baig
- ALTER TABLE causes wsrep_cluster_status DisconnectedPXC-4298Resolved issue: PXC-4298
Flow control flapping hangs the cluster
Description
Environment
AFFECTED CS IDs
Attachments
blocks
Details
Details
Assignee
Reporter
Labels
Needs QA
In progress time
Time tracking
Sprint
Fix versions
Affects versions
Priority
Smart Checklist
Smart Checklist
Activity
Kamil Holubicki September 27, 2024 at 7:44 AM
@Scott Hooper , yes
Scott Hooper September 26, 2024 at 5:28 PM
Did this make it in the 8.0.37-29 released code?
Aaditya Dubey July 19, 2024 at 6:43 AM
Hi @Kamil Holubicki
Please find the steps below:
step1:
Clone anydbver from https://github.com/ihanick/anydbver
step2:
Navigate to following path and add the my.cnf options:
vim anydbver/configs/mysql/pxc8-repl-gtid.cnf
step3:
Add following options to pxc8-repl-gtid.cnf
and save-exit
userstat = ON
gtid_mode = ON
sync_binlog = 0
wait_timeout = 3600
innodb_flush_method = O_DIRECT_NO_FSYNC
interactive_timeout = 3600
max_heap_table_size = 64M
slave_parallel_type = LOGICAL_CLOCK
wsrep_slave_threads = 10
innodb_log_file_size = 2G
innodb_purge_threads = 8
innodb_lru_scan_depth = 2048
skip-external-locking
transaction-isolation = READ-COMMITTED
innodb_read_io_threads = 32
innodb_sync_array_size = 16
master_info_repository = TABLE
slave_parallel_workers = 16
enforce-gtid-consistency
innodb_log_files_in_group = 4
table_open_cache_instances = 16
innodb_buffer_pool_instances = 8
performance-schema-instrument = 'wait/io/socket/sql/%=ON'
performance-schema-instrument = 'wait/lock/metadata/sql/mdl=ON'
innodb_flush_log_at_trx_commit = 0
performance_schema_digests_size = 50000
performance-schema-consumer-statements_digest = ON
performance-schema-consumer-events_waits_current = ON
performance-schema-consumer-events_waits_history = ON
performance-schema-consumer-global_instrumentation = ON
performance-schema-consumer-thread_instrumentation = ON
performance-schema-consumer-events_statements_current = ON
performance-schema-consumer-events_statements_history = ON
wsrep_provider_options = "gcache.size=1G;gcs.fc_limit=500;gmcast.segment=2;gcs.fc_master_slave=YES"
plugin_load = "semisync_source.so;semisync_replica.so"
Step4:
Deploy PXC 8.0.36 using anydbver by the following command:
./anydbver deploy pxc:8.0.36 node1 pxc:8.0.36 galera-master:default node2 pxc:8.0.36 galera-master:default
step5:
connect to node1 and just type mysql
in the node1 terminal and you will be in
cd anydbver
./anydbver ssh node1
Run following SQL:
create database if not exists test;
use test;
CREATE TABLE t1 (
c1 int NOT NULL AUTO_INCREMENT,
c2 int DEFAULT NULL,
PRIMARY KEY (c1)
) ENGINE=InnoDB AUTO_INCREMENT=1 DEFAULT CHARSET=utf8mb3;
CREATE TABLE t2 (
c1 int unsigned NOT NULL AUTO_INCREMENT,
c2 varchar(255) NOT NULL,
c3 int unsigned NOT NULL,
c4 timestamp NOT NULL DEFAULT '2000-01-01 00:00:00',
c5 timestamp NOT NULL DEFAULT '2000-01-01 00:00:00',
c6 timestamp NOT NULL DEFAULT '2000-01-01 00:00:00',
c7 timestamp NOT NULL DEFAULT '2000-01-01 00:00:00',
c8 timestamp NOT NULL DEFAULT '2000-01-01 00:00:00',
c9 timestamp NOT NULL DEFAULT '2000-01-01 00:00:00',
PRIMARY KEY (c1)
) ENGINE=InnoDB AUTO_INCREMENT=2 DEFAULT CHARSET=utf8mb3;
insert into t2 values (1,'test',1,now(),now(),now(),now(),now(),now());
step6:
Now navigate to my.cnf file in node1 and add following parameters and save-exit
vim /etc/my.cnf
innodb_log_file_size=8M
sync_binlog=1
innodb_flush_log_at_trx_commit=1
step7:
restart node1:
sudo systemctl restart mysql
step8:
Run following commends in node1 terminal in background:
while true; do { mysql -e "set global wsrep_provider_options = 'gcs.fc_limit=1'"; sleep 0.1; mysql -e "set global wsrep_provider_options = 'gcs.fc_limit=10000'"; } done &
while true; do { mysql -B -N test -e "UPDATE t2 SET c5 = now(), c6 = now() where c3 = 1 and c2 = 'test'; select now(),c6 from t2;"; } done &
while true; do { mysql -B -N test -e "UPDATE t2 SET c5 = now(), c6 = now() where c3 = 1 and c2 = 'test'; select now(),c6 from t2;"; } done &
while true; do { mysql test -e "insert into t1 (c2) values (1)"; } done &
step9:
similar way run following set of command in node2:
./anydbver ssh node2
while true; do { mysql test -e "UPDATE t2 SET c5 = now(), c6 = now() where c3 = 1 and c2 = 'test';"; } done &
while true; do { mysql test -e "insert into t1 (c2) values (1)"; } done &
while true; do { mysql test -e "insert into t1 (c2) values (1)"; } done &
step10:
let it run for a few seconds to minutes and connect to node1 mysql client and observe flows by using following set of commands accordingly:
show global status like '%flow_control%';
show global status like '%queue%';
show engine innodb status\G
show full processlists\G
step11:
Once you start seeing | wsrep_local_recv_queue | 0 |
try killing those queries and also check INNODB status where you will stuck transactions:
---TRANSACTION 128168, ACTIVE 272 sec
mysql tables in use 1, locked 1
1 lock struct(s), heap size 1128, 0 row lock(s), undo log entries 1
MySQL thread id 69578, OS thread handle 140640859895552, query id 296093 localhost root wsrep: replicating and certifying write set
insert into t1 (c2) values (1)
---TRANSACTION 128157, ACTIVE 763 sec
mysql tables in use 1, locked 1
2 lock struct(s), heap size 1128, 1 row lock(s), undo log entries 1
MySQL thread id 62223, OS thread handle 140638903322368, query id 273955 localhost root wsrep: replicating and certifying write set
UPDATE t2 SET c5 = now(), c6 = now() where c3 = 1 and c2 = 'test'
---TRANSACTION 128149, ACTIVE 763 sec
mysql tables in use 1, locked 1
1 lock struct(s), heap size 1128, 0 row lock(s), undo log entries 1
MySQL thread id 62214, OS thread handle 140640857782016, query id 273922 localhost root wsrep: replicating and certifying write set
insert into t1 (c2) values (1)
*************************** 14. row ***************************
Id: 62214
User: root
Host: localhost
db: test
Command: Killed
Time: 804
State: wsrep: replicating and certifying write set
Info: insert into t1 (c2) values (1)
Time_ms: 803418
Rows_sent: 0
Rows_examined: 0
*************************** 15. row ***************************
Id: 62223
User: root
Host: localhost
db: test
Command: Killed
Time: 803
State: wsrep: replicating and certifying write set
Info: UPDATE t2 SET c5 = now(), c6 = now() where c3 = 1 and c2 = 'test'
Time_ms: 803141
Rows_sent: 0
Rows_examined: 1
*************************** 16. row ***************************
Aaditya Dubey July 18, 2024 at 5:37 PM
Hi @Kamil Holubicki
I’m able to repeat the behaviour as described.
---TRANSACTION 128168, ACTIVE 272 sec
mysql tables in use 1, locked 1
1 lock struct(s), heap size 1128, 0 row lock(s), undo log entries 1
MySQL thread id 69578, OS thread handle 140640859895552, query id 296093 localhost root wsrep: replicating and certifying write set
insert into t1 (c2) values (1)
---TRANSACTION 128157, ACTIVE 763 sec
mysql tables in use 1, locked 1
2 lock struct(s), heap size 1128, 1 row lock(s), undo log entries 1
MySQL thread id 62223, OS thread handle 140638903322368, query id 273955 localhost root wsrep: replicating and certifying write set
UPDATE t2 SET c5 = now(), c6 = now() where c3 = 1 and c2 = 'test'
---TRANSACTION 128149, ACTIVE 763 sec
mysql tables in use 1, locked 1
1 lock struct(s), heap size 1128, 0 row lock(s), undo log entries 1
MySQL thread id 62214, OS thread handle 140640857782016, query id 273922 localhost root wsrep: replicating and certifying write set
insert into t1 (c2) values (1)
*************************** 14. row ***************************
Id: 62214
User: root
Host: localhost
db: test
Command: Killed
Time: 804
State: wsrep: replicating and certifying write set
Info: insert into t1 (c2) values (1)
Time_ms: 803418
Rows_sent: 0
Rows_examined: 0
*************************** 15. row ***************************
Id: 62223
User: root
Host: localhost
db: test
Command: Killed
Time: 803
State: wsrep: replicating and certifying write set
Info: UPDATE t2 SET c5 = now(), c6 = now() where c3 = 1 and c2 = 'test'
Time_ms: 803141
Rows_sent: 0
Rows_examined: 1
*************************** 16. row ***************************
Kamil Holubicki July 18, 2024 at 8:59 AM
Hi @Aaditya Dubey @Juan Arruti , Unfortunately, I’m not able to reproduce. I tried for several hours and nothing
Here is my setup:
PXC 8.0.36
Use the config file attached n1.cnf (modify to node2 and node3 according to the comments around line 46
Start the cluster of 3 nodes
Start node-1-run.sh
Wait until db is set up and the workload starts
Start node-2-run.sh
Wait
I tried with a different number of insert workloads as suggested, but unfortunately, I’m not able to reproduce the issue. Maybe I’m doing something wrong?
On PXC 8.0.36, a flapping flow control scenario may hang the cluster in a multi-writer environment. It also affects 5.7.44 and 5.7.25.
mysql> select @@version ; +-------------+ | @@version | +-------------+ | 8.0.36-28.1 | +-------------+
InnoDB status from the affected node shows threads in replicating state:
---TRANSACTION 8661588, ACTIVE 664 sec mysql tables in use 1, locked 1 2 lock struct(s), heap size 1128, 1 row lock(s), undo log entries 1 MySQL thread id 376522, OS thread handle 140572300867328, query id 2035388 localhost root wsrep: replicating and certifying write set UPDATE t2 SET c5 = now(), c6 = now() where c3 = 1 and c2 = 'test' ---TRANSACTION 8661517, ACTIVE 665 sec mysql tables in use 1, locked 1 1 lock struct(s), heap size 1128, 0 row lock(s), undo log entries 1 MySQL thread id 376445, OS thread handle 140571949283072, query id 2035084 localhost root wsrep: replicating and certifying write set insert into t1 (c2) values (1)
The receive queue does not show write-sets:
mysql> show global status like '%queue%'; +----------------------------+-----------+ | Variable_name | Value | +----------------------------+-----------+ | wsrep_local_send_queue | 2 | | wsrep_local_send_queue_max | 4 | | wsrep_local_send_queue_min | 0 | | wsrep_local_send_queue_avg | 0.0116917 | | wsrep_local_recv_queue | 0 | | wsrep_local_recv_queue_max | 10456 | | wsrep_local_recv_queue_min | 0 | | wsrep_local_recv_queue_avg | 1110.19 | +----------------------------+-----------+
And flow control is still active:
mysql> show global status like '%flow_control%'; +----------------------------------+------------------+ | Variable_name | Value | +----------------------------------+------------------+ | wsrep_flow_control_paused_ns | 2665586288434 | | wsrep_flow_control_paused | 0.492262 | | wsrep_flow_control_sent | 14251 | | wsrep_flow_control_recv | 14253 | | wsrep_flow_control_active | true | | wsrep_flow_control_requested | true | | wsrep_flow_control_interval | [ 10000, 10000 ] | | wsrep_flow_control_interval_low | 10000 | | wsrep_flow_control_interval_high | 10000 | | wsrep_flow_control_status | ON | +----------------------------------+------------------+
Node 2 and 3 also shows flow control as active:
mysql> show global status like '%flow_control%'; +----------------------------------+---------------+ | Variable_name | Value | +----------------------------------+---------------+ | wsrep_flow_control_paused_ns | 1690007980149 | | wsrep_flow_control_paused | 0.58526 | | wsrep_flow_control_sent | 0 | | wsrep_flow_control_recv | 3790 | | wsrep_flow_control_active | true | | wsrep_flow_control_requested | false | | wsrep_flow_control_interval | [ 500, 500 ] | | wsrep_flow_control_interval_low | 500 | | wsrep_flow_control_interval_high | 500 | | wsrep_flow_control_status | OFF | +----------------------------------+---------------+
Killing the threads doesn't fix the issue, the node needs to be restarted to fix the cluster:
*************************** 14. row *************************** Id: 376445 User: root Host: localhost db: test Command: Killed Time: 1188 State: wsrep: replicating and certifying write set Info: insert into t1 (c2) values (1) Time_ms: 1188294 Rows_sent: 0 Rows_examined: 0 *************************** 15. row *************************** Id: 376522 User: root Host: localhost db: test Command: Killed Time: 1187 State: wsrep: replicating and certifying write set Info: UPDATE t2 SET c5 = now(), c6 = now() where c3 = 1 and c2 = 'test' Time_ms: 1187716 Rows_sent: 0 Rows_examined: 1
How to repeat:
Use the attached my.cnf to create a 3 nodes PXC 8.0.36 cluster.
Create the following tables:
use test; CREATE TABLE t1 ( c1 int NOT NULL AUTO_INCREMENT, c2 int DEFAULT NULL, PRIMARY KEY (c1) ) ENGINE=InnoDB AUTO_INCREMENT=1 DEFAULT CHARSET=utf8mb3; CREATE TABLE t2 ( c1 int unsigned NOT NULL AUTO_INCREMENT, c2 varchar(255) NOT NULL, c3 int unsigned NOT NULL, c4 timestamp NOT NULL DEFAULT '2000-01-01 00:00:00', c5 timestamp NOT NULL DEFAULT '2000-01-01 00:00:00', c6 timestamp NOT NULL DEFAULT '2000-01-01 00:00:00', c7 timestamp NOT NULL DEFAULT '2000-01-01 00:00:00', c8 timestamp NOT NULL DEFAULT '2000-01-01 00:00:00', c9 timestamp NOT NULL DEFAULT '2000-01-01 00:00:00', PRIMARY KEY (c1) ) ENGINE=InnoDB AUTO_INCREMENT=2 DEFAULT CHARSET=utf8mb3; insert into t2 values (1,'test',1,now(),now(),now(),now(),now(),now());
On node 1, configure a 8M redo and strict durability settings:
innodb_log_file_size=8M sync_binlog=1 innodb_flush_log_at_trx_commit=1
On node 1, run the following command to produce a flow control flapping behavior:
while true; do { mysql -e "set global wsrep_provider_options = 'gcs.fc_limit=1'"; sleep 0.1; mysql -e "set global wsrep_provider_options = 'gcs.fc_limit=10000'"; } done;
And run the following workload:
while true; do { mysql -B -N test -e "UPDATE t2 SET c5 = now(), c6 = now() where c3 = 1 and c2 = 'test'; select now(),c6 from t2;"; } done; while true; do { mysql -B -N test -e "UPDATE t2 SET c5 = now(), c6 = now() where c3 = 1 and c2 = 'test'; select now(),c6 from t2;"; } done; while true; do { mysql test -e "insert into t1 (c2) values (1)"; } done;
On node 2, run the following commands:
while true; do { mysql test -e "UPDATE t2 SET c5 = now(), c6 = now() where c3 = 1 and c2 = 'test';"; } done; while true; do { mysql test -e "insert into t1 (c2) values (1)"; } done; while true; do { mysql test -e "insert into t1 (c2) values (1)"; } done;
Monitor the flow control on node 1, you may need adding more inserts in case the flapping happens between several seconds.
Since it’s a race condition, it may take seconds to minutes to trigger the bug.