Flow control flapping hangs the cluster

Description

On PXC 8.0.36, a flapping flow control scenario may hang the cluster in a multi-writer environment. It also affects 5.7.44 and 5.7.25.

InnoDB status from the affected node shows threads in replicating state:

The receive queue does not show write-sets:

And flow control is still active:

Nodes 2 and 3 also show flow control as active:

Killing the threads doesn't fix the issue; the node needs to be restarted to recover the cluster:
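The hang signature described above (an empty receive queue while flow control is still reported as active) can be checked mechanically. The following is a minimal sketch, not the reporter's actual commands: it assumes the status values have already been collected into a dictionary, and it assumes flow control state is read from `wsrep_flow_control_status` with values ON/OFF. The function name `looks_stuck` is hypothetical.

```python
def looks_stuck(status: dict[str, str]) -> bool:
    """Return True when the receive queue is empty but flow control is still active.

    `status` maps status variable names to their string values, for example
    as read from SHOW GLOBAL STATUS. The flow control variable name used
    here is an assumption; adjust it to whatever your version reports.
    """
    recv_queue = int(status.get("wsrep_local_recv_queue", "0"))
    fc_active = status.get("wsrep_flow_control_status", "OFF").upper() == "ON"
    return recv_queue == 0 and fc_active


if __name__ == "__main__":
    snapshot = {"wsrep_local_recv_queue": "0", "wsrep_flow_control_status": "ON"}
    print(looks_stuck(snapshot))
```

A single positive result is not conclusive, since flow control can legitimately be active for a moment with an empty queue; the condition is only suspicious if it persists across many samples.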

How to repeat:

Use the attached my.cnf to create a 3-node PXC 8.0.36 cluster.
Create the following tables:

On node 1, configure an 8M redo log and strict durability settings:

On node 1, run the following command to produce flow control flapping:

And run the following workload:

On node 2, run the following commands:

Monitor flow control on node 1; you may need to add more inserts if the flapping only happens several seconds apart.

Since it’s a race condition, it may take seconds to minutes to trigger the bug.
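To judge whether the workload is actually producing flapping, one option is to sample the flow control state at a fixed interval and count how often it toggles. The sketch below works on an already collected list of samples (True meaning flow control was active); the function names and the default threshold of 4 transitions are illustrative assumptions, not values from this report.

```python
from typing import Sequence


def count_transitions(samples: Sequence[bool]) -> int:
    """Count how many times consecutive samples differ (ON->OFF or OFF->ON)."""
    return sum(1 for prev, cur in zip(samples, samples[1:]) if prev != cur)


def is_flapping(samples: Sequence[bool], min_transitions: int = 4) -> bool:
    """Treat the window as flapping if it toggles at least `min_transitions` times.

    The threshold is an arbitrary illustrative default; tune it to the
    sampling interval and window length you use.
    """
    return count_transitions(samples) >= min_transitions


if __name__ == "__main__":
    window = [False, True, False, True, False, True]
    print(count_transitions(window), is_flapping(window))
```

If the window shows few transitions, that matches the case described above where the flapping is spread over several seconds and more insert load is needed.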

Environment

None

AFFECTED CS IDs

CS0046107

Attachments

Linked work items

blocks

PXC-4407

PXC 8.0.37

Activity

Kamil Holubicki September 27, 2024 at 7:44 AM

Yes.

Scott Hooper September 26, 2024 at 5:28 PM

Did this make it in the 8.0.37-29 released code?

Aaditya Dubey July 19, 2024 at 6:43 AM

Hi

Please find the steps below:

Step 1: Clone anydbver from

Step 2: Navigate to the following path and add the my.cnf options:

Step 3: Add the following options to pxc8-repl-gtid.cnf, then save and exit:

Step 4: Deploy PXC 8.0.36 using anydbver with the following command:

Step 5: Connect to node1 and type mysql in the node1 terminal to open a client session:

Step 6: Navigate to the my.cnf file on node1, add the following parameters, then save and exit:

Step 7: Restart node1:

Step 8: Run the following commands in the node1 terminal in the background:

Step 9: In the same way, run the following set of commands on node2:

Step 10: Let it run for a few seconds to minutes, then connect to the node1 mysql client and observe flow control using the following commands:

Step 11: Once you start seeing | wsrep_local_recv_queue | 0 |, try killing those queries and also check the InnoDB status, where you will see stuck transactions:
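When watching for the `| wsrep_local_recv_queue | 0 |` row in a script rather than by eye, the tabular client output can be parsed into a dictionary. This is a small sketch that assumes the usual two-column `Variable_name | Value` table layout printed by the mysql client; the function name `parse_status_table` is hypothetical, and the sample text is made up for illustration rather than taken from the affected cluster.

```python
def parse_status_table(text: str) -> dict[str, str]:
    """Parse two-column mysql client table output into {name: value}."""
    rows: dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip border lines such as +------+------+ and any non-table text.
        if not line.startswith("|"):
            continue
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        # Keep only name/value rows and drop the header row.
        if len(cells) != 2 or cells[0] == "Variable_name":
            continue
        rows[cells[0]] = cells[1]
    return rows


if __name__ == "__main__":
    sample = """
    +------------------------+-------+
    | Variable_name          | Value |
    +------------------------+-------+
    | wsrep_local_recv_queue | 0     |
    +------------------------+-------+
    """
    print(parse_status_table(sample))
```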

Aaditya Dubey July 18, 2024 at 5:37 PM

I’m able to repeat the behaviour as described.

Kamil Holubicki July 18, 2024 at 8:59 AM

Hi, unfortunately I’m not able to reproduce this. I tried for several hours and nothing happened.

Here is my setup:

PXC 8.0.36
Use the attached config file n1.cnf (modify it for node2 and node3 according to the comments around line 46)
Start the cluster of 3 nodes
Start node-1-run.sh
Wait until db is set up and the workload starts
Start node-2-run.sh
Wait

I tried with a different number of insert workloads as suggested, but unfortunately, I’m not able to reproduce the issue. Maybe I’m doing something wrong?

Done

Details

Assignee

Kamil Holubicki

Reporter

Juan Arruti

Labels

cs-tag-011

Needs QA

Yes

In progress time

6.25

Time tracking

No time logged; 1w 1d 2h remaining

Sprint

None

Fix versions

8.0.37-29 (Q2 2024)

8.4.0 (Q2 2024)

Affects versions

8.0.36-28 (Q1 2024)

Priority

Medium

Created July 13, 2024 at 2:24 AM

Updated December 23, 2024 at 11:40 AM

Resolved September 27, 2024 at 7:45 AM