Flow control flapping hangs the cluster

Description

On PXC 8.0.36, a flapping flow control scenario may hang the cluster in a multi-writer environment. It also affects 5.7.44 and 5.7.25.

InnoDB status from the affected node shows threads in replicating state:

The receive queue does not show write-sets:

And flow control is still active:

Nodes 2 and 3 also show flow control as active:

Killing the threads doesn't fix the issue; the node needs to be restarted to recover the cluster:

How to repeat:

  1. Use the attached my.cnf to create a 3-node PXC 8.0.36 cluster.

  2. Create the following tables:

  3. On node 1, configure an 8M redo log and strict durability settings:

  4. On node 1, run the following command to produce flow control flapping:

And run the following workload:

  5. On node 2, run the following commands:

Monitor flow control on node 1. You may need to add more inserts if the flapping only happens every several seconds.

Since it’s a race condition, it may take seconds to minutes to trigger the bug.
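For the monitoring step, a minimal sketch of how sampled status output could be checked for flapping and for the stuck signature described above (flow control active while the receive queue is empty). The helper names parse_status, count_flaps and looks_stuck are hypothetical, and the sketch assumes the node exposes wsrep_flow_control_status (ON/OFF) alongside wsrep_local_recv_queue; it is not part of the original reproduction scripts.

```python
"""Sketch: inspect sampled wsrep status values for flow control flapping.

Assumptions (not from the report): samples are the text output of
SHOW GLOBAL STATUS LIKE 'wsrep_%' taken at a fixed interval, and the
node reports wsrep_flow_control_status as ON or OFF.
"""


def parse_status(text):
    """Turn mysql table-style or tab-separated status output into a dict."""
    status = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("+"):
            continue  # skip blank lines and table borders
        if line.startswith("|"):
            cells = [c.strip() for c in line.strip("|").split("|")]
        else:
            cells = line.split(None, 1)  # batch mode: name, whitespace, value
        if len(cells) == 2 and cells[0] != "Variable_name":
            status[cells[0]] = cells[1].strip()
    return status


def count_flaps(fc_states):
    """Count ON/OFF transitions in a sequence of flow control states."""
    return sum(1 for prev, cur in zip(fc_states, fc_states[1:]) if prev != cur)


def looks_stuck(samples, min_samples=3):
    """True if the last min_samples samples all show flow control ON
    with an empty receive queue. This is only a heuristic: a single
    such sample can also occur during normal operation."""
    if len(samples) < min_samples:
        return False
    return all(
        s.get("wsrep_flow_control_status", "").upper() == "ON"
        and s.get("wsrep_local_recv_queue") == "0"
        for s in samples[-min_samples:]
    )
```

One way to use it is to sample the status once per second on node 1, keep the parsed dicts in a list, and raise the insert rate if count_flaps stays low. A True result from looks_stuck is only a hint to go check the InnoDB status for threads stuck in replicating state.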

Environment

None

AFFECTED CS IDs

CS0046107

Attachments

4

Activity

Kamil Holubicki September 27, 2024 at 7:44 AM

Yes.

Scott Hooper September 26, 2024 at 5:28 PM

Did this make it in the 8.0.37-29 released code?

Aaditya Dubey July 19, 2024 at 6:43 AM

Hi

Please find the steps below:

step1: Clone anydbver from

step2: Navigate to the following path and add the my.cnf options:

step3: Add the following options to pxc8-repl-gtid.cnf, then save and exit:

step4: Deploy PXC 8.0.36 using anydbver with the following command:

step5: Connect to node1 and type mysql in the node1 terminal to open a client session

step6: Open the my.cnf file on node1, add the following parameters, then save and exit:

step7: restart node1:

step8: Run the following commands in the node1 terminal in the background:

step9: In the same way, run the following set of commands on node2:

step10: Let it run for a few seconds to minutes, then connect to the node1 mysql client and observe flow control using the following commands:

step11: Once wsrep_local_recv_queue shows 0, try killing those queries and also check the InnoDB status, where you will see stuck transactions:

Aaditya Dubey July 18, 2024 at 5:37 PM

Hi

I’m able to repeat the behaviour as described.

Kamil Holubicki July 18, 2024 at 8:59 AM

Hi, unfortunately I'm not able to reproduce the issue. I tried for several hours without success.

Here is my setup:

  1. PXC 8.0.36

  2. Use the attached config file n1.cnf (modify it for node2 and node3 according to the comments around line 46)

  3. Start the cluster of 3 nodes

  4. Start node-1-run.sh

  5. Wait until db is set up and the workload starts

  6. Start node-2-run.sh

  7. Wait

 

I tried with a different number of insert workloads as suggested, but unfortunately, I’m not able to reproduce the issue. Maybe I’m doing something wrong?

Done

Details

Time tracking

No time logged; 1w 1d 2h remaining

Created July 13, 2024 at 2:24 AM
Updated December 23, 2024 at 11:40 AM
Resolved September 27, 2024 at 7:45 AM