A node abruptly leaving causes the applier thread to hang on all other nodes remaining in the cluster

Description

Node 1 of the cluster (the one receiving the writes) left with the message below:

Node 1 stays up after the gcomm abort, and the processlist shows the threads as killed, but they never finish:
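
As an illustrative check (not output captured from this incident), the lingering killed threads can be listed with a query like:

    -- Threads that were killed but are still present in the processlist
    SELECT ID, COMMAND, STATE, TIME, INFO
      FROM information_schema.PROCESSLIST
     WHERE COMMAND = 'Killed';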

Meanwhile, the other two nodes detect that node 1 is down and remain in the primary component, since they still form the majority of the cluster:

But both nodes are stuck with transactions in their certification queues:
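
For reference, both conditions can be confirmed with the standard Galera status counters; this is a sketch, and the expected values are assumptions based on the description above:

    -- On each surviving node: still part of the primary component?
    SHOW GLOBAL STATUS LIKE 'wsrep_cluster_status';  -- expected: Primary
    SHOW GLOBAL STATUS LIKE 'wsrep_cluster_size';    -- expected: 2
    -- A receive queue that grows and never drains means the appliers are stuck
    SHOW GLOBAL STATUS LIKE 'wsrep_local_recv_queue';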

And both nodes that remained in the cluster have a thread with State: Query and Info: innobase_commit_low.
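
A thread in that state can be spotted with a query along these lines (illustrative, not taken from the incident):

    -- Applier threads stuck in the InnoDB commit path
    SELECT ID, COMMAND, STATE, TIME, INFO
      FROM information_schema.PROCESSLIST
     WHERE INFO LIKE '%innobase_commit_low%';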

Attaching GDB, we find the following applier threads:

Thread 6 is waiting for the binlog mutex:

That mutex is held by thread 43 (the other applier thread):

That thread has the following backtrace:

It seems to be waiting for its turn to commit data:

The remaining nodes are completely stuck and don't write any data. 

 

The bug seems related to the abrupt departure of a node from the cluster. In both cases where we've seen this issue, the node left with a similar "failed to replay trx" message:

Reproduced on 8.0.32. I haven't tested/reproduced on 8.0.31. 

 

I'll leave the coredump in the comments.

Environment

None

AFFECTED CS IDs

CS0035963

Activity

Leonardo Bacchi Fernandes May 10, 2023 at 5:11 PM

Hello Kamil, 

 

I reproduced this by running a sysbench tpcc prepare on a 3-node PXC. I noticed you marked this as a duplicate of , but I just wanted to mention that an upgrade was not involved in this case.

Kamil Holubicki May 10, 2023 at 1:43 PM

This will be fixed by in 8.0.33

Kamil Holubicki May 10, 2023 at 6:05 AM
Edited

Hi , what are the steps to reproduce?

 

This also seems to be the same as https://jira.percona.com/browse/PXC-4211, which is actually good news.

 

Should be relatively easy to reproduce with --max-binlog-size=SOME_SMALL_VALUE
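
A minimal sketch of that idea, assuming the server's minimum of 4096 bytes for max_binlog_size (the specific value is an assumption, not from this report):

    -- Force very frequent binlog rotations to widen the race window,
    -- then drive concurrent writes (e.g. sysbench) and abort one node
    -- while rotations are in flight
    SET GLOBAL max_binlog_size = 4096;
    SHOW GLOBAL VARIABLES LIKE 'max_binlog_size';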

Kamil Holubicki May 10, 2023 at 5:59 AM

Done

Details

Assignee

Reporter

Needs QA

Yes

Affects versions

Priority


Created May 9, 2023 at 8:14 PM
Updated March 6, 2024 at 8:41 PM
Resolved May 19, 2023 at 7:53 AM