A node abruptly leaving causes the applier thread to hang on all other nodes remaining in the cluster

Description

Node 1 of the cluster (the one receiving the writes) left with the message below:

Node 1 stays up after the gcomm abort, and the processlist shows the threads as killed, but they never finish:
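
As an illustrative check (not output captured from this incident), the lingering killed threads can be listed with a query like:

    -- Threads that were killed but are still present in the processlist
    SELECT ID, COMMAND, STATE, TIME, INFO
      FROM information_schema.PROCESSLIST
     WHERE COMMAND = 'Killed';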

Meanwhile, the other two nodes detect that node 1 is down and remain in the primary component, since they still form the majority of the cluster:

But both nodes are stuck with transactions in their certification queues:
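
For reference, both conditions can be confirmed with the standard Galera status counters; this is a sketch, and the expected values are assumptions based on the description above:

    -- On each surviving node: still part of the primary component?
    SHOW GLOBAL STATUS LIKE 'wsrep_cluster_status';  -- expected: Primary
    SHOW GLOBAL STATUS LIKE 'wsrep_cluster_size';    -- expected: 2
    -- A receive queue that grows and never drains means the appliers are stuck
    SHOW GLOBAL STATUS LIKE 'wsrep_local_recv_queue';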

And both nodes that remained in the cluster have a thread with State: Query and Info: innobase_commit_low.
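
A thread in that state can be spotted with a query along these lines (illustrative, not taken from the incident):

    -- Applier threads stuck in the InnoDB commit path
    SELECT ID, COMMAND, STATE, TIME, INFO
      FROM information_schema.PROCESSLIST
     WHERE INFO LIKE '%innobase_commit_low%';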

Attaching GDB, we find the following applier threads:

Thread 6 is waiting for the binlog mutex:

That mutex is held by thread 43 (the other applier thread):

That thread has the following backtrace:

It seems to be waiting for its turn to commit data:

The remaining nodes are completely stuck and don't write any data. 

 

The bug seems related to the abrupt departure of a node from the cluster. In both cases where we've seen this issue, the node left with a similar "failed to replay trx" message:

Reproduced on 8.0.32. I haven't tested/reproduced on 8.0.31. 

 

I'll leave the coredump in the comments.

Environment

None

AFFECTED CS IDs

CS0035963

Activity

Leonardo Bacchi Fernandes May 10, 2023 at 5:11 PM

Hello Kamil, 

 

I reproduced this by running a sysbench tpcc prepare on a 3-node PXC. I noticed you marked this as a duplicate of , but I just wanted to mention that an upgrade was not involved in this case.

Kamil Holubicki May 10, 2023 at 1:43 PM

This will be fixed by in 8.0.33

Kamil Holubicki May 10, 2023 at 6:05 AM
Edited

Hi , what are the steps to reproduce?

 

This also seems to be the same as https://jira.percona.com/browse/PXC-4211, which is actually good news.

 

Should be relatively easy to reproduce with --max-binlog-size=SOME_SMALL_VALUE
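
A minimal sketch of that idea, assuming the server's minimum of 4096 bytes for max_binlog_size (the specific value is an assumption, not from this report):

    -- Force very frequent binlog rotations to widen the race window,
    -- then drive concurrent writes (e.g. sysbench) and abort one node
    -- while rotations are in flight
    SET GLOBAL max_binlog_size = 4096;
    SHOW GLOBAL VARIABLES LIKE 'max_binlog_size';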

Kamil Holubicki May 10, 2023 at 5:59 AM

Done

Details

Assignee

Reporter

Needs QA

Yes

Affects versions

Priority


Created May 9, 2023 at 8:14 PM
Updated March 6, 2024 at 8:41 PM
Resolved May 19, 2023 at 7:53 AM