Done
Details
Assignee: Unassigned
Reporter: Leonardo Bacchi Fernandes
Needs QA: Yes
Fix versions:
Affects versions:
Priority: High
Created May 9, 2023 at 8:14 PM
Updated March 6, 2024 at 8:41 PM
Resolved May 19, 2023 at 7:53 AM
Node 1 of the cluster (the one receiving the writes) left the cluster with the message below:
Node 1 stays up after the gcomm abort, and the processlist shows the threads as killed, but they never finish:
Meanwhile, the other two nodes detect that node 1 is down and remain primary, as they still form the majority of the cluster:
But both nodes are stuck with transactions in their certification queue:
Both nodes that remained in the cluster also have a thread with state: Query and Info: innobase_commit_low
Attaching GDB, we find the following applier threads:
Thread 6 is waiting for the binlog mutex:
It is being held by thread 43 (the other applier thread):
Thread 43 has the following backtrace:
It seems to be waiting for its turn to commit data:
The remaining nodes are completely stuck and don't write any data.
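The two GDB observations above can be read as a wait-for graph between the applier threads. The following is a minimal Python sketch, not taken from the server code, of how such a graph could be checked for a cycle. The edge from thread 6 to thread 43 comes from the GDB output described above. The edge from thread 43 back to thread 6 is an assumption (that thread 6 is ahead of thread 43 in commit order), which this report does not confirm. The names find_cycle and wait_for are hypothetical and used only for illustration.

```python
def find_cycle(wait_for):
    """Return the threads forming a cycle in a wait-for graph, or None.

    wait_for maps a thread id to the single thread id it is blocked on.
    """
    for start in wait_for:
        path = []
        seen = set()
        node = start
        # Follow the chain of "is blocked on" edges until it ends or repeats.
        while node in wait_for and node not in seen:
            seen.add(node)
            path.append(node)
            node = wait_for[node]
        if node in seen:
            # The chain came back to a thread already visited: a deadlock.
            return path[path.index(node):]
    return None


# Observed in GDB: thread 6 waits for the binlog mutex held by thread 43.
# ASSUMPTION (not confirmed by the report): thread 43 waits for thread 6
# to commit first.
wait_for = {6: 43, 43: 6}
print(find_cycle(wait_for))
```

If the assumed edge is removed, the function returns None, so whether this is a true two-thread deadlock depends on what thread 43's commit turn is actually waiting for.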
The bug seems related to the abrupt departure of a node from the cluster. In both cases where we've seen this issue, the node left with a similar "failed to replay trx" message:
Reproduced on 8.0.32. I haven't tested/reproduced on 8.0.31.
I'll leave the coredump in the comments.