Done
Details
Assignee: Lalit Choudhary
Reporter: Przemyslaw Malkowski
Planned Version/s:
Needs QA: No
In progress time: 1.5
Time tracking: No time logged, 1d 4h remaining
Sprint: None
Fix versions:
Affects versions:
Priority: High
Created September 16, 2024 at 1:20 PM
Updated March 13, 2025 at 9:05 AM
Resolved October 21, 2024 at 4:00 PM
The problem happens when the cluster is under intensive application load, and one node is restarted.
IST or SST fails with a timeout (it does not even start sending events), leaving the donor in a confused state. After a failed join attempt, the donor never fully switches back to the SYNCED state. Instead, it shows wsrep_local_state_comment as SYNCED, but it never prints the expected "Shifting ..." error log message about the state change, and its state in pxc_cluster_view remains DONOR. The donor stays like that indefinitely, even though the joiner is already gone.
In wsrep_incoming_addresses, the failed joiner is never removed, and the cluster size remains at 3, which shows that the cluster view is not updated correctly after the join failure.
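The mismatch described above could be checked mechanically. The following is only a minimal sketch in Python, assuming the value of wsrep_local_state_comment and the rows of performance_schema.pxc_cluster_view have already been fetched by some client. The function name, the dict-based row shape, and identifying the local node by host name are all assumptions made for illustration, not part of PXC.

```python
def donor_state_mismatch(local_host_name, local_state_comment, cluster_view_rows):
    """Return True if the node reports SYNCED in wsrep_local_state_comment
    while its own row in pxc_cluster_view still says DONOR.

    cluster_view_rows is assumed to be a list of dicts keyed by the
    column names of pxc_cluster_view (HOST_NAME, STATUS, ...).
    """
    for row in cluster_view_rows:
        if row["HOST_NAME"] == local_host_name:
            # Compare case-insensitively, since the status variable and
            # the table may use different capitalization.
            return (local_state_comment.upper() == "SYNCED"
                    and row["STATUS"].upper() == "DONOR")
    # Local node not found in the view: nothing to compare.
    return False
```

A monitoring script might call this on each node after a failed join attempt and raise an alert when it returns True.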
Moreover, we could see that even after stopping the failed donor, the remaining third node, which did not even participate in SST/IST, is also in a bad state, as its cluster view reports two empty entries:
mysql> select * from performance_schema.pxc_cluster_view;
+----------------+--------------------------------------+--------+-------------+---------+
| HOST_NAME      | UUID                                 | STATUS | LOCAL_INDEX | SEGMENT |
+----------------+--------------------------------------+--------+-------------+---------+
| pxc-2          | 761a772c-7421-11ef-a683-ffcb20bd99e4 | SYNCED |           0 |       0 |
|                |                                      |        |           0 |       0 |
|                |                                      |        |           0 |       0 |
+----------------+--------------------------------------+--------+-------------+---------+
3 rows in set (0.00 sec)
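The empty entries in the output above could be detected automatically. This is a minimal sketch in Python, assuming the rows of pxc_cluster_view have been fetched into dicts keyed by column name. The helper name and the row shape are hypothetical and only illustrate the check.

```python
def blank_view_entries(cluster_view_rows):
    """Return the rows whose HOST_NAME, UUID and STATUS are all empty."""
    keys = ("HOST_NAME", "UUID", "STATUS")
    return [
        row for row in cluster_view_rows
        # A row counts as blank when none of the identifying columns
        # holds a non-whitespace value (None is treated as empty).
        if not any((row.get(k) or "").strip() for k in keys)
    ]


# Rows transcribed from the query output in this report.
rows = [
    {"HOST_NAME": "pxc-2", "UUID": "761a772c-7421-11ef-a683-ffcb20bd99e4",
     "STATUS": "SYNCED", "LOCAL_INDEX": 0, "SEGMENT": 0},
    {"HOST_NAME": "", "UUID": "", "STATUS": "", "LOCAL_INDEX": 0, "SEGMENT": 0},
    {"HOST_NAME": "", "UUID": "", "STATUS": "", "LOCAL_INDEX": 0, "SEGMENT": 0},
]
print(len(blank_view_entries(rows)))  # 2
```

Any non-empty result on a running node would point to the broken cluster view described here.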
The only way to recover from this state is to re-bootstrap the whole cluster.
Attached is the pmp trace from the ex-donor being in the halfway state.