Cluster enters a broken state after failed node restart

Description

The problem occurs when the cluster is under heavy application load and one node is restarted.

IST or SST fails with a timeout (it does not even start sending events), leaving the donor in an inconsistent state. After the failed join attempt, the donor never fully switches back to the SYNCED state: wsrep_local_state_comment reports SYNCED, but the expected Shifting ... message about the state change never appears in the error log, and the node's status in pxc_cluster_view stays DONOR. The donor remains in this state indefinitely, even though the joiner is already gone.
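The mismatch described above can be checked mechanically once both values have been read from the donor. The following is a minimal sketch, not part of PXC: the function name donor_state_mismatch is hypothetical, and it assumes the two inputs are the wsrep_local_state_comment value and the donor's STATUS column from pxc_cluster_view, fetched separately.

```python
def donor_state_mismatch(local_state_comment, view_status):
    """Return True when the node reports SYNCED locally while the
    cluster view still lists it as DONOR (the state from this report).

    Both arguments are plain strings read by the caller; comparison is
    case-insensitive because the two sources may differ in letter case.
    """
    local = local_state_comment.strip().upper()
    view = view_status.strip().upper()
    return local == "SYNCED" and view.startswith("DONOR")


# Values as described in this report: locally SYNCED, view says DONOR.
print(donor_state_mismatch("SYNCED", "DONOR"))
# A healthy node agrees with the cluster view.
print(donor_state_mismatch("Synced", "SYNCED"))
```

A True result only flags the inconsistency; it does not explain why the state transition was skipped.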

The failed joiner is never removed from wsrep_incoming_addresses, and the cluster size remains at 3, so the cluster view is clearly not updated correctly after the failed join.
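One way to spot the stale membership is to compare the wsrep_incoming_addresses value with the set of nodes known to be running. This is a rough sketch under assumptions: the helper name stale_members is hypothetical, the addresses in the example are made up for illustration, and the caller is assumed to know which nodes are actually reachable.

```python
def stale_members(incoming_addresses, reachable):
    """Return addresses listed in wsrep_incoming_addresses that are
    not in the caller-supplied collection of reachable nodes.

    incoming_addresses is the comma-separated status value; empty
    entries are ignored.
    """
    listed = [a.strip() for a in incoming_addresses.split(",") if a.strip()]
    return [a for a in listed if a not in reachable]


# Hypothetical addresses: the third node is the joiner that already left.
value = "10.0.0.1:3306,10.0.0.2:3306,10.0.0.3:3306"
alive = {"10.0.0.1:3306", "10.0.0.2:3306"}
print(stale_members(value, alive))
```

A non-empty result for a node that has been down for a while matches the symptom reported here.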

Moreover, even after the failed donor is stopped, the remaining third node, which did not participate in SST/IST at all, is also in a bad state: the cluster view reports two empty entries:

mysql> select * from performance_schema.pxc_cluster_view;
+----------------+--------------------------------------+--------+-------------+---------+
| HOST_NAME      | UUID                                 | STATUS | LOCAL_INDEX | SEGMENT |
+----------------+--------------------------------------+--------+-------------+---------+
| pxc-2          | 761a772c-7421-11ef-a683-ffcb20bd99e4 | SYNCED |           0 |       0 |
|                |                                      |        |           0 |       0 |
|                |                                      |        |           0 |       0 |
+----------------+--------------------------------------+--------+-------------+---------+
3 rows in set (0.00 sec)
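The empty entries in the output above can be detected automatically once the rows have been fetched. This is a small sketch: the function name empty_view_rows is hypothetical, and it assumes each row arrives as a (HOST_NAME, UUID, STATUS, LOCAL_INDEX, SEGMENT) tuple in the column order shown.

```python
def empty_view_rows(rows):
    """Return the indexes of pxc_cluster_view rows whose HOST_NAME,
    UUID and STATUS are all blank."""
    bad = []
    for i, (host, uuid, status, _local_index, _segment) in enumerate(rows):
        if not (host.strip() or uuid.strip() or status.strip()):
            bad.append(i)
    return bad


# Rows copied from the query output in this report.
rows = [
    ("pxc-2", "761a772c-7421-11ef-a683-ffcb20bd99e4", "SYNCED", 0, 0),
    ("", "", "", 0, 0),
    ("", "", "", 0, 0),
]
print(empty_view_rows(rows))  # [1, 2]
```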

The only way to recover from this state is to re-bootstrap the whole cluster.

Attached is the pmp trace taken from the ex-donor while it was in this halfway state.

Environment

None

AFFECTED CS IDs

CS0049663

Attachments

1
  • 16 Sep 2024, 01:20 PM

Done

Details

Needs QA

No

In progress time

1.5

Time tracking

No time logged, 1d 4h remaining

Created September 16, 2024 at 1:20 PM
Updated March 13, 2025 at 9:05 AM
Resolved October 21, 2024 at 4:00 PM
