Second node (XXX-pxc-1) always selected as donor
Description
Environment
Attachments
- 06 Aug 2020, 08:42 AM
Activity
Tomislav Plavcic September 11, 2020 at 7:11 PM
Hi, thanks for the nice bug report and proposal for the solution!
Here's some background: node-0 is the writer by default, and while node-0 is being upgraded the highest-numbered node is selected as the writer in proxysql and haproxy. The intention here was to exclude node-0 and the highest node (node-2 in this case) from the donor list, so that the current writer node does not become the donor.
One issue here is that the trailing comma in the wsrep_sst_donor option is removed with "sed 's/,$//'", which matters because of how Galera selects a donor:
It first looks at the nodes specified in the donor list (irrespective of their segment). If no suitable donor is still found, the rest of the donor nodes are checked for suitability only if the donor list has a "terminating-comma".
Because this comma is removed here, donor selection just fails. This is not a problem for a 5-node cluster, where more nodes remain in the list, but with 3 nodes it is: node-1 wants to do SST from itself (since node-0 and node-2 are removed from this option), and we prevented Galera from searching for other donors.
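To make the 3-node case concrete, here is a minimal sketch that runs the current DONOR_ADDRESS pipeline outside the pod. The function name `current_donor_list` and the short pod names are assumptions for illustration; the real PEERS entries in the operator may be longer hostnames.

```shell
#!/usr/bin/env bash
# Current DONOR_ADDRESS pipeline from pxc-configure-pxc.sh, wrapped in a
# function (hypothetical name) so it can be run locally.
# $1 stands in for ${HOSTNAME}; the remaining args stand in for "${PEERS[@]}".
current_donor_list() {
    local host="$1"; shift
    printf '%s\n' "$@" "$host" \
        | sort --version-sort \
        | uniq \
        | grep -v -- '-0$' \
        | sed '$d' \
        | tr '\n' ',' \
        | sed 's/,$//'
}

# Hypothetical 3-node cluster, seen from cluster-pxc-1:
# node-0 is dropped by grep, node-2 by sed '$d', leaving only the node itself.
current_donor_list cluster-pxc-1 cluster-pxc-0 cluster-pxc-2
echo
# prints "cluster-pxc-1"
```

With no trailing comma in the result, Galera is not allowed to fall back to other nodes, which matches the "requested state transfer from 'api-pxc-1'" warning in the report.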
Serhii Prykhodko August 6, 2020 at 9:21 AM
proposed patch for pxc-configure-pxc.sh:
- DONOR_ADDRESS="$(printf '%s\n' "${PEERS[@]}" "${HOSTNAME}" | sort --version-sort | uniq | grep -v -- '-0$' | sed '$d' | tr '\n' ',' | sed 's/,$//')"
+ DONOR_ADDRESS="$(printf '%s\n' "${PEERS[@]}" | sort -r --version-sort | uniq | sed '$d' | tr '\n' ',' | sed 's/,$//')"
removed the local pod hostname from the donor list
allowed XXX-pxc-0 to be a donor (removed grep -v)
reversed the sort (so for pxc-1 the list of donors will be pxc-2,pxc-0)
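The patched pipeline can be checked the same way before deploying. This is only a sketch: the function name `proposed_donor_list` and the short pod names are hypothetical, and whether "${PEERS[@]}" includes the local pod is an assumption, so both cases are shown. Note that `sed '$d'` is still in the pipeline and drops the last entry after the reverse sort, so the output is worth comparing against the expected pxc-2,pxc-0.

```shell
#!/usr/bin/env bash
# Proposed DONOR_ADDRESS pipeline from the patch, wrapped in a function
# (hypothetical name) so it can be run locally.
# The arguments stand in for "${PEERS[@]}".
proposed_donor_list() {
    printf '%s\n' "$@" \
        | sort -r --version-sort \
        | uniq \
        | sed '$d' \
        | tr '\n' ',' \
        | sed 's/,$//'
}

# Case 1 (assumption): PEERS does not contain the local pod cluster-pxc-1.
proposed_donor_list cluster-pxc-0 cluster-pxc-2
echo

# Case 2 (assumption): PEERS also contains the local pod.
proposed_donor_list cluster-pxc-0 cluster-pxc-1 cluster-pxc-2
echo
```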
I can't repair the second node of the cluster because it chooses itself as a donor:
2020-08-06T08:16:31.234988Z 0 [Warning] WSREP: Member 0.0 (api-pxc-1) requested state transfer from 'api-pxc-1', but it is impossible to select State Transfer donor: Host is down
2020-08-06T08:16:31.235053Z 2 [ERROR] WSREP: Requesting state transfer failed: -112(Host is down)
2020-08-06T08:16:31.235088Z 2 [ERROR] WSREP: State transfer request failed unrecoverably: 112 (Host is down). Most likely it is due to inability to communicate with the cluster primary component. Restart required.
(full logs in attachment)
Steps to reproduce:
create a 3-node cluster
delete the pvc datadir-cluster-pxc-1 and the pod cluster-pxc-1
the pod cluster-pxc-1 is now in CrashLoopBackOff
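The steps above could be scripted roughly as follows. This is a sketch, assuming the cluster is named `cluster` and lives in the current kubectl namespace; the `reproduce` function and the `KUBECTL` override variable are hypothetical helpers, not part of the operator.

```shell
#!/usr/bin/env bash
# KUBECTL can be overridden (e.g. KUBECTL=echo) to preview the commands
# without touching a real cluster.
KUBECTL="${KUBECTL:-kubectl}"

# reproduce <cluster-name>: wipe the data volume and pod of the second node.
reproduce() {
    local cluster="$1"
    # --wait=false returns immediately; the PVC stays in Terminating
    # until the pod that uses it is deleted.
    "$KUBECTL" delete pvc "datadir-${cluster}-pxc-1" --wait=false
    "$KUBECTL" delete pod "${cluster}-pxc-1"
    # Once the pod has been recreated, check its status.
    "$KUBECTL" get pod "${cluster}-pxc-1"
}

# Usage (against an existing 3-node cluster named "cluster"):
#   reproduce cluster
```

The status may need to be checked a few times, since the pod restarts before settling into CrashLoopBackOff.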