Nodes "changing identity" can prevent primary groups
Description
Activity

Kamil Holubicki January 23, 2024 at 5:20 PM
Easier steps to reproduce:
Start a 2-node cluster
Break network connectivity between the nodes
Shut down node 2
Bring the network up again
Start node 2
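For reference, a minimal sketch of the steps above, assuming both nodes already run as Docker containers on a user-defined network. The network name, container name, shutdown grace period and wait time are all assumptions and not taken from this ticket; adjust them to the actual setup.

```python
import subprocess
import time


def reproduction_commands(network="pxc-network", node2="pxc-node2"):
    """Return the docker commands for the steps after cluster start, in order.

    `network` and `node2` are hypothetical names used for illustration only.
    """
    return [
        # Break network connectivity between the nodes
        ["docker", "network", "disconnect", network, node2],
        # Shut down node 2; the 120 s grace period is a guess meant to give
        # mysqld enough time for a clean shutdown before docker kills it
        ["docker", "stop", "-t", "120", node2],
        # Bring the network up again
        ["docker", "network", "connect", network, node2],
        # Start node 2
        ["docker", "start", node2],
    ]


def run(commands, settle_seconds=60):
    """Run each command, then wait so the cluster can react to the change.

    `settle_seconds` is an assumed value, long enough for the nodes to
    notice the partition and go non-primary.
    """
    for argv in commands:
        subprocess.run(argv, check=True)
        time.sleep(settle_seconds)
```

Calling `run(reproduction_commands())` would execute the four steps with a pause after each one; `reproduction_commands()` alone only returns the command list, which is handy for a dry run.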
What happens?
When the network is down, both nodes go non-primary, so node 1 sees node 2 as partitioned.
Node 2 is shut down. During the clean shutdown, the gvwstate.dat file is deleted.
When node 2 is started after the clean shutdown, its identity changes (no gvwstate.dat file).
Node 1 sees node 2 joining the cluster:
the gmcast layer detects that the identity changed
the evs layer does not keep this information
the pc layer is notified about node 2 joining with a new identity, but it still keeps the old identity and thinks this node is partitioned. On this layer there is no link between the old and new identities, so it does not know they are the same node
Solution: the pc layer has to be notified about the node identity change.
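
To illustrate why the missing link between old and new identities matters, here is a toy model in Python. It is not Galera code: the `PcModel` class, the `identity_changed` notification and the strict-majority quorum rule are all simplifying assumptions made for this sketch, and the real quorum computation is more involved.

```python
from dataclasses import dataclass, field


@dataclass
class PcModel:
    """Toy stand-in for the pc layer's bookkeeping; not Galera code."""
    known: set = field(default_factory=set)      # identities still remembered
    reachable: set = field(default_factory=set)  # identities currently connected

    def join(self, identity):
        self.known.add(identity)
        self.reachable.add(identity)

    def partition(self, identity):
        # The identity stays in `known`, so it keeps counting against quorum.
        self.reachable.discard(identity)

    def identity_changed(self, old, new):
        # Hypothetical notification: forget `old` because `new` replaces it.
        self.known.discard(old)
        self.reachable.discard(old)
        self.join(new)

    def has_quorum(self):
        # ASSUMPTION: plain strict majority over all remembered identities.
        return 2 * len(self.reachable) > len(self.known)


def simulate(restarts, notify_pc):
    """Return (remembered identities, quorum?) as seen from node1."""
    pc = PcModel()
    node3 = "node3#0"
    for identity in ("node1", "node2", node3):
        pc.join(identity)
    # Network break: node1 loses sight of node2 and node3.
    pc.partition("node2")
    pc.partition(node3)
    # node3 restarts cleanly several times, with a new identity each time.
    for i in range(1, restarts + 1):
        new = f"node3#{i}"
        if notify_pc:
            pc.identity_changed(node3, new)
        else:
            pc.partition(node3)  # old identity lingers as "partitioned"
            pc.join(new)
        node3 = new
    pc.join("node2")  # network restored for node2
    return len(pc.known), pc.has_quorum()
```

With these assumptions, `simulate(3, notify_pc=False)` returns `(6, False)`: three stale node3 identities pile up and the three live nodes no longer form a majority. `simulate(3, notify_pc=True)` returns `(3, True)`. Only the accumulation of stale identities should be read as representative of the real behavior, not the exact threshold.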

yoann.lacancellera November 17, 2023 at 10:08 AM
About "stop pxc-node3":
It used to reproduce fairly easily on 5.7 because nodes aborted and shut down when non-primary, which apparently no longer happens that often in PXC 8.0.
Still, nodes do get stopped in some production environments due to automation (e.g. Puppet), manual restarts, a cancelled SST, and probably other reasons I am still looking into.
So including this "stop" does not seem like a stretch to me.

yoann.lacancellera November 17, 2023 at 9:46 AM
Reproduction:
1. Install regular PXC on Docker
https://docs.percona.com/percona-xtradb-cluster/8.0/docker.html
Follow all the steps; no modification is needed
2. Break the network
=> disconnect the network on pxc node2 + node3
Then, after giving node1 some time to go non-primary, reconnect node2. It will stay non-primary
Alternative: reconnecting the network on node2 and node3 at the same time would enable "merge quorum"
Stop node3, reconnect the network, restart it
Alternative: just reconnecting the network can enable merge quorum again in this case
All nodes will stay non-primary
Now, restarting node3 in a loop will make it appear duplicated
The nodes then won't forget about the last one; for some reason it's not cleaned up:

yoann.lacancellera October 12, 2023 at 1:46 PM
Will do, working on it

Kamil Holubicki October 12, 2023 at 8:46 AM
Please provide steps to reproduce.

After certain issues, node logs can be flooded with "changed identity" events
This ultimately gives views like the following (it's supposed to be a 3-node cluster)
This only results in non-primary, even when a majority of nodes should have been able to merge quorum
Translating the above shows that the huge list of "partitioned" entries is the same node over and over again: