Nodes "changing identity" can prevent primary groups

Description

After certain issues, node logs can be flooded with "changed identity" events.

This ultimately produces views like the following (for what is supposed to be a 3-node cluster):

The result is a non-primary cluster, even when a majority of nodes should have been able to merge quorum.

Translating the above shows that the huge list of "partitioned" members is the same node over and over again:

Environment

None

AFFECTED CS IDs

CS0040422

Activity

Kamil Holubicki 
January 23, 2024 at 5:20 PM

Easier steps to reproduce:

  1. Start a 2-node cluster

  2. Break network connectivity between the nodes

  3. Shut down node 2

  4. Bring the network up again

  5. Start node 2
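The steps above can be sketched as a small script. This is only a sketch, assuming the cluster runs in Docker containers as in the Docker-based reproduction further down this ticket. The network name `pxc-network`, the container name `pxc-node2`, and the 60 second stop timeout are assumptions, not values taken from this ticket. The script defaults to a dry run that only prints the commands.

```python
import subprocess


def repro_commands(network="pxc-network", node="pxc-node2", stop_timeout=60):
    """Return the docker CLI calls for steps 2 to 5, in order.

    The network name, container name and timeout are assumptions;
    adjust them to match the actual setup.
    """
    return [
        # Step 2: break network connectivity for the node.
        ["docker", "network", "disconnect", network, node],
        # Step 3: clean shutdown (SIGTERM, with time for mysqld to exit).
        ["docker", "stop", "-t", str(stop_timeout), node],
        # Step 4: bring the network up again.
        ["docker", "network", "connect", network, node],
        # Step 5: start the node again.
        ["docker", "start", node],
    ]


def run(commands, dry_run=True):
    """Print the commands, or execute them when dry_run is False."""
    for argv in commands:
        if dry_run:
            print(" ".join(argv))
        else:
            subprocess.run(argv, check=True)


if __name__ == "__main__":
    run(repro_commands())
```

When running this for real, leave enough time between step 2 and step 3 for both nodes to go non-primary; the script does not wait on its own.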

What happens?

  1. When the network is down, both nodes go into non-primary. So node 1 sees node 2 as partitioned.

  2. Node 2 shuts down. During the clean shutdown, the gvwstate.dat file is deleted.

  3. When Node 2 is started after the clean shutdown, its identity changes (there is no gvwstate.dat file).

  4. Node 1 sees Node 2 joining the cluster:

    1. The gmcast layer detects that the identity changed

    2. The evs layer does not keep this information

    3. The pc layer is notified about node 2 joining with a new identity, but it still keeps the old identity and thinks that node is partitioned. On this layer there is no link between the old and new identity, so it does not know it is the same node

Solution: pc layer has to be notified about node identity change.
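To illustrate why the missing link matters, here is a toy model of the situation. It is not Galera code: `ToyPcView`, `on_identity_change`, and the "strictly more than half of the last primary members" rule are simplifying assumptions made for this sketch, and the real quorum computation is more involved.

```python
from dataclasses import dataclass, field


@dataclass
class ToyPcView:
    """Toy stand-in for what the pc layer remembers about the last primary."""

    last_primary: set = field(default_factory=set)

    def has_quorum(self, present):
        # Simplified rule (assumption): strictly more than half of the
        # remembered members must be present in the current view.
        return 2 * len(self.last_primary & present) > len(self.last_primary)

    def on_identity_change(self, old, new):
        # Hypothetical hook: replace the old identity with the new one,
        # so the restarted node is not counted as a partitioned stranger.
        if old in self.last_primary:
            self.last_primary.discard(old)
            self.last_primary.add(new)


def demo():
    # Node 2 came back after a clean shutdown with a new identity.
    present = {"node1", "node2-new"}

    without_hook = ToyPcView({"node1", "node2-old"})

    with_hook = ToyPcView({"node1", "node2-old"})
    with_hook.on_identity_change("node2-old", "node2-new")

    return without_hook.has_quorum(present), with_hook.has_quorum(present)


if __name__ == "__main__":
    print(demo())
```

In this model, the view without the notification still waits for "node2-old" and never reaches a majority, while the view that was told about the identity change does.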

yoann.lacancellera 
November 17, 2023 at 10:08 AM

About "stop pxc-node3":

It used to reproduce fairly easily on 5.7 because nodes aborted and shut down when they went non-primary, which apparently no longer happens that often in PXC 8.0.

That said, nodes still get stopped in some production environments because of automation (e.g. Puppet), manual restarts, a cancelled SST, and probably other reasons I am still looking into.

So including this "stop" does not seem like a stretch to me.

yoann.lacancellera 
November 17, 2023 at 9:46 AM

Reproduction:

1. Install regular PXC on Docker

https://docs.percona.com/percona-xtradb-cluster/8.0/docker.html

Follow all the steps; no modification is needed.

2. Break the network

=> Disconnect the network on PXC node2 and node3.

Then, after giving node1 some time to go non-primary, reconnect node2. It will stay non-primary.

Alternative: reconnecting the network on node2 and node3 at the same time would enable "merge quorum".

Stop node3, reconnect the network, then restart node3.

Alternative: just reconnecting the network can enable merge quorum again in this case.

All nodes will stay non-primary.

 

 

 

Now, restarting node3 in a loop will make it appear duplicated.

The nodes then won't forget about the last one; for some reason it is not cleaned up:

yoann.lacancellera 
October 12, 2023 at 1:46 PM

Will do, working on it

Kamil Holubicki 
October 12, 2023 at 8:46 AM

Please provide steps to reproduce.

Done

Details

Time tracking

5h logged

Created October 11, 2023 at 2:45 PM
Updated May 2, 2024 at 12:54 PM
Resolved April 8, 2024 at 12:13 PM