Cluster state interruption during NBO may lead to permanent cluster lock
Description
Environment
AFFECTED CS IDs
Activity

Kamil Holubicki June 30, 2023 at 2:34 PM
Looks like there are more corner cases in this area.
I just found that setting up 2 nodes and skipping the iptables step (so only sysbench + NBO) is enough to reproduce the problem.
Investigating...
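For illustration, the DDL side of such a minimal setup could look roughly like the following, run repeatedly on one node while sysbench drives concurrent DML on the other (the sbtest1 table and index name are assumptions, not taken from this report):
-- illustrative NBO DDL loop body; sysbench's sbtest1 table is assumed
set session wsrep_OSU_method = 'NBO';
create index idx_c on sbtest1 (c);
drop index idx_c on sbtest1;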

Przemyslaw Malkowski June 21, 2023 at 8:34 AM
The limitation itself is not the problem. The problem is that the cluster becomes completely blocked, or the replication plugin even aborts. Both behaviors lead to downtime, and the whole cluster has to be restarted to recover the service.
An operation that can't succeed due to the limitation should be put into a wait state or rejected with an error, but the cluster should keep serving the unrelated workload.

Kamil Holubicki June 20, 2023 at 1:19 PM
I think it works fine. There is a documented limitation:
So I think this is a feature request rather than a bug?

Kamil Holubicki June 19, 2023 at 3:32 PM (edited)
This is what happens:
A replicated transaction conflicts with the locally executed NBO transaction. This happens because the NBO transaction releases the ApplyOrder monitor after nbo_phase_1. In the TOI case, apply order is enforced by holding the ApplyOrder monitor for the whole duration of the TOI operation, which prevents the replicated transaction from being applied and causing a BF-BF abort.
How to reproduce (reduced testcase):
create table t1 (a int primary key, b int);
insert into t1 values (0, 0), (1, 1);
1. node_1:
2. node_2:
We need to pause the NBO applier thread just before it starts applying events (and acquiring MDL locks). This allows the other transaction (point 4) to replicate without waiting for any MDL lock; see the sketch after these steps.
There is no debug sync point there, so I added one just before the apply_events() call:
3. node_1:
It waits on the execute_command_after_close_tables debug sync point.
4. node_2:
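A rough SQL-level sketch of the conflict described above (the table is the one created earlier; the pausing mechanics are those of steps 1-4 and are only summarized in the comments here):
-- node_1: run the DDL under NBO; per step 3 it waits on the
-- execute_command_after_close_tables debug sync point after nbo_phase_1
set session wsrep_OSU_method = 'NBO';
alter table t1 add index idx_b (b);

-- node_2: with the NBO applier paused before apply_events() (step 2),
-- this transaction does not wait for any MDL lock and replicates (step 4)
update t1 set b = b + 10 where a = 1;

-- on node_1 the replicated update then collides with the local NBO
-- transaction, because the ApplyOrder monitor was already released after
-- nbo_phase_1, producing the BF-BF abort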

Przemyslaw Malkowski June 16, 2023 at 1:00 PM
I managed to get a reproducible test case for the NBO-related cluster collapse. Although the behavior differs a bit from what I reported before, I believe the root cause is the same - a BF-BF conflict between NBO and a high-prio thread.
Steps to reproduce:
Set up a PXC 8.0.32 cluster using separate VMs (I used Vagrant with the virsh provider and a RockyLinux 9 image)
Install sysbench on one node and prepare with:
No particular settings seem to matter, but I disabled binary logs to avoid other NBO-related issues (binary log unable to rotate):
Follow this sequence of actions:
Second concurrent session on node1:
Now, re-run sysbench and/or the bash loop with CREATE/DROP whenever they exit with errors.
Repeat until you see the first BF-BF abort; at that point, you can stop the loops and remove the iptables rules with:
At this point, the Galera plugin aborts, but mysqld remains up and running.
The relevant error log fragments:
node1 log:
node2 log:
node3 log:
Do note that the high-prio thread on node3 is replaying.
Some views into the situation; note that the system threads are in a killed state:
The only way to recover is to shut down all nodes and bootstrap the cluster from the best candidate.
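The "best candidate" here is the node with the most advanced committed state; a quick way to compare nodes before the shutdown (illustrative, not part of the original report) is:
show global status like 'wsrep_last_committed';
If a node is no longer responsive over SQL, the seqno recorded in its grastate.dat after shutdown can be compared instead.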
Details
Assignee: Unassigned
Reporter: Przemyslaw Malkowski
Needs Review: Yes
Needs QA: Yes
Fix versions:
Affects versions:
Priority: High
Description
Occasionally, during an ALTER running with wsrep_osu_method=NBO, when cluster members briefly lose connectivity, the common consensus about the ALTER may be lost, and the nodes may stay stuck finishing the ALTER forever.
When this was observed, two nodes were still trying to finish the ALTER while the third was not. Even so, that node, despite having no ALTER in progress, could not become a donor because the NBO flag was still set.
Unfortunately, I am unable to reproduce this on demand. I reproduced it once by blocking the network intermittently on one node, but I am unable to create a 100% reproducible test case.
Here is the status of the nodes during the deadlock that has already occurred. The ALTER was issued on node1:
Nodes that replicated the event:
For some reason, the ALTER was gone on node3 but remained stuck forever on node1 and node2.
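Queries like the following (illustrative, not the exact ones used above) show the stuck ALTER in the process list and the wsrep state of each node:
show full processlist;
show global status like 'wsrep_local_state_comment';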
When I ran a DML transaction on node3 against the same table, the other two nodes reported a BF-BF conflict:
node1 log:
Later, when I tried to restart the failed nodes, node3 refused to be a donor, despite the fact that it no longer had any visible NBO ALTER running and it was the only node left:
node3 log:
Therefore, the only way to recover from the deadlock is to restart the whole cluster (bootstrap it from scratch), with the joining nodes able to rejoin via SST only (IST is not possible).
More NBO issues, possibly related, in #PXC-4228.