Aggressive network outages on one node make the whole cluster unusable

Description

Now chaos testing with more aggressive settings.

I have three nodes: 0, 1, and 2.

The workload (sysbench) is put on node-0, while node-1 is configured with frequent network disconnects.

Every 30 seconds, node-1 loses network connectivity for 20 seconds.
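The outage schedule above can be sketched as a small shell loop. The interface name and the use of iptables are my assumptions, not details from the report:

```shell
#!/bin/sh
# Sketch of the chaos schedule: every 30 s, node-1 loses network
# connectivity for 20 s. IFACE and the iptables approach are assumptions.
IFACE=${IFACE:-eth0}
RUN=${RUN:-iptables}   # set RUN=echo to print the commands instead of running them

outage_cycle() {
    $RUN -A INPUT  -i "$IFACE" -j DROP    # start the outage
    $RUN -A OUTPUT -o "$IFACE" -j DROP
    sleep "${OUTAGE:-20}"                 # 20 s without connectivity
    $RUN -D INPUT  -i "$IFACE" -j DROP    # restore connectivity
    $RUN -D OUTPUT -o "$IFACE" -j DROP
    sleep "${NORMAL:-10}"                 # remainder of the 30 s period
}

if [ "${1:-}" = "loop" ]; then
    while true; do outage_cycle; done
fi
```

Running it with `RUN=echo OUTAGE=0 NORMAL=0` prints the rule changes without needing root.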

My expectation is that while node-1 struggles, it should not affect the whole cluster.

However, very soon the application gets stuck.

Here is the processlist:

 

There is no progress in the running threads.

 

I will not provide logs, as they are very extensive.

Environment

None

Smart Checklist

Activity

Kamil Holubicki April 16, 2021 at 9:21 AM
Edited

I was able to reproduce this issue in a local environment (Ubuntu 18.04). Setting it up may be a bit of a hassle, but the test itself is rather straightforward.

Host network interfaces:

127.0.0.1

192.168.254.109

node1:

uses 127.0.0.1 for cluster communication (ports 4000, 4050, 4051)

pc.weight=3 to avoid split brain.

node2:

uses 127.0.0.1 for cluster communication (ports 5000, 5050, 5051)

wsrep_slave_threads = 8 (but it happens with the default 1 as well)
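The my.cnf fragments themselves are not included in the report. A sketch of what the layout above might map to, assuming 4000/5000 are the gmcast ports, 4050/5050 the SST receive addresses, and 4051/5051 the IST receive addresses:

```ini
# node1 (sketch only; the port-to-role mapping is an assumption)
[mysqld]
wsrep_node_address=127.0.0.1
wsrep_provider_options="gmcast.listen_addr=tcp://127.0.0.1:4000;ist.recv_addr=127.0.0.1:4051;pc.weight=3"
wsrep_sst_receive_address=127.0.0.1:4050

# node2
[mysqld]
wsrep_node_address=127.0.0.1
wsrep_provider_options="gmcast.listen_addr=tcp://127.0.0.1:5000;ist.recv_addr=127.0.0.1:5051"
wsrep_sst_receive_address=127.0.0.1:5050
wsrep_slave_threads=8
```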

sysbench:

connects to node1 using 192.168.254.109 interface

 

Break connectivity on 127.0.0.1

Bring it back 

I have the above in a shape.sh script and can control it like:

shape.sh start

shape.sh stop
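The actual shape.sh is not attached; a minimal sketch of what it might look like, assuming 100% packet loss via tc netem on the loopback interface (the cluster communication path above):

```shell
#!/bin/sh
# shape.sh - sketch only; the real script is not included in the report.
# Assumes tc netem with 100% loss on lo breaks 127.0.0.1 traffic.
TC=${TC:-tc}   # set TC=echo for a dry run without root

shape() {
    case "${1:-}" in
        start) $TC qdisc add dev lo root netem loss 100% ;;  # break connectivity
        stop)  $TC qdisc del dev lo root ;;                  # bring it back
        *)     echo "usage: shape.sh start|stop" >&2 ;;
    esac
}

if [ $# -gt 0 ]; then shape "$1"; fi
```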

 

Test steps:

  1. Start node1, node2

  2. Connect to node1 as root using the 127.0.0.1 interface and set up the sysbench schema

  3. Start the sysbench workload on node1

  4. shape.sh start.
    We will observe that sysbench (the donor) has 0 TPS for about 4 seconds and then continues normally

  5. shape.sh stop
    IST starts on node2

  6. While IST is in progress - shape.sh start
    This causes sysbench (the donor) to get stuck permanently at 0 TPS

  7. Kill node 2 (kill -9)

  8. shape.sh start
    node1 is still stuck (forever)

 

 

Kamil Holubicki April 12, 2021 at 3:18 PM

I tried to reproduce using the steps from . Manual triggering of the network break didn't work for me. As discussed on Slack, I modified the SST script in the following way:

 

And the result on node_1 is:

So, not reproduced so far.

Marcelo Altmann February 11, 2021 at 1:46 PM

I'm able to reproduce this issue with the steps below:

Start a 3-node cluster.

Run sysbench on node1:

Remove node2 from the cluster by stopping mysql

Remove node2's datadir to force SST

Start mysql on node2.

Wait until the SST completes and node2 starts preparing it:

Split network on node2:
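The exact split command from this comment is not included in the export. A hedged sketch of one way to cut node2's cluster traffic; port 4567 (Galera's default group-communication port) is an assumption:

```shell
#!/bin/sh
# Sketch only: drop node2's group-communication traffic. The port and
# the iptables approach are assumptions, not taken from the report.
RUN=${RUN:-iptables}   # set RUN=echo for a dry run without root

split_node2() {
    $RUN -A INPUT  -p tcp --dport 4567 -j DROP
    $RUN -A OUTPUT -p tcp --dport 4567 -j DROP
}

if [ "${1:-}" = "run" ]; then split_node2; fi
```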

At this point, node1 will stop processing traffic:

 

I'm using PXC 8.0.22:

Vadim Tkachenko February 9, 2021 at 3:52 PM

My chaos file, for reference:

 

Done

Details

Needs Review

Yes

Time tracking

1w 2d 2h 25m logged, 2h 27m remaining


Created February 9, 2021 at 3:45 PM
Updated March 6, 2024 at 9:15 PM
Resolved May 13, 2021 at 7:56 AM