Aggressive network outages on one node make the whole cluster unusable

Description

Now chaos testing with more aggressive settings.

I have three nodes: 0, 1, and 2.

The workload (sysbench) is put on node-0, while node-1 is configured with frequent network disconnects.

Every 30 seconds, node-1 loses network connectivity for 20 seconds.
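The outage schedule above can be sketched as a small shell loop. The interface name and the use of iptables are my assumptions, not details from the report:

```shell
#!/bin/sh
# Sketch of the chaos schedule: every 30 s, node-1 loses network
# connectivity for 20 s. IFACE and the iptables approach are assumptions.
IFACE=${IFACE:-eth0}
RUN=${RUN:-iptables}   # set RUN=echo to print the commands instead of running them

outage_cycle() {
    $RUN -A INPUT  -i "$IFACE" -j DROP    # start the outage
    $RUN -A OUTPUT -o "$IFACE" -j DROP
    sleep "${OUTAGE:-20}"                 # 20 s without connectivity
    $RUN -D INPUT  -i "$IFACE" -j DROP    # restore connectivity
    $RUN -D OUTPUT -o "$IFACE" -j DROP
    sleep "${NORMAL:-10}"                 # remainder of the 30 s period
}

if [ "${1:-}" = "loop" ]; then
    while true; do outage_cycle; done
fi
```

Running it with `RUN=echo OUTAGE=0 NORMAL=0` prints the rule changes without needing root.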

My expectation is that while node-1 struggles, it should not affect the whole cluster.

However, very soon the application gets stuck.

Here is the processlist:

 

There is no progress in the running threads.

 

I will not provide logs, as they are very extensive.

Environment

None

Smart Checklist

Activity

Kamil Holubicki April 16, 2021 at 9:21 AM
Edited

I was able to reproduce this issue in a local environment (Ubuntu 18.04). Setting it up may be a bit of a hassle, but the test itself is rather straightforward.

Host network interfaces:

127.0.0.1

192.168.254.109

node1:

uses 127.0.0.1 for cluster communication (ports 4000, 4050, 4051)

pc.weight=3 to avoid split brain.

node2:

uses 127.0.0.1 for cluster communication (ports 5000, 5050, 5051)

wsrep_slave_threads = 8 (but it happens with the default 1 as well)
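The my.cnf fragments themselves are not included in the report. A sketch of what the layout above might map to, assuming 4000/5000 are the gmcast ports, 4050/5050 the SST receive addresses, and 4051/5051 the IST receive addresses:

```ini
# node1 (sketch only; the port-to-role mapping is an assumption)
[mysqld]
wsrep_node_address=127.0.0.1
wsrep_provider_options="gmcast.listen_addr=tcp://127.0.0.1:4000;ist.recv_addr=127.0.0.1:4051;pc.weight=3"
wsrep_sst_receive_address=127.0.0.1:4050

# node2
[mysqld]
wsrep_node_address=127.0.0.1
wsrep_provider_options="gmcast.listen_addr=tcp://127.0.0.1:5000;ist.recv_addr=127.0.0.1:5051"
wsrep_sst_receive_address=127.0.0.1:5050
wsrep_slave_threads=8
```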

sysbench:

connects to node1 using 192.168.254.109 interface

 

Break connectivity on 127.0.0.1

Bring it back 

I have the above in a shape.sh script and can control it like:

shape.sh start

shape.sh stop
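The actual shape.sh is not attached; a minimal sketch of what it might look like, assuming 100% packet loss via tc netem on the loopback interface (the cluster communication path above):

```shell
#!/bin/sh
# shape.sh - sketch only; the real script is not included in the report.
# Assumes tc netem with 100% loss on lo breaks 127.0.0.1 traffic.
TC=${TC:-tc}   # set TC=echo for a dry run without root

shape() {
    case "${1:-}" in
        start) $TC qdisc add dev lo root netem loss 100% ;;  # break connectivity
        stop)  $TC qdisc del dev lo root ;;                  # bring it back
        *)     echo "usage: shape.sh start|stop" >&2 ;;
    esac
}

if [ $# -gt 0 ]; then shape "$1"; fi
```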

 

Test steps:

  1. Start node1, node2

  2. Connect to node1 as root using the 127.0.0.1 interface and set up the sysbench schema

  3. Start the sysbench workload on node1

  4. shape.sh start.
    We will observe that sysbench (the donor) has 0 TPS for about 4 seconds and then continues normally

  5. shape.sh stop
    IST starts on node2

  6. While IST is in progress - shape.sh start
    This causes sysbench (the donor) to get stuck permanently at 0 TPS

  7. Kill node 2 (kill -9)

  8. shape.sh start
    node1 is still stuck (forever)

 

 

Kamil Holubicki April 12, 2021 at 3:18 PM

I tried to reproduce using the steps from . Manual triggering of the network break didn't work for me. As discussed on Slack, I modified the SST script in the following way:

 

And the result on node_1 is:

So, not reproduced so far.

Marcelo Altmann February 11, 2021 at 1:46 PM

I'm able to reproduce this issue with the steps below:

Start a 3-node cluster.

Run sysbench on node1:

Remove node2 from the cluster by stopping mysql

Remove node2's datadir to force SST

Start mysql on node2.

Wait until the SST completes and node2 starts preparing it:

Split network on node2:
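The exact split command from this comment is not included in the export. A hedged sketch of one way to cut node2's cluster traffic; port 4567 (Galera's default group-communication port) is an assumption:

```shell
#!/bin/sh
# Sketch only: drop node2's group-communication traffic. The port and
# the iptables approach are assumptions, not taken from the report.
RUN=${RUN:-iptables}   # set RUN=echo for a dry run without root

split_node2() {
    $RUN -A INPUT  -p tcp --dport 4567 -j DROP
    $RUN -A OUTPUT -p tcp --dport 4567 -j DROP
}

if [ "${1:-}" = "run" ]; then split_node2; fi
```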

At this point, node1 will stop processing traffic:

 

I'm using PXC 8.0.22:

Vadim Tkachenko February 9, 2021 at 3:52 PM

My chaos file, for reference:

 

Done

Details

Needs Review

Yes

Time tracking

1w 2d 2h 25m logged, 2h 27m remaining


Created February 9, 2021 at 3:45 PM
Updated March 6, 2024 at 9:15 PM
Resolved May 13, 2021 at 7:56 AM