Node crash after 60s network failure

Description

I am not sure whether this is PXC or Operator related. While continuing my Chaos Testing work, I emulated a network failure for 60s on Node1 while keeping the load running on Node0 and Node2.

My expectation is that after 60s Node1 will rejoin the cluster and perform SST or IST.

 

Unfortunately, I see the node crash while it is trying to join.
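
For context on the SST-or-IST expectation, here is a simplified sketch of how the choice is usually reasoned about. It is not Galera's actual code. The function and parameter names are hypothetical. The inputs are assumed to come from the joiner's grastate.dat (uuid, seqno) and from the donor's wsrep_local_cached_downto status variable.

```python
def expected_state_transfer(joiner_uuid: str,
                            joiner_seqno: int,
                            cluster_uuid: str,
                            donor_cached_downto: int) -> str:
    """Return "IST" or "SST" under a simplified model (an assumption, not Galera code).

    joiner_uuid / joiner_seqno: state recorded by the rejoining node.
    cluster_uuid: state UUID of the running cluster.
    donor_cached_downto: lowest seqno still held in the donor's gcache.
    """
    # A different state UUID means the node's history is unrelated: full copy.
    if joiner_uuid != cluster_uuid:
        return "SST"
    # seqno -1 means the node does not know its own position (e.g. unclean stop).
    if joiner_seqno < 0:
        return "SST"
    # IST is only possible if the donor still caches the first missing write-set.
    if joiner_seqno + 1 >= donor_cached_downto:
        return "IST"
    return "SST"


if __name__ == "__main__":
    # Hypothetical numbers: 60s of load that still fits in the donor's gcache.
    print(expected_state_transfer("abc", 1000, "abc", 900))
```

Under this model, a 60s outage would end in IST only if the donor's gcache still covers everything written during the outage; otherwise a full SST is expected.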

 

Environment

None

Activity

Vadim Tkachenko February 10, 2021 at 1:55 PM

Additional information for this bug.

I am able to reproduce this 100% of the time using the following steps:

 

Form 3 node cluster (with Operator), load sysbench-tpcc data, start sysbench workload, stop node-1 for 60s.
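
For anyone trying to reproduce the "stop node-1 for 60s" step, the sketch below builds one possible network-partition manifest. It is not the chaos definition used in this report. It assumes Chaos Mesh as the chaos tool, with field names following the NetworkChaos schema as I understand it, so check them against the installed version. The pod names and namespace are hypothetical placeholders.

```python
import json


def network_partition_manifest(pod: str, peers: list, namespace: str,
                               duration: str = "60s") -> dict:
    """Build a NetworkChaos object that cuts `pod` off from `peers`.

    Assumption: Chaos Mesh is installed; all names passed in are placeholders.
    """
    return {
        "apiVersion": "chaos-mesh.org/v1alpha1",
        "kind": "NetworkChaos",
        "metadata": {"name": f"partition-{pod}", "namespace": namespace},
        "spec": {
            "action": "partition",          # drop traffic between the two sides
            "mode": "all",
            "selector": {"pods": {namespace: [pod]}},
            "direction": "both",
            "target": {
                "mode": "all",
                "selector": {"pods": {namespace: list(peers)}},
            },
            "duration": duration,           # how long the partition lasts
        },
    }


if __name__ == "__main__":
    # Hypothetical pod names for a 3-node cluster in the "pxc" namespace.
    manifest = network_partition_manifest(
        "cluster1-pxc-1", ["cluster1-pxc-0", "cluster1-pxc-2"], "pxc")
    # kubectl accepts JSON as well as YAML, so no YAML library is needed.
    print(json.dumps(manifest, indent=2))
```

The printed JSON could be saved to a file and applied with kubectl apply -f while the sysbench workload is running.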

 

However, I am not able to reproduce it in the following case:

Form 1 node cluster, load sysbench-tpcc data, extend cluster size to 3 (data is copied via SST), start sysbench workload, stop node-1 for 60s.

In this case node-1 is able to rejoin the cluster successfully.
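
One way to check "rejoined successfully" consistently across both scenarios is to inspect the node's wsrep status variables. The helper below is a minimal sketch. The function name is hypothetical, and it assumes the status has already been collected into a dict, for example from SHOW GLOBAL STATUS LIKE 'wsrep_%' on node-1.

```python
def has_rejoined(status: dict, expected_size: int = 3) -> bool:
    """Return True if the wsrep status looks like a fully rejoined node.

    `status` maps status variable names to their string values, as returned
    by SHOW GLOBAL STATUS. How it is collected is left to the caller.
    """
    return (
        status.get("wsrep_cluster_status") == "Primary"         # part of the primary component
        and status.get("wsrep_local_state_comment") == "Synced"  # state transfer finished
        and status.get("wsrep_ready") == "ON"                    # node accepts queries
        and int(status.get("wsrep_cluster_size", 0)) == expected_size
    )


if __name__ == "__main__":
    # Hypothetical snapshot of a node that is still receiving a state transfer.
    joining = {
        "wsrep_cluster_status": "Primary",
        "wsrep_local_state_comment": "Joining",
        "wsrep_ready": "OFF",
        "wsrep_cluster_size": "3",
    }
    print(has_rejoined(joining))
```

Polling this after the 60s outage ends would distinguish a node that is still joining from one that never comes back.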

 

My sysbench script, for reference:

 

 

 

Vadim Tkachenko February 8, 2021 at 6:56 PM

This seems to be a duplicate of https://jira.percona.com/browse/PXC-3437, but it hopefully contains more information for reproduction.

Vadim Tkachenko February 8, 2021 at 6:54 PM

I am using PXC Operator 1.7.0 with all default deployments.

my chaos definition file:

 

Done

Details

Created February 8, 2021 at 6:45 PM
Updated 4 days ago
Resolved February 12, 2025 at 12:58 PM