Node crash after 60s network failure
Environment
None
Activity
Vadim Tkachenko February 10, 2021 at 1:55 PM
Additional information for this bug.
I am able to reproduce this 100% of the time using the following steps:
Form a 3-node cluster (with the Operator), load sysbench-tpcc data, start the sysbench workload, stop node-1 for 60s.
However, in the following case I am not able to reproduce it:
Form a 1-node cluster, load sysbench-tpcc data, extend the cluster size to 3 (data is copied via SST), start the sysbench workload, stop node-1 for 60s.
In this case node-1 is able to re-join the cluster successfully.
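For reference, extending the cluster size in the second scenario goes through the Operator custom resource; a minimal sketch, assuming the default cluster name cluster1 and the pxc short name for the PerconaXtraDBCluster resource:
kubectl patch pxc cluster1 --type=merge -p '{"spec":{"pxc":{"size":3}}}'
kubectl get pods -w   # wait for cluster1-pxc-1 and cluster1-pxc-2 to start and finish SST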
My sysbench script, for reference:
cat prepare.sh
act=${1:-"prepare"}
port=30406
pwds=mVEpn6gX4Wbmcm8jqF
user=root
mysql -h127.0.0.1 -P$port -uroot -p$pwds -e "CREATE DATABASE sbtest"
./tpcc.lua --mysql-host=127.0.0.1 --mysql-port=$port --mysql-user=$user --mysql-password=$pwds --mysql-db=sbtest --time=3600 --threads=10 --report-interval=1 --tables=10 --scale=100 --use_fk=0 --force_pk=1 $act
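The first argument selects the sysbench-tpcc action (tpcc.lua comes from the sysbench-tpcc scripts), so the load and workload phases look roughly like this; a sketch, and note the CREATE DATABASE line just complains harmlessly on the second invocation:
./prepare.sh prepare   # load the sysbench-tpcc data set
./prepare.sh run       # start the workload used during the chaos window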
Vadim Tkachenko February 8, 2021 at 6:56 PM
This seems to be a duplicate of https://jira.percona.com/browse/PXC-3437,
but hopefully it contains more information for reproduction.
Vadim Tkachenko February 8, 2021 at 6:54 PM
I am using PXC Operator 1.7.0 with an all-default deployment.
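For completeness, this is roughly the standard all-default install; a sketch, assuming a checkout of the percona-xtradb-cluster-operator repository at tag v1.7.0 with its usual deploy/ manifests:
kubectl apply -f deploy/bundle.yaml   # CRDs, RBAC, and the operator deployment
kubectl apply -f deploy/cr.yaml       # default cluster1 custom resource (3 PXC pods)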
My chaos definition file:
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: pxc-network-drop
  namespace: chaos-testing
spec:
  action: partition # the specific chaos action to inject
  mode: one # the mode to run chaos action; supported modes are one/all/fixed/fixed-percent/random-max-percent
  selector: # pods where to inject chaos actions
    pods:
      pxc: # namespace of the target pods
        - cluster1-pxc-1
  direction: to
  target:
    selector:
      pods:
        pxc: # namespace of the target pods
          - cluster1-pxc-0
    mode: one
  duration: "60s"
  scheduler: # scheduler rules for the running time of the chaos experiments about pods.
    cron: "@every 1000s"
---
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: pxc-network-drop2
  namespace: chaos-testing
spec:
  action: partition # the specific chaos action to inject
  mode: one # the mode to run chaos action; supported modes are one/all/fixed/fixed-percent/random-max-percent
  selector: # pods where to inject chaos actions
    pods:
      pxc: # namespace of the target pods
        - cluster1-pxc-1
  direction: to
  target:
    selector:
      pods:
        pxc: # namespace of the target pods
          - cluster1-pxc-2
    mode: one
  duration: "60s"
  scheduler: # scheduler rules for the running time of the chaos experiments about pods.
    cron: "@every 1000s"
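Both experiments are applied with kubectl; a sketch, assuming the two manifests above are saved together in one file (the file name is illustrative):
kubectl apply -f pxc-network-drop.yaml
kubectl get networkchaos -n chaos-testing   # confirm both experiments are created and scheduled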
Done
Details
Assignee
Hrvoje Matijakovic
Reporter
Vadim Tkachenko
Labels
Fix versions
Affects versions
Priority
Created February 8, 2021 at 6:45 PM
Updated 5 days ago
Resolved February 12, 2025 at 12:58 PM
Description
I am not sure whether this is PXC- or Operator-related, but I am continuing some work with chaos testing, and in this case I emulate a 60s network failure for Node1 while the load continues on Node0 and Node2.
My expectation is that after 60s Node1 will rejoin the cluster and perform an SST or IST.
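For reference, whether and how the node rejoins can be checked from inside the pod; a minimal sketch, assuming the pxc namespace used in the chaos selector above, a main container named pxc, and the root password from the sysbench script:
kubectl exec -n pxc cluster1-pxc-1 -c pxc -- mysql -uroot -pmVEpn6gX4Wbmcm8jqF -e "SHOW GLOBAL STATUS LIKE 'wsrep_local_state_comment'"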
Unfortunately, I see a node crash while it is trying to join:
[39] pxc.cluster1-pxc-1.mysqld-error.log: [1612809679.602456999, {"log"=>"2021-02-08T18:41:19.602441Z 2 [Note] [MY-000000] [Galera] Drain monitors from 487625 up to 487625"}]
[40] pxc.cluster1-pxc-1.mysqld-error.log: [1612809679.602509547, {"log"=>"2021-02-08T18:41:19.602478Z 2 [Note] [MY-000000] [Galera] ####### My UUID: b0b7e122-662e-11eb-9741-b6ee524cd6a1"}]
[41] pxc.cluster1-pxc-1.mysqld-error.log: [1612809679.602548415, {"log"=>"2021-02-08T18:41:19.602518Z 2 [Note] [MY-000000] [Galera] Cert index reset to 00000000-0000-0000-0000-000000000000:-1 (proto: 10), state transfer needed: yes"}]
[42] pxc.cluster1-pxc-1.mysqld-error.log: [1612809679.605717363, {"log"=>"2021-02-08T18:41:19.605572Z 0 [Note] [MY-000000] [Galera] Service thread queue flushed."}]
[43] pxc.cluster1-pxc-1.mysqld-error.log: [1612809679.605889176, {"log"=>"2021-02-08T18:41:19.605843Z 2 [Note] [MY-000000] [Galera] ####### Assign initial position for certification: 00000000-0000-0000-0000-000000000000:-1, protocol version: -1"}]
[44] pxc.cluster1-pxc-1.mysqld-error.log: [1612809679.605908736, {"log"=>"2021-02-08T18:41:19.605888Z 2 [Note] [MY-000000] [Galera] State transfer required: "}]
[45] pxc.cluster1-pxc-1.mysqld-error.log: [1612809679.605909581, {"log"=>" Group state: 9067b9d6-662d-11eb-aca6-b77c1ba705f3:528957"}]
[46] pxc.cluster1-pxc-1.mysqld-error.log: [1612809679.605910594, {"log"=>" Local state: 9067b9d6-662d-11eb-aca6-b77c1ba705f3:487625"}]
[47] pxc.cluster1-pxc-1.mysqld-error.log: [1612809679.605950472, {"log"=>"2021-02-08T18:41:19.605920Z 2 [Note] [MY-000000] [WSREP] Server status change connected -> joiner"}]
[48] pxc.cluster1-pxc-1.mysqld-error.log: [1612809679.606019772, {"log"=>"2021-02-08T18:41:19.605982Z 2 [Note] [MY-000000] [WSREP] wsrep_notify_cmd is not defined, skipping notification."}]
[49] pxc.cluster1-pxc-1.mysqld-error.log: [1612809679.606037845, {"log"=>"2021-02-08T18:41:19.606014Z 2 [Note] [MY-000000] [WSREP] WSREP: You have configured 'xtrabackup-v2' state snapshot transfer method which cannot be performed on a running server. Wsrep provider won't be able to fall back to it if other means of state transfer are unavailable. In that case you will need to restart the server."}]
[50] pxc.cluster1-pxc-1.mysqld-error.log: [1612809679.606049301, {"log"=>"2021-02-08T18:41:19.606036Z 2 [Note] [MY-000000] [Galera] Check if state gap can be serviced using IST"}]
[51] pxc.cluster1-pxc-1.mysqld-error.log: [1612809679.606094865, {"log"=>"2021-02-08T18:41:19.606062Z 2 [Note] [MY-000000] [Galera] ####### IST uuid:9067b9d6-662d-11eb-aca6-b77c1ba705f3 f: 487626, l: 528957, STRv: 3"}]
[52] pxc.cluster1-pxc-1.mysqld-error.log: [1612809679.606259226, {"log"=>"2021-02-08T18:41:19.606220Z 2 [Note] [MY-000000] [Galera] IST receiver addr using ssl://192.168.61.152:4568"}]
[53] pxc.cluster1-pxc-1.mysqld-error.log: [1612809679.606397233, {"log"=>"2021-02-08T18:41:19.606363Z 2 [Note] [MY-000000] [Galera] IST receiver using ssl"}]
[54] pxc.cluster1-pxc-1.mysqld-error.log: [1612809679.607384835, {"log"=>"2021-02-08T18:41:19.607242Z 2 [Note] [MY-000000] [Galera] Prepared IST receiver for 487626-528957, listening at: ssl://192.168.61.152:4568"}]
[55] pxc.cluster1-pxc-1.mysqld-error.log: [1612809679.608732564, {"log"=>"2021-02-08T18:41:19.608602Z 0 [Note] [MY-000000] [Galera] Member 2.0 (cluster1-pxc-1) requested state transfer from 'cluster1-pxc-1,'. Selected 0.0 (cluster1-pxc-2)(SYNCED) as donor."}]
[56] pxc.cluster1-pxc-1.mysqld-error.log: [1612809679.608736699, {"log"=>"2021-02-08T18:41:19.608668Z 0 [Note] [MY-000000] [Galera] Shifting PRIMARY -> JOINER (TO: 528957)"}]
[57] pxc.cluster1-pxc-1.mysqld-error.log: [1612809679.608870727, {"log"=>"2021-02-08T18:41:19.608805Z 2 [Note] [MY-000000] [Galera] Requesting state transfer: success, donor: 0"}]
[58] pxc.cluster1-pxc-1.mysqld-error.log: [1612809679.608897138, {"log"=>"2021-02-08T18:41:19.608871Z 2 [Note] [MY-000000] [Galera] Receiving IST: 41332 writesets, seqnos 487626-528957"}]
[59] pxc.cluster1-pxc-1.mysqld-error.log: [1612809679.616985619, {"log"=>"2021-02-08T18:41:19.616831Z 0 [Warning] [MY-000000] [Galera] 0.0 (cluster1-pxc-2): State transfer to 2.0 (cluster1-pxc-1) failed: -61 (No data available)"}]
[60] pxc.cluster1-pxc-1.mysqld-error.log: [1612809679.616990255, {"log"=>"2021-02-08T18:41:19.616917Z 0 [ERROR] [MY-000000] [Galera] gcs/src/gcs_group.cpp:gcs_group_handle_join_msg():1215: State transfer request failed unrecoverably because the donor seqno had gone forward during IST, but SST request was not prepared from our side due to selected state transfer method (which do not supports SST during node operation). Restart required."}]
[61] pxc.cluster1-pxc-1.mysqld-error.log: [1612809679.616991959, {"log"=>"2021-02-08T18:41:19.616948Z 0 [Note] [MY-000000] [Galera] gcomm: terminating thread"}]
[62] pxc.cluster1-pxc-1.mysqld-error.log: [1612809679.617027015, {"log"=>"2021-02-08T18:41:19.616978Z 0 [Note] [MY-000000] [Galera] gcomm: joining thread"}]
[63] pxc.cluster1-pxc-1.mysqld-error.log: [1612809679.617259230, {"log"=>"2021-02-08T18:41:19.617220Z 0 [Note] [MY-000000] [Galera] gcomm: closing backend"}]
[64] pxc.cluster1-pxc-1.mysqld-error.log: [1612809679.618285411, {"log"=>"2021-02-08T18:41:19.618158Z 0 [Note] [MY-000000] [Galera] Current view of cluster as seen by this node"}]
[65] pxc.cluster1-pxc-1.mysqld-error.log: [1612809679.618289375, {"log"=>"view (view_id(NON_PRIM,7d97a7f5-85ba,47)"}]
[66] pxc.cluster1-pxc-1.mysqld-error.log: [1612809679.618290235, {"log"=>"memb {"}]
[67] pxc.cluster1-pxc-1.mysqld-error.log: [1612809679.618291148, {"log"=>" b0b7e122-9741,0"}]
[68] pxc.cluster1-pxc-1.mysqld-error.log: [1612809679.618291815, {"log"=>" }"}]
[69] pxc.cluster1-pxc-1.mysqld-error.log: [1612809679.618292404, {"log"=>"joined {"}]
[70] pxc.cluster1-pxc-1.mysqld-error.log: [1612809679.618293188, {"log"=>" }"}]
[71] pxc.cluster1-pxc-1.mysqld-error.log: [1612809679.618293895, {"log"=>"left {"}]
[72] pxc.cluster1-pxc-1.mysqld-error.log: [1612809679.618294716, {"log"=>" }"}]
[73] pxc.cluster1-pxc-1.mysqld-error.log: [1612809679.618295428, {"log"=>"partitioned {"}]
[74] pxc.cluster1-pxc-1.mysqld-error.log: [1612809679.618296129, {"log"=>" 7d97a7f5-85ba,0"}]
[75] pxc.cluster1-pxc-1.mysqld-error.log: [1612809679.618296758, {"log"=>" a5968e56-89ee,0"}]
[76] pxc.cluster1-pxc-1.mysqld-error.log: [1612809679.618297395, {"log"=>" }"}]
[77] pxc.cluster1-pxc-1.mysqld-error.log: [1612809679.618298322, {"log"=>")"}]
[78] pxc.cluster1-pxc-1.mysqld-error.log: [1612809679.618299215, {"log"=>"2021-02-08T18:41:19.618260Z 0 [Note] [MY-000000] [Galera] PC protocol downgrade 1 -> 0"}]
[79] pxc.cluster1-pxc-1.mysqld-error.log: [1612809679.618337791, {"log"=>"2021-02-08T18:41:19.618289Z 0 [Note] [MY-000000] [Galera] Current view of cluster as seen by this node"}]
[80] pxc.cluster1-pxc-1.mysqld-error.log: [1612809679.618338804, {"log"=>"view ((empty))"}]
[81] pxc.cluster1-pxc-1.mysqld-error.log: [1612809679.618919697, {"log"=>"2021-02-08T18:41:19.618863Z 0 [Note] [MY-000000] [Galera] gcomm: closed"}]
[82] pxc.cluster1-pxc-1.mysqld-error.log: [1612809679.618964759, {"log"=>"2021-02-08T18:41:19.618927Z 0 [Note] [MY-000000] [Galera] mysqld: Terminated."}]
[83] pxc.cluster1-pxc-1.mysqld-error.log: [1612809679.618983532, {"log"=>"2021-02-08T18:41:19.618959Z 0 [Note] [MY-000000] [WSREP] Initiating SST cancellation"}]
[84] pxc.cluster1-pxc-1.mysqld-error.log: [1612809679.619027112, {"log"=>"18:41:19 UTC - mysqld got signal 11 ;"}]
[85] pxc.cluster1-pxc-1.mysqld-error.log: [1612809679.619028348, {"log"=>"Most likely, you have hit a bug, but this error can also be caused by malfunctioning hardware."}]
[86] pxc.cluster1-pxc-1.mysqld-error.log: [1612809679.619206388, {"log"=>"Build ID: 5a2199b1784b967a713a3bde8d996dc517c41adb"}]
[87] pxc.cluster1-pxc-1.mysqld-error.log: [1612809679.619224432, {"log"=>"Server Version: 8.0.21-12.1 Percona XtraDB Cluster (GPL), Release rel12, Revision 4d973e2, WSREP version 26.4.3, wsrep_26.4.3"}]
[88] pxc.cluster1-pxc-1.mysqld-error.log: [1612809679.619225648, {"log"=>"Thread pointer: 0x0"}]
[89] pxc.cluster1-pxc-1.mysqld-error.log: [1612809679.619226464, {"log"=>"Attempting backtrace. You can use the following information to find out"}]
[90] pxc.cluster1-pxc-1.mysqld-error.log: [1612809679.619227504, {"log"=>"where mysqld died. If you see no messages after this, something went"}]
[91] pxc.cluster1-pxc-1.mysqld-error.log: [1612809679.619228337, {"log"=>"terribly wrong..."}]
[92] pxc.cluster1-pxc-1.mysqld-error.log: [1612809679.619369012, {"log"=>"stack_bottom = 0 thread_stack 0x46000"}]
[93] pxc.cluster1-pxc-1.mysqld-error.log: [1612809679.624153696, {"log"=>"/usr/sbin/mysqld(my_print_stacktrace(unsigned char const*, unsigned long)+0x41) [0x20b4cf1]"}]
[94] pxc.cluster1-pxc-1.mysqld-error.log: [1612809679.624157652, {"log"=>"/usr/sbin/mysqld(handle_fatal_signal+0x3c3) [0x128c8e3]"}]
[95] pxc.cluster1-pxc-1.mysqld-error.log: [1612809679.624158563, {"log"=>"/lib64/libpthread.so.0(+0x12b20) [0x7f5ba752fb20]"}]
[96] pxc.cluster1-pxc-1.mysqld-error.log: [1612809679.624159325, {"log"=>"/lib64/libc.so.6(abort+0x203) [0x7f5ba51e7d11]"}]
[97] pxc.cluster1-pxc-1.mysqld-error.log: [1612809679.624160112, {"log"=>"/usr/lib64/galera4/libgalera_smm.so(+0x142f5) [0x7f5b9932a2f5]"}]
[98] pxc.cluster1-pxc-1.mysqld-error.log: [1612809679.624160777, {"log"=>"/usr/lib64/galera4/libgalera_smm.so(+0x179788) [0x7f5b9948f788]"}]
[99] pxc.cluster1-pxc-1.mysqld-error.log: [1612809679.624161390, {"log"=>"/usr/lib64/galera4/libgalera_smm.so(+0x181e4f) [0x7f5b99497e4f]"}]
[100] pxc.cluster1-pxc-1.mysqld-error.log: [1612809679.624162012, {"log"=>"/lib64/libpthread.so.0(+0x814a) [0x7f5ba752514a]"}]
[101] pxc.cluster1-pxc-1.mysqld-error.log: [1612809679.624162649, {"log"=>"/lib64/libc.so.6(clone+0x43) [0x7f5ba52c2f23]"}]
[102] pxc.cluster1-pxc-1.mysqld-error.log: [1612809679.624163305, {"log"=>"You may download the Percona XtraDB Cluster operations manual by visiting"}]
[103] pxc.cluster1-pxc-1.mysqld-error.log: [1612809679.624163996, {"log"=>"http://www.percona.com/software/percona-xtradb-cluster/. You may find information"}]
[104] pxc.cluster1-pxc-1.mysqld-error.log: [1612809679.624165036, {"log"=>"in the manual which will help you identify the cause of the crash."}]
[105] pxc.cluster1-pxc-1.mysqld-error.log: [1612809679.624165797, {"log"=>"Writing a core file using lib coredumper"}]
[106] pxc.cluster1-pxc-1.mysqld-error.log: [1612809679.624166410, {"log"=>"PATH: (null)"}]