Node crash after 60s network failure
Environment
None
Activity
Vadim Tkachenko February 10, 2021 at 1:55 PM
Additional information for this bug.
I am able to reproduce this 100% of the time using the following steps:
Form a 3-node cluster (with the Operator), load sysbench-tpcc data, start the sysbench workload, stop node-1 for 60s.
However, in the following case I am not able to reproduce it:
Form a 1-node cluster, load sysbench-tpcc data, extend the cluster size to 3 (data is copied via SST), start the sysbench workload, stop node-1 for 60s.
In this case node-1 is able to re-join the cluster successfully.
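For reference, extending the cluster size in the second scenario goes through the Operator custom resource; a minimal sketch, assuming the default cluster name cluster1 and the pxc short name for the PerconaXtraDBCluster resource:
kubectl patch pxc cluster1 --type=merge -p '{"spec":{"pxc":{"size":3}}}'
kubectl get pods -w   # wait for cluster1-pxc-1 and cluster1-pxc-2 to start and finish SST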
My sysbench script, for reference:
cat prepare.sh
act=${1:-"prepare"}
port=30406
pwds=mVEpn6gX4Wbmcm8jqF
user=root
mysql -h127.0.0.1 -P$port -uroot -p$pwds -e "CREATE DATABASE sbtest"
./tpcc.lua --mysql-host=127.0.0.1 --mysql-port=$port --mysql-user=$user --mysql-password=$pwds --mysql-db=sbtest --time=3600 --threads=10 --report-interval=1 --tables=10 --scale=100 --use_fk=0 --force_pk=1 $act
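The first argument selects the sysbench-tpcc action (tpcc.lua comes from the sysbench-tpcc scripts), so the load and workload phases look roughly like this; a sketch, and note the CREATE DATABASE line just complains harmlessly on the second invocation:
./prepare.sh prepare   # load the sysbench-tpcc data set
./prepare.sh run       # start the workload used during the chaos window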
Vadim Tkachenko February 8, 2021 at 6:56 PM
This seems to be a duplicate of https://jira.percona.com/browse/PXC-3437,
but hopefully it contains more information for reproduction.
Vadim Tkachenko February 8, 2021 at 6:54 PM
I am using PXC Operator 1.7.0 with an all-default deployment.
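For completeness, this is roughly the standard all-default install; a sketch, assuming a checkout of the percona-xtradb-cluster-operator repository at tag v1.7.0 with its usual deploy/ manifests:
kubectl apply -f deploy/bundle.yaml   # CRDs, RBAC, and the operator deployment
kubectl apply -f deploy/cr.yaml       # default cluster1 custom resource (3 PXC pods)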
My chaos definition file:
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: pxc-network-drop
  namespace: chaos-testing
spec:
  action: partition # the specific chaos action to inject
  mode: one # the mode to run chaos action; supported modes are one/all/fixed/fixed-percent/random-max-percent
  selector: # pods where to inject chaos actions
    pods:
      pxc: # namespace of the target pods
        - cluster1-pxc-1
  direction: to
  target:
    selector:
      pods:
        pxc: # namespace of the target pods
          - cluster1-pxc-0
    mode: one
  duration: "60s"
  scheduler: # scheduler rules for the running time of the chaos experiments about pods.
    cron: "@every 1000s"
---
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: pxc-network-drop2
  namespace: chaos-testing
spec:
  action: partition # the specific chaos action to inject
  mode: one # the mode to run chaos action; supported modes are one/all/fixed/fixed-percent/random-max-percent
  selector: # pods where to inject chaos actions
    pods:
      pxc: # namespace of the target pods
        - cluster1-pxc-1
  direction: to
  target:
    selector:
      pods:
        pxc: # namespace of the target pods
          - cluster1-pxc-2
    mode: one
  duration: "60s"
  scheduler: # scheduler rules for the running time of the chaos experiments about pods.
    cron: "@every 1000s"
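Both experiments are applied with kubectl; a sketch, assuming the two manifests above are saved together in one file (the file name is illustrative):
kubectl apply -f pxc-network-drop.yaml
kubectl get networkchaos -n chaos-testing   # confirm both experiments are created and scheduled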
Done
Details
Assignee
Hrvoje Matijakovic
Reporter
Vadim Tkachenko
Labels
Fix versions
Affects versions
Priority
Created February 8, 2021 at 6:45 PM
Updated 5 days ago
Resolved February 12, 2025 at 12:58 PM
Description
I am not sure whether this is PXC- or Operator-related, but I am continuing some work with chaos testing, and in this case I emulate a 60s network failure for Node1 while the load continues on Node0 and Node2.
My expectation is that after 60s Node1 will rejoin the cluster and perform an SST or IST.
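For reference, whether and how the node rejoins can be checked from inside the pod; a minimal sketch, assuming the pxc namespace used in the chaos selector above, a main container named pxc, and the root password from the sysbench script:
kubectl exec -n pxc cluster1-pxc-1 -c pxc -- mysql -uroot -pmVEpn6gX4Wbmcm8jqF -e "SHOW GLOBAL STATUS LIKE 'wsrep_local_state_comment'"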
Unfortunately, I see a node crash while it is trying to join:
[39] pxc.cluster1-pxc-1.mysqld-error.log: [1612809679.602456999, {"log"=>"2021-02-08T18:41:19.602441Z 2 [Note] [MY-000000] [Galera] Drain monitors from 487625 up to 487625"}]
[40] pxc.cluster1-pxc-1.mysqld-error.log: [1612809679.602509547, {"log"=>"2021-02-08T18:41:19.602478Z 2 [Note] [MY-000000] [Galera] ####### My UUID: b0b7e122-662e-11eb-9741-b6ee524cd6a1"}]
[41] pxc.cluster1-pxc-1.mysqld-error.log: [1612809679.602548415, {"log"=>"2021-02-08T18:41:19.602518Z 2 [Note] [MY-000000] [Galera] Cert index reset to 00000000-0000-0000-0000-000000000000:-1 (proto: 10), state transfer needed: yes"}]
[42] pxc.cluster1-pxc-1.mysqld-error.log: [1612809679.605717363, {"log"=>"2021-02-08T18:41:19.605572Z 0 [Note] [MY-000000] [Galera] Service thread queue flushed."}]
[43] pxc.cluster1-pxc-1.mysqld-error.log: [1612809679.605889176, {"log"=>"2021-02-08T18:41:19.605843Z 2 [Note] [MY-000000] [Galera] ####### Assign initial position for certification: 00000000-0000-0000-0000-000000000000:-1, protocol version: -1"}]
[44] pxc.cluster1-pxc-1.mysqld-error.log: [1612809679.605908736, {"log"=>"2021-02-08T18:41:19.605888Z 2 [Note] [MY-000000] [Galera] State transfer required: "}]
[45] pxc.cluster1-pxc-1.mysqld-error.log: [1612809679.605909581, {"log"=>" Group state: 9067b9d6-662d-11eb-aca6-b77c1ba705f3:528957"}]
[46] pxc.cluster1-pxc-1.mysqld-error.log: [1612809679.605910594, {"log"=>" Local state: 9067b9d6-662d-11eb-aca6-b77c1ba705f3:487625"}]
[47] pxc.cluster1-pxc-1.mysqld-error.log: [1612809679.605950472, {"log"=>"2021-02-08T18:41:19.605920Z 2 [Note] [MY-000000] [WSREP] Server status change connected -> joiner"}]
[48] pxc.cluster1-pxc-1.mysqld-error.log: [1612809679.606019772, {"log"=>"2021-02-08T18:41:19.605982Z 2 [Note] [MY-000000] [WSREP] wsrep_notify_cmd is not defined, skipping notification."}]
[49] pxc.cluster1-pxc-1.mysqld-error.log: [1612809679.606037845, {"log"=>"2021-02-08T18:41:19.606014Z 2 [Note] [MY-000000] [WSREP] WSREP: You have configured 'xtrabackup-v2' state snapshot transfer method which cannot be performed on a running server. Wsrep provider won't be able to fall back to it if other means of state transfer are unavailable. In that case you will need to restart the server."}]
[50] pxc.cluster1-pxc-1.mysqld-error.log: [1612809679.606049301, {"log"=>"2021-02-08T18:41:19.606036Z 2 [Note] [MY-000000] [Galera] Check if state gap can be serviced using IST"}]
[51] pxc.cluster1-pxc-1.mysqld-error.log: [1612809679.606094865, {"log"=>"2021-02-08T18:41:19.606062Z 2 [Note] [MY-000000] [Galera] ####### IST uuid:9067b9d6-662d-11eb-aca6-b77c1ba705f3 f: 487626, l: 528957, STRv: 3"}]
[52] pxc.cluster1-pxc-1.mysqld-error.log: [1612809679.606259226, {"log"=>"2021-02-08T18:41:19.606220Z 2 [Note] [MY-000000] [Galera] IST receiver addr using ssl://192.168.61.152:4568"}]
[53] pxc.cluster1-pxc-1.mysqld-error.log: [1612809679.606397233, {"log"=>"2021-02-08T18:41:19.606363Z 2 [Note] [MY-000000] [Galera] IST receiver using ssl"}]
[54] pxc.cluster1-pxc-1.mysqld-error.log: [1612809679.607384835, {"log"=>"2021-02-08T18:41:19.607242Z 2 [Note] [MY-000000] [Galera] Prepared IST receiver for 487626-528957, listening at: ssl://192.168.61.152:4568"}]
[55] pxc.cluster1-pxc-1.mysqld-error.log: [1612809679.608732564, {"log"=>"2021-02-08T18:41:19.608602Z 0 [Note] [MY-000000] [Galera] Member 2.0 (cluster1-pxc-1) requested state transfer from 'cluster1-pxc-1,'. Selected 0.0 (cluster1-pxc-2)(SYNCED) as donor."}]
[56] pxc.cluster1-pxc-1.mysqld-error.log: [1612809679.608736699, {"log"=>"2021-02-08T18:41:19.608668Z 0 [Note] [MY-000000] [Galera] Shifting PRIMARY -> JOINER (TO: 528957)"}]
[57] pxc.cluster1-pxc-1.mysqld-error.log: [1612809679.608870727, {"log"=>"2021-02-08T18:41:19.608805Z 2 [Note] [MY-000000] [Galera] Requesting state transfer: success, donor: 0"}]
[58] pxc.cluster1-pxc-1.mysqld-error.log: [1612809679.608897138, {"log"=>"2021-02-08T18:41:19.608871Z 2 [Note] [MY-000000] [Galera] Receiving IST: 41332 writesets, seqnos 487626-528957"}]
[59] pxc.cluster1-pxc-1.mysqld-error.log: [1612809679.616985619, {"log"=>"2021-02-08T18:41:19.616831Z 0 [Warning] [MY-000000] [Galera] 0.0 (cluster1-pxc-2): State transfer to 2.0 (cluster1-pxc-1) failed: -61 (No data available)"}]
[60] pxc.cluster1-pxc-1.mysqld-error.log: [1612809679.616990255, {"log"=>"2021-02-08T18:41:19.616917Z 0 [ERROR] [MY-000000] [Galera] gcs/src/gcs_group.cpp:gcs_group_handle_join_msg():1215: State transfer request failed unrecoverably because the donor seqno had gone forward during IST, but SST request was not prepared from our side due to selected state transfer method (which do not supports SST during node operation). Restart required."}]
[61] pxc.cluster1-pxc-1.mysqld-error.log: [1612809679.616991959, {"log"=>"2021-02-08T18:41:19.616948Z 0 [Note] [MY-000000] [Galera] gcomm: terminating thread"}]
[62] pxc.cluster1-pxc-1.mysqld-error.log: [1612809679.617027015, {"log"=>"2021-02-08T18:41:19.616978Z 0 [Note] [MY-000000] [Galera] gcomm: joining thread"}]
[63] pxc.cluster1-pxc-1.mysqld-error.log: [1612809679.617259230, {"log"=>"2021-02-08T18:41:19.617220Z 0 [Note] [MY-000000] [Galera] gcomm: closing backend"}]
[64] pxc.cluster1-pxc-1.mysqld-error.log: [1612809679.618285411, {"log"=>"2021-02-08T18:41:19.618158Z 0 [Note] [MY-000000] [Galera] Current view of cluster as seen by this node"}]
[65] pxc.cluster1-pxc-1.mysqld-error.log: [1612809679.618289375, {"log"=>"view (view_id(NON_PRIM,7d97a7f5-85ba,47)"}]
[66] pxc.cluster1-pxc-1.mysqld-error.log: [1612809679.618290235, {"log"=>"memb {"}]
[67] pxc.cluster1-pxc-1.mysqld-error.log: [1612809679.618291148, {"log"=>" b0b7e122-9741,0"}]
[68] pxc.cluster1-pxc-1.mysqld-error.log: [1612809679.618291815, {"log"=>" }"}]
[69] pxc.cluster1-pxc-1.mysqld-error.log: [1612809679.618292404, {"log"=>"joined {"}]
[70] pxc.cluster1-pxc-1.mysqld-error.log: [1612809679.618293188, {"log"=>" }"}]
[71] pxc.cluster1-pxc-1.mysqld-error.log: [1612809679.618293895, {"log"=>"left {"}]
[72] pxc.cluster1-pxc-1.mysqld-error.log: [1612809679.618294716, {"log"=>" }"}]
[73] pxc.cluster1-pxc-1.mysqld-error.log: [1612809679.618295428, {"log"=>"partitioned {"}]
[74] pxc.cluster1-pxc-1.mysqld-error.log: [1612809679.618296129, {"log"=>" 7d97a7f5-85ba,0"}]
[75] pxc.cluster1-pxc-1.mysqld-error.log: [1612809679.618296758, {"log"=>" a5968e56-89ee,0"}]
[76] pxc.cluster1-pxc-1.mysqld-error.log: [1612809679.618297395, {"log"=>" }"}]
[77] pxc.cluster1-pxc-1.mysqld-error.log: [1612809679.618298322, {"log"=>")"}]
[78] pxc.cluster1-pxc-1.mysqld-error.log: [1612809679.618299215, {"log"=>"2021-02-08T18:41:19.618260Z 0 [Note] [MY-000000] [Galera] PC protocol downgrade 1 -> 0"}]
[79] pxc.cluster1-pxc-1.mysqld-error.log: [1612809679.618337791, {"log"=>"2021-02-08T18:41:19.618289Z 0 [Note] [MY-000000] [Galera] Current view of cluster as seen by this node"}]
[80] pxc.cluster1-pxc-1.mysqld-error.log: [1612809679.618338804, {"log"=>"view ((empty))"}]
[81] pxc.cluster1-pxc-1.mysqld-error.log: [1612809679.618919697, {"log"=>"2021-02-08T18:41:19.618863Z 0 [Note] [MY-000000] [Galera] gcomm: closed"}]
[82] pxc.cluster1-pxc-1.mysqld-error.log: [1612809679.618964759, {"log"=>"2021-02-08T18:41:19.618927Z 0 [Note] [MY-000000] [Galera] mysqld: Terminated."}]
[83] pxc.cluster1-pxc-1.mysqld-error.log: [1612809679.618983532, {"log"=>"2021-02-08T18:41:19.618959Z 0 [Note] [MY-000000] [WSREP] Initiating SST cancellation"}]
[84] pxc.cluster1-pxc-1.mysqld-error.log: [1612809679.619027112, {"log"=>"18:41:19 UTC - mysqld got signal 11 ;"}]
[85] pxc.cluster1-pxc-1.mysqld-error.log: [1612809679.619028348, {"log"=>"Most likely, you have hit a bug, but this error can also be caused by malfunctioning hardware."}]
[86] pxc.cluster1-pxc-1.mysqld-error.log: [1612809679.619206388, {"log"=>"Build ID: 5a2199b1784b967a713a3bde8d996dc517c41adb"}]
[87] pxc.cluster1-pxc-1.mysqld-error.log: [1612809679.619224432, {"log"=>"Server Version: 8.0.21-12.1 Percona XtraDB Cluster (GPL), Release rel12, Revision 4d973e2, WSREP version 26.4.3, wsrep_26.4.3"}]
[88] pxc.cluster1-pxc-1.mysqld-error.log: [1612809679.619225648, {"log"=>"Thread pointer: 0x0"}]
[89] pxc.cluster1-pxc-1.mysqld-error.log: [1612809679.619226464, {"log"=>"Attempting backtrace. You can use the following information to find out"}]
[90] pxc.cluster1-pxc-1.mysqld-error.log: [1612809679.619227504, {"log"=>"where mysqld died. If you see no messages after this, something went"}]
[91] pxc.cluster1-pxc-1.mysqld-error.log: [1612809679.619228337, {"log"=>"terribly wrong..."}]
[92] pxc.cluster1-pxc-1.mysqld-error.log: [1612809679.619369012, {"log"=>"stack_bottom = 0 thread_stack 0x46000"}]
[93] pxc.cluster1-pxc-1.mysqld-error.log: [1612809679.624153696, {"log"=>"/usr/sbin/mysqld(my_print_stacktrace(unsigned char const*, unsigned long)+0x41) [0x20b4cf1]"}]
[94] pxc.cluster1-pxc-1.mysqld-error.log: [1612809679.624157652, {"log"=>"/usr/sbin/mysqld(handle_fatal_signal+0x3c3) [0x128c8e3]"}]
[95] pxc.cluster1-pxc-1.mysqld-error.log: [1612809679.624158563, {"log"=>"/lib64/libpthread.so.0(+0x12b20) [0x7f5ba752fb20]"}]
[96] pxc.cluster1-pxc-1.mysqld-error.log: [1612809679.624159325, {"log"=>"/lib64/libc.so.6(abort+0x203) [0x7f5ba51e7d11]"}]
[97] pxc.cluster1-pxc-1.mysqld-error.log: [1612809679.624160112, {"log"=>"/usr/lib64/galera4/libgalera_smm.so(+0x142f5) [0x7f5b9932a2f5]"}]
[98] pxc.cluster1-pxc-1.mysqld-error.log: [1612809679.624160777, {"log"=>"/usr/lib64/galera4/libgalera_smm.so(+0x179788) [0x7f5b9948f788]"}]
[99] pxc.cluster1-pxc-1.mysqld-error.log: [1612809679.624161390, {"log"=>"/usr/lib64/galera4/libgalera_smm.so(+0x181e4f) [0x7f5b99497e4f]"}]
[100] pxc.cluster1-pxc-1.mysqld-error.log: [1612809679.624162012, {"log"=>"/lib64/libpthread.so.0(+0x814a) [0x7f5ba752514a]"}]
[101] pxc.cluster1-pxc-1.mysqld-error.log: [1612809679.624162649, {"log"=>"/lib64/libc.so.6(clone+0x43) [0x7f5ba52c2f23]"}]
[102] pxc.cluster1-pxc-1.mysqld-error.log: [1612809679.624163305, {"log"=>"You may download the Percona XtraDB Cluster operations manual by visiting"}]
[103] pxc.cluster1-pxc-1.mysqld-error.log: [1612809679.624163996, {"log"=>"http://www.percona.com/software/percona-xtradb-cluster/. You may find information"}]
[104] pxc.cluster1-pxc-1.mysqld-error.log: [1612809679.624165036, {"log"=>"in the manual which will help you identify the cause of the crash."}]
[105] pxc.cluster1-pxc-1.mysqld-error.log: [1612809679.624165797, {"log"=>"Writing a core file using lib coredumper"}]
[106] pxc.cluster1-pxc-1.mysqld-error.log: [1612809679.624166410, {"log"=>"PATH: (null)"}]