Unstable network connection hangs the node being partitioned (occasionally)

Description

Repro:
(1) Create nodes 1 and 2 on VM1.
(2) Create node 3 on VM2.
(3) Start sysbench against node 1.
(4) Repeatedly connect and disconnect the network between VM2 and VM1 (I use iptables on VM2 to drop packets).
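The partition toggle in step (4) can be sketched like this (the peer address, and the choice of plain DROP rules in both directions, are assumptions based on my setup; run on VM2 as root):

```shell
#!/bin/bash
# Simulate an unstable link between VM2 and VM1 by dropping all
# packets to/from VM1's address (192.168.86.102 is an assumption).
PEER=192.168.86.102

partition_on() {
    # Drop everything exchanged with the peer, in both directions.
    iptables -A INPUT  -s "$PEER" -j DROP
    iptables -A OUTPUT -d "$PEER" -j DROP
}

partition_off() {
    # Delete the rules added above, restoring connectivity.
    iptables -D INPUT  -s "$PEER" -j DROP
    iptables -D OUTPUT -d "$PEER" -j DROP
}

# Flap the link: e.g. 10s partitioned, 10s connected, a few times.
# for i in 1 2 3 4 5; do partition_on; sleep 10; partition_off; sleep 10; done
```

The flap loop is left commented out so the interval and repetition count can be tuned; the hang does not reproduce on every iteration.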

(I'm running with wsrep-debug=ON.)

The partitioned node never ends up rejoining the cluster.
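To show the node is stuck rather than merely slow, its Galera component state can be polled on the partitioned node; this is a sketch (the helper names and connection details are mine, not from the report), using the standard `wsrep_local_state_comment` status variable:

```shell
#!/bin/bash
# Poll the stuck node's Galera state until it reports Synced.
# Connection details (local socket, privileged account) are assumptions.

# Extract the value column from the tab-separated output of
# `mysql -N -e "SHOW STATUS LIKE 'wsrep_local_state_comment'"`.
parse_state() { awk -F'\t' '/wsrep_local_state_comment/ { print $2 }'; }

wait_synced() {
    while :; do
        state=$(mysql -N -e "SHOW STATUS LIKE 'wsrep_local_state_comment'" | parse_state)
        echo "$(date -u +%FT%TZ) state=${state:-unknown}"
        [ "$state" = "Synced" ] && break
        sleep 5
    done
}

# wait_synced   # on this bug, the state never reaches Synced and this loops forever
```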

2017-01-08T11:04:19.033261Z 0 [Note] WSREP: gcs/src/gcs_group.cpp:group_print_state_debug():574:
Version : 4
Flags : 0x2
Protocols : 0 / 7 / 3
State : SYNCED
Desync count : 0
Prim state : SYNCED
Prim UUID : 31da95d9-d592-11e6-85da-0e294679a07b
Prim seqno : 58
First seqno : 961501
Last seqno : 1007941
Prim JOINED : 2
State UUID : 33ad4072-d592-11e6-aba2-5f5792f79a27
Group UUID : c36d6826-c2de-11e6-9783-5af6c3ff0c34
Name : 'node2'
Incoming addr: '192.168.86.102:4200'

2017-01-08T11:04:19.033295Z 0 [Note] [Debug] WSREP: gcs/src/gcs_group.cpp:group_post_state_exchange():288: STATE EXCHANGE: 33ad4072-d592-11e6-aba2-5f5792f79a27 complete.
2017-01-08T11:04:19.033326Z 0 [Note] [Debug] WSREP: gcs/src/gcs_node.cpp:gcs_node_update_status():150: #281,#298 Carry over last prim state for node1: SYNCED
2017-01-08T11:04:19.033355Z 0 [Note] [Debug] WSREP: gcs/src/gcs_node.cpp:gcs_node_update_status():150: #281,#298 Carry over last prim state for node2: SYNCED
2017-01-08T11:04:19.033384Z 0 [Note] WSREP: gcs/src/gcs_group.cpp:group_post_state_exchange():369: Quorum results:
version = 4,
component = PRIMARY,
conf_id = 58,
members = 2/3 (joined/total),
act_id = 1007941,
last_appl. = 1007512,
protocols = 0/7/3 (gcs/repl/appl),
group UUID = c36d6826-c2de-11e6-9783-5af6c3ff0c34
2017-01-08T11:04:19.033418Z 0 [Note] WSREP: gcs/src/gcs.cpp:_set_fc_limits():761: Flow-control interval: [28, 28]
2017-01-08T11:04:19.033446Z 0 [Note] [Debug] WSREP: gcs/src/gcs_sm.hpp:gcs_sm_continue():378: Trying to continue unpaused monitor
2017-01-08T11:04:19.033473Z 0 [Note] WSREP: gcs/src/gcs.cpp:gcs_shift_state():561: Shifting OPEN -> PRIMARY (TO: 1007941)
2017-01-08T11:04:19.523065Z 0 [Note] [Debug] WSREP: /gmcast.cpp:check_liveness():1137: continuing relaying for PT2.49997S
2017-01-08T11:04:19.523591Z 0 [Note] [Debug] WSREP: /gmcast.cpp:check_liveness():1137: continuing relaying for PT2.50032S
2017-01-08T11:04:19.820527Z 0 [Note] [Debug] WSREP: gcs/src/gcs_group.cpp:gcs_group_handle_last_msg():649: New COMMIT CUT 1008024 after 1008024 from 1
2017-01-08T11:04:20.023793Z 0 [Note] [Debug] WSREP: /gmcast.cpp:check_liveness():1137: continuing relaying for PT1.99924S
2017-01-08T11:04:20.023915Z 0 [Note] [Debug] WSREP: /gmcast.cpp:check_liveness():1137: continuing relaying for PT1.99959S
2017-01-08T11:04:20.524273Z 0 [Note] [Debug] WSREP: /gmcast.cpp:check_liveness():1137: continuing relaying for PT1.49876S
2017-01-08T11:04:20.524394Z 0 [Note] [Debug] WSREP: /gmcast.cpp:check_liveness():1137: continuing relaying for PT1.49911S
2017-01-08T11:04:21.024414Z 0 [Note] [Debug] WSREP: /gmcast.cpp:check_liveness():1137: continuing relaying for PT0.998608S
2017-01-08T11:04:21.024533Z 0 [Note] [Debug] WSREP: /gmcast.cpp:check_liveness():1137: continuing relaying for PT0.99896S
2017-01-08T11:04:21.079607Z 0 [Note] [Debug] WSREP: gcs/src/gcs_group.cpp:gcs_group_handle_last_msg():649: New COMMIT CUT 1008154 after 1008154 from 2
2017-01-08T11:04:21.524574Z 0 [Note] [Debug] WSREP: /gmcast.cpp:check_liveness():1137: continuing relaying for PT0.498458S
2017-01-08T11:04:21.524789Z 0 [Note] [Debug] WSREP: /gmcast.cpp:check_liveness():1137: continuing relaying for PT0.498809S
2017-01-08T11:04:22.024721Z 0 [Note] WSREP: /gmcast.cpp:check_liveness():1188: (68f6b760, 'tcp://192.168.86.105:5130') turning message relay requesting off
2017-01-08T11:04:22.208464Z 0 [Note] [Debug] WSREP: gcs/src/gcs_group.cpp:gcs_group_handle_last_msg():649: New COMMIT CUT 1008280 after 1008280 from 1
2017-01-08T11:04:22.794606Z 0 [Note] [Debug] WSREP: /asio_tcp.cpp:failed_handler():44: failed handler from read_completion_condition:486 socket 0x7fffe00608a8 67 error asio.system:104 1 state 4
2017-01-08T11:04:22.794707Z 0 [Note] [Debug] WSREP: /asio_tcp.cpp:failed_handler():51: local endpoint tcp://192.168.86.105:46805 remote endpoint tcp://192.168.86.102:4130
2017-01-08T11:04:22.794760Z 0 [Note] [Debug] WSREP: /asio_tcp.cpp:failed_handler():44: failed handler from read_handler:393 socket 0x7fffe00608a8 67 error asio.system:104 1 state 3
2017-01-08T11:04:22.794782Z 0 [Note] [Debug] WSREP: /asio_tcp.cpp:failed_handler():51: local endpoint tcp://192.168.86.105:46805 remote endpoint tcp://192.168.86.102:4130
2017-01-08T11:04:22.794798Z 0 [Note] [Debug] WSREP: /asio_tcp.cpp:~AsioTcpSocket():32: dtor for 0x7fffe00608a8
2017-01-08T11:04:23.404198Z 0 [Note] [Debug] WSREP: gcs/src/gcs_group.cpp:gcs_group_handle_last_msg():649: New COMMIT CUT 1008408 after 1008408 from 1
2017-01-08T11:04:24.600258Z 0 [Note] [Debug] WSREP: gcs/src/gcs_group.cpp:gcs_group_handle_last_msg():649: New COMMIT CUT 1008537 after 1008537 from 1

(After this, the log shows only a repeating series of the "New COMMIT CUT" messages.)

Environment

None

Activity

Julia Vural March 4, 2025 at 9:28 PM

It appears that this issue is no longer being worked on, so we are closing it for housekeeping purposes. If you believe the issue still exists, please open a new ticket after confirming it's present in the latest release.

Krunal Bauskar March 13, 2017 at 6:36 AM

Can you reproduce this with upstream? If yes, please file a bug with this test case.

Won't Do

Details

Created January 8, 2017 at 11:16 AM
Updated March 4, 2025 at 9:28 PM
Resolved March 4, 2025 at 9:28 PM