Server hangs if it cannot connect to cluster members (PXC 8.0)

Description

Hi Team,

STR:
1. Start a 3-node PXC cluster.
2. Kill all members.
3. Put unreachable IP addresses into wsrep_cluster_address on one instance.
4. Start that instance with the unreachable peers (see the sketch after these steps).
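
A minimal sketch of steps 2-4 on one node, assuming a systemd-managed mysqld; the config path, service name, and peer IPs below are hypothetical:

    sudo pkill -9 mysqld                                   # step 2: kill all members
    # step 3: point wsrep_cluster_address at peers that are no longer reachable
    sudo sed -i 's|^wsrep_cluster_address=.*|wsrep_cluster_address=gcomm://192.0.2.87,192.0.2.75|' /etc/my.cnf
    sudo systemctl start mysql                             # step 4: start the instance
    sudo kill -TERM "$(pidof mysqld)"                      # expected: mysqld exits; observed: it hangs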

Expected result:
mysqld should exit on a SIGTERM signal

Current result:
mysqld becomes hung and does not react to SIGTERM;
many connect messages appear in the logs

Environment

None

Activity
Kamil Holubicki April 3, 2023 at 7:03 AM

Hi, we need to document the new variable:

This is a new wsrep_provider_options option:

https://docs.percona.com/percona-xtradb-cluster/8.0/wsrep-provider-index.html

Name: pc.wait_restored_prim_timeout

Command Line: Yes

Config File: Yes

Scope: Global

Dynamic: No

Default Value: PT0S

This variable specifies how long a node waits for the primary component when the cluster restores the primary component from the gvwstate.dat file after an outage. The value uses the ISO 8601 duration format; the default value PT0S means that the node waits infinitely (the original behavior).
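
A minimal sketch of setting it, assuming the option is passed through wsrep_provider_options like the other pc.* provider options; PT60S (wait 60 seconds) is an arbitrary example value:

    # On the command line:
    mysqld --wsrep_provider_options="pc.wait_restored_prim_timeout=PT60S"

    # Or in my.cnf under [mysqld]:
    #   wsrep_provider_options="pc.wait_restored_prim_timeout=PT60S"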

More info HERE

Kamil Holubicki March 28, 2023 at 11:29 AM
Edited

As discussed on slack:

  1. kill -9 works fine

  2. I introduced a new variable, pc.wait_restored_prim_timeout, to specify how long the node is allowed to wait when the view is restored from the gvwstate.dat file (defaults to 0, which means infinity, the current behavior; otherwise the node waits the specified number of seconds).

The fix is only for 8.0. I don't think we need it for 5.7.
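
A quick way to verify the option on a running node (a sketch; it assumes the new option appears in the wsrep_provider_options listing like the existing pc.* options):

    mysql -e "SHOW GLOBAL VARIABLES LIKE 'wsrep_provider_options'\G" \
        | tr ';' '\n' | grep wait_restored_prim_timeout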

Mykola Marzhan May 6, 2020 at 1:08 PM

Thank you a lot for the explanation. I have updated the "Expected result" in the description.

Slava Sarzhan May 6, 2020 at 12:54 PM

Thank you a lot for analysing the logs and for the detailed explanation. Everything is OK from the PXC side in this case, but I saw one interesting thing during my testing. When the node tries to connect to old members using the old IPs, it does not respond to the SIGTERM signal. I have tried to send 'kill -15 <process pid>'; even 'kill -9' does not help. I cannot find any useful information in the logs connected with it, just connect messages.

This is not good, because when we face such a situation we cannot even terminate PXC (or the pod with PXC). Please share your thoughts on it.

Thank you.

Marcelo Altmann May 5, 2020 at 6:48 PM
Edited

Hi .

Here is my analysis of the logs:

Node0

 

Node0 started; it was able to recover the view from the gvwstate.dat file and tried to connect to the old members using the IPs specified in wsrep_cluster_address, .87 and .75. Nodes .87 and .75 had changed their IPs and node0 had no way to know it, so it waits until both join the cluster and connect to it (node0).

Node1

 

Same as node0: it was waiting until the other nodes at IPs .77 and .75 become available OR another node connects to it.

Node2

 

The same story with node2: it still has the old IPs, which are not valid anymore, and it is unable to reconnect to the other two nodes. However, after some time node2 restarts:

At this point, node2 starts with the correct IPs of the other two nodes and connects to the nodes that are waiting to form a cluster. With node2 connected to the cluster, the view gets updated, so node1 can see node0 and vice versa (they forget the old IPs and use the updated ones passed by node2).
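
A sketch of what node2's second start effectively did: one node restarting with the peers' current addresses (hypothetical here) is enough to unblock the waiting nodes:

    # node2 comes back with the peers' current IPs in wsrep_cluster_address;
    # the waiting nodes then learn the new addresses from the updated view.
    mysqld --wsrep_cluster_address="gcomm://192.0.2.101,192.0.2.102"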

 

I don't see anything wrong in the cluster behavior; rather, this is an issue on the operator side. Without at least one node using the correct IPs of the other nodes, it is impossible for the cluster to restart.

What happened on node2 that caused it to change wsrep_cluster_address on the second restart?


Details

Status: Done
Needs Review: Yes
Needs QA: Yes
Time tracking: 3d 2h logged
Created: March 9, 2020 at 5:54 PM
Updated: March 6, 2024 at 9:44 PM
Resolved: August 1, 2023 at 8:07 AM