Server hangs up if cannot connect to members PXC 8.0
Activity

Kamil Holubicki April 3, 2023 at 7:03 AM
Hi, we need to document the new variable, a new wsrep_provider_option:
https://docs.percona.com/percona-xtradb-cluster/8.0/wsrep-provider-index.html
Name: pc.wait_restored_prim_timeout
| Option | Description |
|---|---|
| Command Line: | Yes |
| Config File: | Yes |
| Scope: | Global |
| Dynamic: | No |
| Default Value: | PT0S |
This variable specifies how long the node waits for a primary component when the cluster restores the primary component from the gvwstate.dat file after an outage. The default value PT0S means that the node waits indefinitely (the old/original behavior).
More info HERE
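For illustration, a minimal my.cnf sketch of how the option could be set; the PT90S value is only an assumed example (the option appears to use the same PTnS duration format as the PT0S default):

[mysqld]
# Assumed example: wait up to 90 seconds for the primary component restored
# from gvwstate.dat, instead of waiting indefinitely (default PT0S).
# Other provider options already in use go in the same string, separated by semicolons.
wsrep_provider_options="pc.wait_restored_prim_timeout=PT90S"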

Kamil Holubicki March 28, 2023 at 11:29 AM (edited)
As discussed on Slack:
kill -9 works fine.
I introduced a new variable, pc.wait_restored_prim_timeout, to specify the timeout the node is allowed to wait when the view is restored from the gvwstate.dat file (it defaults to 0, which means infinity, i.e. the current behavior; otherwise the node waits the specified number of seconds).
The fix is only for 8.0; I don't think we need it for 5.7 as well.

Mykola Marzhan May 6, 2020 at 1:08 PM
Thank you a lot for the explanation. I have updated "Expected result" in the description.

Slava Sarzhan May 6, 2020 at 12:54 PM
Thank you a lot for the analysis and the detailed explanation of the logs. Everything is OK from the PXC side in this case, but I saw one interesting thing during my testing: when the node tries to connect to the old members using the old IPs, it does not respond to the SIGTERM signal. I tried sending 'kill -15 <process pid>', and even 'kill -9' does not help. I cannot find any useful information in the logs related to it, just connect messages.
This is not good, because when we face such a situation we cannot even terminate PXC (or the pod with PXC). Please share your thoughts on it.
Thank you.
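For reference, a sketch of the signals mentioned above, sent while the node loops on connection attempts (using pidof to find the process is just one way; adjust to your environment):

# SIGTERM: the hung node shows no reaction and keeps retrying the stale peers
kill -15 "$(pidof mysqld)"
# even SIGKILL was reported as not helping in this state
kill -9 "$(pidof mysqld)"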

Marcelo Altmann May 5, 2020 at 6:48 PM (edited)
Hi.
Here is my analysis of the logs:
Node0
Node0 started; it was able to recover the view from the gvwstate.dat file and tried to connect to the old members using the IPs specified in wsrep_cluster_address, .87 and .75. Nodes .87 and .75 had changed their IPs and node0 had no way to know that, so it waited until both joined the cluster and connected to it (node0).
Node1
Same as Node0: it was waiting until the other nodes at IPs .77 and .75 became available OR another node connected to it.
Node2
The same story with Node2: it still has the old IPs, which are not valid anymore, and it's unable to reconnect with the other two nodes. However, after some time node2 restarts:
At this point, node2 starts passing the correct IPs of the other two nodes and connects to the other nodes that are waiting to form a cluster. With node2 connected to the cluster, the view gets updated, so node1 can see node0 and vice versa (they forget the old IPs and use the updated ones passed by node2).
I don't see anything wrong with the cluster behavior; rather, this is an issue on the operator side. Without at least one node using the correct IPs of the other nodes, it is impossible for the cluster to restart.
What happened on Node2 that changed its wsrep_cluster_address on the second restart?
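To illustrate the restart that unblocked the cluster, here is a hedged sketch of the node2 address change; the 192.0.2.x addresses are placeholders (only the .75/.77/.87 suffixes come from the logs):

# node2 before its second restart: stale peer addresses, nothing reachable
wsrep_cluster_address="gcomm://192.0.2.75,192.0.2.77,192.0.2.87"
# node2 after the operator restarted it with refreshed peer addresses: it reaches
# the waiting nodes, the view is updated, and node0/node1 learn the new IPs from it
wsrep_cluster_address="gcomm://192.0.2.175,192.0.2.177"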
Description
Hi Team,
STR:
1. Start a 3-node PXC cluster.
2. Kill all members.
3. Put unreachable IP addresses into wsrep_cluster_address on one instance.
4. Start that instance with the unreachable peers (see the sketch at the end of this report).
Expected result:
mysqld should exit on a SIGTERM signal
Current result:
mysqld becomes hung and does not react to SIGTERM
many connection messages in the logs
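A minimal sketch of steps 3 and 4, assuming a systemd-managed node and my.cnf configuration; the 192.0.2.x peers are deliberately unreachable placeholder addresses:

# 3. point the node at peers that do not exist (my.cnf, [mysqld] section):
#    wsrep_cluster_address="gcomm://192.0.2.10,192.0.2.11"
# 4. start the node; it loops on connection attempts to the unreachable peers
systemctl start mysql       # may not return while the node waits for peers
# the reported hang: SIGTERM is ignored while the node keeps retrying
systemctl stop mysql        # times out
kill -15 "$(pidof mysqld)"  # no reaction, matching the current result above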