Cluster unavailable during slow startup of a single node (possibly NUMA related)

Description

During an update yesterday we experienced a major outage of a couple of minutes on our three-node PXC (5.6.39) cluster.
The outage happened because one of the nodes (db02) was going through planned maintenance (OS updates). During the maintenance the node nicely removed itself from the cluster, shut itself (MySQL) down and did the software updates (going from percona-xtradb-cluster 5.6.39 to 5.6.40).
After the software updates the node tried to rejoin the cluster and took an exceptionally long time to initialize its InnoDB buffer pool (80G): instead of the usual 1 or 2 seconds, it took more than 3 minutes!

Since this happens before wsrep accepts connections from the other nodes to reform the cluster, the other nodes also stopped accepting new queries, and the cluster was effectively 'down' for the entire startup period of a non-primary node.

A short timeline overview:
14:09:06 MySQL is being shut down by the OS updates
14:09:18 MySQL finished shutdown
14:09:46 OS update is finished and MySQL is being started up again
14:09:46 buffer pool is being initialized (80G)
14:11:22 buffer pool is done initializing (1m36s)
14:11:47 db02 (.152) is receiving IST from db01 (.151)
14:11:48 IST is finished.
14:11:48 buffer pool is being initialized (again)
Here db01 (.151) and db03 (.153) stopped handling queries and downtime ensued.
14:15:14 buffer pool is initialized (3m26s)
14:15:15 WSREP is finally communicating with the other nodes again.
14:15:31 wsrep_ready is set to true and cluster is accepting queries again.

Recently we turned innodb_numa_interleave on after having an issue where the machine started swapping because half of the available memory was allocated to a single CPU (NUMA node). After turning on innodb_numa_interleave this problem was solved.

My guess is that the NUMA interleaving is possibly slowing down the creation of the InnoDB buffer pool.
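As a quick sanity check (a sketch, not part of the original report), the relevant settings can be confirmed on the restarted node from the SQL client; the variable names are standard MySQL/PXC 5.6 variables and the expected values simply mirror what is described above:

-- Confirm that NUMA interleaving and the large buffer pool are in effect on db02
-- (expected values only mirror what this report describes).
SHOW GLOBAL VARIABLES LIKE 'innodb_numa_interleave';       -- expected: ON
SHOW GLOBAL VARIABLES LIKE 'innodb_buffer_pool_size';      -- expected: ~80G, reported in bytes
SHOW GLOBAL VARIABLES LIKE 'innodb_buffer_pool_instances';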

I have attached the mysql error logs of all three nodes in the cluster. The db02 (.152) is the one that was receiving the software update.

Environment

None

Attachments

4

Activity

Krunal Bauskar July 19, 2018 at 8:43 AM

Can you create a test setup and play around with the timeout? (Unfortunately, I tried it at my local end with good enough memory but didn't hit the issue.) If you hit the issue in the test setup, maybe you can try to increase the timeout.

BTW, I was working with the default timeout, and as you could see BP (buffer pool) init took around 60 seconds, which is long enough to trigger the timeouts.
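A possible sketch of such a timeout change, assuming the timeouts in question are the Galera group-communication (EVS) timeouts that decide how long a silent node is tolerated; the parameter names exist in Galera/PXC 5.6, but the values below are purely illustrative, not a recommendation:

-- Sketch only: assuming the timeouts meant here are the Galera EVS timeouts.
SET GLOBAL wsrep_provider_options = 'evs.suspect_timeout=PT30S';
SET GLOBAL wsrep_provider_options = 'evs.inactive_timeout=PT1M';
-- SET GLOBAL does not persist; to survive a restart the same settings would also
-- need to be added to wsrep_provider_options in my.cnf.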

Patrick July 19, 2018 at 8:27 AM

Hi Krunal,

This was just a view of traffic from/to node-2.
I did look into the node-1 and node-3 network as well, and neither of them was spiking unreasonably: node-1 was fluctuating between 60 and 70 Mbps with spikes of no more than 100 Mbps (on a 2x1Gbps bond), and node-3 was around 6 Mbps, peaking at no more than 10 Mbps.

I did indeed also see the timeouts going on. These timeouts all coincide with the timeframe in which node-2 is initializing its InnoDB buffer pool. During this period it seems that node-1 and node-3 want to form a new cluster, but because node-2 isn't answering any WSREP communication (it seems it isn't even accepting the connections), node-1 and node-3 just 'hang' (no quorum, no cluster, not handling any queries).

You are right that right at the end node-1 also had a (single) timeout, which lasted around 1 second, but right after that the cluster formed again. It might be related or not (I don't know), but I'm more interested in the timeouts that occur on node-2 in the minutes before (14:11:54 till 14:15:11).

We have not updated the other cluster nodes yet, but would like to. The customer is (understandably) afraid that the same might happen again, so they are looking for an explanation and hopefully some assurance that it will not happen again.
So if there's anything we can do to monitor/debug this issue better before upgrading the other nodes, that would be great.
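One way to keep an eye on this during the next upgrade (a sketch; these are standard Galera/PXC status variables, not something specific to this bug) is to watch the wsrep status counters on the surviving nodes while the upgraded node restarts:

-- Run on node-1/node-3 while node-2 restarts; if the cluster loses quorum again
-- during the buffer pool init, it shows up here immediately.
SHOW GLOBAL STATUS LIKE 'wsrep_cluster_status';       -- 'Primary' when the component has quorum
SHOW GLOBAL STATUS LIKE 'wsrep_ready';                -- 'ON' when the node accepts queries
SHOW GLOBAL STATUS LIKE 'wsrep_cluster_size';         -- number of nodes in the component
SHOW GLOBAL STATUS LIKE 'wsrep_local_state_comment';  -- e.g. 'Synced'
SHOW GLOBAL STATUS LIKE 'wsrep_evs_delayed';          -- peers the EVS layer currently sees as delayed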

Krunal Bauskar July 19, 2018 at 3:40 AM

Right. The graph is probably still showing a continuous flow of traffic. Is this a view from node-2 or a generic network view? There could still be traffic flowing on the network wire between node-1 and node-3.

BTW, the following line suggests some issue connecting to node 10.x.y.151:

2018-07-11 14:15:11 108063 [Note] WSREP: (95544f83, 'tcp://0.0.0.0:4567') connection to peer 1d30a492 with addr tcp://10.x.y.151:4567 timed out, no messages seen in PT3S

---------

Do you plan to upgrade the other nodes, or is it already done?

(If not, then we can track them to see if they hit the same issue.)
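For reference, the PT3S in the quoted message appears to be the gmcast-level peer timeout (Galera's default for gmcast.peer_timeout is PT3S, though that mapping is an assumption here). The provider timeouts currently in effect on a node can be inspected with:

-- Lists all Galera provider options (including gmcast.peer_timeout and the
-- evs.* timeouts) as currently in effect.
SHOW GLOBAL VARIABLES LIKE 'wsrep_provider_options';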

Patrick July 18, 2018 at 2:12 PM

Hi Krunal,

Thanks for looking into this issue. Apart from the buffer pool being created extremely slowly, I didn't see any out-of-the-ordinary system spikes anywhere. The network interfaces (a dual 1 Gbps bond) also did not spike, nor were they at any higher than usual level; during backups the traffic is much higher (see attachment).
The network graph runs from Tue 10 July 0:00 until Thu 12 July 23:59, i.e. one day before and one day after the incident, to give some indication.
As previously mentioned, the actual incident was on Wed 11 July, between around 14:10 and 14:15, so in the middle of the graph.

Resolution: Cannot Reproduce

Details

Time tracking

7h 30m logged

Created July 12, 2018 at 8:27 AM
Updated March 6, 2024 at 10:30 PM
Resolved June 26, 2019 at 9:52 AM