Entire cluster hangs and application reports deadlocks.
Activity

Aaditya Dubey March 22, 2024 at 12:00 PM
Hi
We still haven't heard back from you, so I assume the issue no longer persists and will close the ticket. If you disagree, please reply and open a new follow-up Jira ticket.

Sveta Smirnova December 13, 2023 at 11:01 PM
Thank you for the new information.
Around timestamp 1687352827.917616943 I see:
- Hanging InnoDB status output in 34290_bridgeclubs-kennemerland02-db-wordpress-db-pxc-2.log, showing that this node was likely the writer.
- Messages that node dc933056-b77a left the cluster in the 34288_bridgeclubs-kennemerland02-db-wordpress-db-pxc-0.log and 34289_bridgeclubs-kennemerland02-db-wordpress-db-pxc-1.log logs.
- After these messages, InnoDB status output in 34288_bridgeclubs-kennemerland02-db-wordpress-db-pxc-0.log.
This could be a load balancer issue. Which load balancer do you use? Could you please provide its logs from the same time frame? Please also send us the configuration (cr.yaml) for your cluster so I can check whether a load balancer problem is involved.
You can use pt-k8s-debug-collector (https://docs.percona.com/percona-toolkit/pt-k8s-debug-collector.html) to collect this information.
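For reference, a minimal invocation sketch; the namespace value is a placeholder and the exact flags should be checked against the linked documentation:
  # Collect operator, pod and PXC diagnostics with pt-k8s-debug-collector
  # (namespace is a placeholder; verify flags against the documentation).
  pt-k8s-debug-collector --resource=pxc --namespace=<your-namespace>
  # The resulting archive includes pod logs and the cluster custom resource
  # (cr.yaml), which can then be attached to this ticket.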

Mihail Vukadinoff August 10, 2023 at 12:51 PM
Hi, just wanted to add the full set of configuration files that the Percona Kubernetes operator generates; our settings go only in the "init.cfg" file.
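To show where those settings end up, here is a sketch of how the generated configuration can be pulled from a running cluster; the ConfigMap name follows the operator's usual <cluster-name>-pxc convention, and all names and the in-pod path are assumptions:
  # Dump the operator-managed ConfigMap that carries the custom settings
  # (names are assumptions; adjust to your deployment).
  kubectl -n <namespace> get configmap <cluster-name>-pxc -o yaml

  # Compare with what a pod actually has on disk (path is an assumption):
  kubectl -n <namespace> exec <cluster-name>-pxc-0 -c pxc -- ls /etc/mysql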

Mihail Vukadinoff August 8, 2023 at 7:36 AM
Hello, we had another occurrence from yesterday and we collected more info before restarting the cluster.
We have the logs from the 3 pods, the InnoDB status, and the processlist.
It's worth mentioning another change we made recently. We found out that many of the apps were pointing to the "pxc" Kubernetes service, which simply load-balances between the pods, while HAProxy has logic to pick one of the DB members as primary and treat the others as backups, so requests are sent only to the primary and the backups are used only when the primary is unavailable. We therefore ensured all the apps use the HAProxy service rather than connecting directly. In theory this should reduce the chance of deadlocks, since no simultaneous transactions will run in parallel on different cluster members and the certification process will be one-way, from the primary to the others. While this has reduced the number of deadlocks we see, it apparently still doesn't resolve the issue.
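As a sanity check for this routing change, something like the following confirms which services exist and that connections through HAProxy always land on the same writer node; the service names, client image, and credentials are assumptions based on the operator's defaults:
  # List the client-facing services (the operator's defaults are assumed to
  # be <cluster-name>-pxc and <cluster-name>-haproxy).
  kubectl -n <namespace> get svc | grep -E 'pxc|haproxy'

  # Repeated connections through HAProxy should always report the same
  # backend hostname, i.e. the single writer node.
  kubectl -n <namespace> run mysql-check --rm -it --restart=Never --image=percona:8.0 -- \
    mysql -h <cluster-name>-haproxy -uroot -p<root-password> -e "SELECT @@hostname;"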

Mihail Vukadinoff August 3, 2023 at 2:59 PM
Hi, just wanted to update that we switched one of the production DB clusters that failed most often in the past months to a debug image with debug logging. We also removed the pxc_strict_mode setting there.
I'm attaching the detailed log from the startup for more information.
We still haven't had a lockup with it in the past few days, but I'm hoping that if we do, the debug log will give us more insight.
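For completeness, a sketch of that change expressed as a patch against the custom resource; the debug image tag and all names are assumptions and should be taken from Percona's published image list:
  # Point the PXC pods at a debug build (image tag is an assumption; check
  # the published tags). The pxc_strict_mode line was simply removed from
  # spec.pxc.configuration in cr.yaml and re-applied with kubectl apply.
  kubectl -n <namespace> patch pxc <cluster-name> --type=merge \
    -p '{"spec":{"pxc":{"image":"percona/percona-xtradb-cluster:8.0.26-16.1-debug"}}}'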

Description
We have many database cluster instances deployed on a Kubernetes cluster using the PXC operator.
Current versions used:
percona-xtradb-cluster-operator:1.8.0
percona/percona-xtradb-cluster:8.0.26-16.1
Occasionally something happens and the clusters appear entirely hung. The applications report they're getting deadlocks. The only way out of this situation is to stop all pods simultaneously and start them back up again; we assume that this releases the locks.
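One way to perform that "stop everything, then start again" step without deleting pods by hand is the operator's pause flag; a minimal sketch, with the cluster and namespace names as placeholders:
  # Pause the cluster: the operator scales the PXC and proxy pods down.
  kubectl -n <namespace> patch pxc <cluster-name> --type=merge -p '{"spec":{"pause":true}}'

  # Once all pods are gone, resume; the nodes restart and rejoin the cluster.
  kubectl -n <namespace> patch pxc <cluster-name> --type=merge -p '{"spec":{"pause":false}}'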
From the Kubernetes point of view the PXC components appear to be in a running state.
The problem seems to start exactly when these errors appear in the logs:
Failed to report last committed 14018292-xxxxxxxxxxxxxxxx33, -110 (Connection timed out)"}
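When that error shows up and the cluster appears hung, the Galera status counters usually show whether flow control or a non-Primary component is the cause; a minimal check sketch, with pod names and credentials as placeholders:
  # On each node, check membership and flow control; a stalled writer often
  # shows wsrep_flow_control_paused near 1.0 or a growing recv queue.
  for pod in <cluster-name>-pxc-0 <cluster-name>-pxc-1 <cluster-name>-pxc-2; do
    kubectl -n <namespace> exec "$pod" -c pxc -- \
      mysql -uroot -p<root-password> -e "SHOW GLOBAL STATUS WHERE Variable_name IN
        ('wsrep_cluster_status','wsrep_local_state_comment',
         'wsrep_flow_control_paused','wsrep_local_recv_queue');"
  done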
We have no evidence of prolonged network outages, but there might have been short issues or disconnects. Our network monitoring hasn't registered any significant packet loss, although the environment runs on a cloud provider where we've seen sporadic network issues in the past.
Since this time 2 out of 30 database clusters got stuck at roughly the same time, we suspect the trigger is external, for example a short network issue, but we have no evidence of that at the moment. Still, the aim is to have a resilient configuration that can recover by itself.
We notice this behavior every other week, so about twice a month we're hit by this, and it causes a total application outage until the DB pods are reinitialized.
Application log:
Database pod logs:
PXC side-car log: