Liveness probe is not working after recovery

Cannot Reproduce

Description

We noticed a Percona cluster was not reachable and the haproxy instances in front of it were restarting, marking 2 out of 3 backends as down. After thorough investigation it turned out the mysqld-ps process was not running and there were no listeners on ports 3306 and 33062 on 2 of the 3 pods.

On one of the pods mysqld was still listening, but a single node could not form a quorum.

After looking more closely at the liveness script, it turned out it was not really checking whether the mysql process listens and responds, because a file called recovery was still present.

This file seems to be created by the PXC scripts during the recovery process, but nothing appears to remove it afterwards, which makes the liveness probe useless: even if the mysql process crashes, the probe will not detect it, because it exits with 0 while the file is present.
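For illustration only, a minimal sketch of the pattern described above, assuming a hypothetical sentinel path such as /var/lib/mysql/recovery and a simplified check; the real operator script differs in detail:

    #!/bin/bash
    # Simplified liveness check illustrating the reported problem.
    # RECOVERY_FILE is an assumed path; the actual PXC scripts may use another name.
    RECOVERY_FILE=/var/lib/mysql/recovery

    if [ -f "$RECOVERY_FILE" ]; then
        # While the sentinel file exists, the probe reports success
        # unconditionally, so a crashed mysqld is never detected.
        exit 0
    fi

    # Only when the file is absent is mysqld actually checked.
    mysqladmin ping -h 127.0.0.1 || exit 1

Under that reading, the implied fix would be for the recovery scripts to remove the sentinel file once recovery completes, or for the probe to verify mysqld regardless of the file.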

Environment

None

Details

Assignee

Reporter

Needs QA

Yes

Components

Affects versions

Priority

Smart Checklist

Created September 14, 2023 at 2:32 PM
Updated December 13, 2023 at 9:56 PM
Resolved December 13, 2023 at 9:56 PM

Activity

Sveta Smirnova, December 13, 2023 at 9:56 PM

Thank you for the feedback.

What you report looks more like a failed recovery, not a failed liveness check. Recovery may fail for different reasons. Since you don't have a repeatable test case, I am closing this report as "Cannot Reproduce". If you manage to find out what causes recovery to fail, please open a new report with details and steps to reproduce.

Mihail Vukadinoff, September 15, 2023 at 10:33 AM

Thanks. We switched to the debug image for a while since we were experiencing other issues, described here: https://jira.percona.com/browse/PXC-4281, and wanted to get more information on what is actually happening.
But the recent occurrence made me think the problems might be related. Could it be that the debug image behaves differently from the normal image with regard to the recovery process and this lock file?

In this particular case the DB cluster had been running fine for more than 30 days, and only now did we get an alarm that the apps cannot reach the database.

Unfortunately there are no clear steps to reproduce. On another test database, if the file is not there before I force-delete the pods and trigger recovery, it is also not there after recovery finishes.

However, if it is there from before, it seems it doesn't get cleared.
I wasn't able to reproduce this by force-deleting all pods: when recovery completes, it seems to remove the file fine.

So does this mean we were stuck in recovery on those databases?

Slava Sarzhan, September 15, 2023 at 7:59 AM

Hi,
STR means steps to reproduce. Why do you need to use 'percona/percona-xtradb-cluster:8.0.32-24.2-debug'? This image is useful in case of manual recovery or if you need some additional rpms to collect a core dump.

P.S. Full cluster crash recovery is very easy to test. Just delete all PXC pods together with the '--force' flag.
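For reference, a hedged example of how such a test might be run with kubectl, assuming the operator's usual pxc component label and a namespace named pxc (both are assumptions; adjust to your deployment):

    # Force-delete every PXC pod at once to simulate a full cluster crash.
    # Label selector and namespace below are illustrative; check your own labels first.
    kubectl -n pxc delete pod -l app.kubernetes.io/component=pxc --force --grace-period=0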

Mihail Vukadinoff, September 14, 2023 at 3:02 PM

btw, Percona containers themselves are pretty new:

percona/percona-xtradb-cluster:8.0.32-24.2-debug

Mihail Vukadinoff, September 14, 2023 at 2:56 PM

Thanks for the quick response, Slava.

The cluster had been running for more than 30 days; I don't imagine a recovery was running all that time. It looked like all nodes were healthy. Even after we restarted the whole cluster, it went into "full cluster crash" and we revived it by sending a USR1 signal to the most advanced pod. The file still remained there even after we confirmed in the logs that the recovery had finished on all pods.
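For context, sending that signal is typically done with kubectl exec; a sketch, assuming a hypothetical pod name cluster1-pxc-0, a container named pxc, and the entrypoint waiting on a signal to PID 1:

    # Send SIGUSR1 to PID 1 of the pxc container on the most advanced pod
    # so recovery can proceed. Pod and container names are assumptions.
    kubectl -n pxc exec cluster1-pxc-0 -c pxc -- kill -s USR1 1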

I was just reviewing the code and comparing it against version 1.9 to see if something is different in the handling.

We'll plan an upgrade for sure.

Sorry, I didn't understand the question. What is STR?