Issues
- Liveness probe is not working after recovery (K8SPXC-1292, resolved)
- [BUG] xtradb-operator fails to delete the PVCs and secrets if it crashes and restarts in the middle of deleteStatefulSet() (K8SPXC-979, resolved)
- xtradb operator don't apply kube-api-access (volume mount) to pxc statefulset (K8SPXC-930, resolved)
- Updating the Percona Operator to 1.9.0 or 1.10.0 does not delete existing backup cronjobs (K8SPXC-925, resolved)
- Pods Are Not Cleaned Up When Deleting Failed Backup Resources (K8SPXC-921, resolved)
- Backup Jobs Fail Intermittently (K8SPXC-920, resolved)
- [BUG] xtradb operator does not delete PVC after scaling-down leading to resource leak (K8SPXC-918)
- Status of cluster is not updated correctly (K8SPXC-908, resolved)
- Provide a way to use jemalloc for mysqld (K8SPXC-907, resolved)
- HAProxy proxy_protocol_networks setting not working after v1.9.0 (K8SPXC-902, resolved)
- reload startup option not working in proxysql cluster (K8SPXC-900, resolved)
- sql_mode=VERIFY_IDENTITY not working with HAProxy and cert-manager (K8SPXC-899, resolved)
- Unstable scheduled backups with PXC 8.0.23 (K8SPXC-898, resolved)
- [BUG] Operator never creates ssl-internal certificate if crash happens at some particular point (K8SPXC-897, resolved)
- [BUG] Operator cannot create ssl-internal secret if crash happens at some particular point (K8SPXC-896, resolved)
- Operator 1.9.0 Refuses To Deploy Cluster Configured Without A Proxy (K8SPXC-888, resolved)
- Document of PITR Issue (K8SPXC-887, resolved)
- Backup restore to a non-persistent cluster fails (K8SPXC-884)
- Operator labels are not updated during upgrade (K8SPXC-880, resolved)
- '/var/lib/mysql/pxc-entrypoint.sh': Permission denied Error (K8SPXC-879)
- EKS 1.21 does not work with Operator (K8SPXC-877, resolved)
- ${clustername}-pxc-unready not published (K8SPXC-876, resolved)
- Cannot remove PXC manual backup for PVC storage (K8SPXC-871, resolved)
- CRD not updated after Helm upgrade v1.8.0 to v1.9.0 (K8SPXC-869, resolved)
- sidecarResources are not applied to custom defined sidecar (K8SPXC-868, resolved)
- operator error log (K8SPXC-864, resolved)
- I have encountered some problems using the pxc operator in k8s v1.20.4 (K8SPXC-861, resolved)
- Old replica configuration is not purged if a channel is renamed in cr.yml (K8SPXC-859, resolved)
- pod-0 on replica does not automatically reconnect to source after I re-create it (K8SPXC-853, resolved)
- Changing replication user password does not work (K8SPXC-851, resolved)
- Weight is not set by default for a host in a replication channel (K8SPXC-850, resolved)
- Backup finalizer does not delete data from S3 if folder is specified (K8SPXC-842, resolved)
- Upgrade of PXC 5.7 with operator 1.8.0 to 1.9.0 reports replication errors (K8SPXC-839, resolved)
- proxysql errors when used in replica cluster (K8SPXC-835, resolved)
- error when switching replica and source cluster roles (K8SPXC-833, resolved)
- not all replication pod switches are logged in the operator logs (K8SPXC-832, resolved)
- PMM user gets locked out on PMM side after changing password on operator side (K8SPXC-823, resolved)
- custom config from secret is not mounted to proxysql (K8SPXC-821, resolved)
- PXC backup cluster name is wrong in kubectl output (K8SPXC-819, resolved)
- pods not restarted if custom config is updated inside secret or configmap (K8SPXC-818, resolved)
- ready count in cr status can be higher than size value (K8SPXC-815, resolved)
- missing CR status when invalid option specified (K8SPXC-814, resolved)
- restore doesn't error on wrong AWS credentials (K8SPXC-813, resolved)
- HAProxy ready nodes missing in cr status (K8SPXC-811, resolved)
- [BUG] Proxysql statefulset, PVC and services get mistakenly deleted when reading stale proxysql information (K8SPXC-763, resolved)
- [BUG] HAproxy statefulset and services get mistakenly deleted when reading stale `spec.haproxy.enabled` (K8SPXC-725, resolved)
- [BUG] StatefulSet and PVC get mistakenly deleted when reading stale PerconaXtraDBCluster information (K8SPXC-716, resolved)
- Second PerconaXtraDBClusterRestore Always Fails (K8SPXC-515, resolved)
Liveness probe is not working after recovery
Details
- Assignee: Unassigned
- Reporter: Mihail Vukadinoff
- Needs QA: Yes
- Priority: High

Activity
Sveta Smirnova, December 13, 2023 at 9:56 PM
Thank you for the feedback.
What you report looks more like a failed recovery, not a failed liveness check. Recovery may fail for different reasons. Since you don't have a repeatable test case, I am closing this report as "Cannot reproduce". If you manage to find out what causes the recovery to fail, open a new report with details and steps to reproduce.
Mihail Vukadinoff, September 15, 2023 at 10:33 AM
Thanks, we switched to the debug image for a while since we were experiencing other issues described here: https://jira.percona.com/browse/PXC-4281
We wanted to get more info on what's actually happening.
But a recent occurrence made me think that the problems might be related. Could it be that the debug image acts differently from the normal image with regard to the recovery process and this lock file?
In this particular case the DB cluster had been running fine for more than 30 days, and only now did we get an alarm that the apps cannot reach the database.
Unfortunately there are no clear steps to reproduce. If the file is not there when I force delete and then go through recovery, it is also not there once recovery finishes (tested with another test database).
However, if it was there from before, it seems it doesn't get cleared.
I wasn't able to reproduce this when I force deleted all pods: when recovery completes, it seems to remove the file fine.
So does this mean we were stuck in recovery on those databases?
Slava Sarzhan, September 15, 2023 at 7:59 AM
Hi,
STR means "steps to reproduce". Why do you need to use 'percona/percona-xtradb-cluster:8.0.32-24.2-debug'? This image is very useful in case of manual recovery, or if you need some additional rpms to collect a core dump.
P.S. Full cluster crash recovery is very easy to test. Just delete all PXC pods together with the '--force' flag.
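For reference, simulating a full cluster crash this way could look roughly like the command below. This is only a sketch: the namespace and the label selector are assumptions and may differ per deployment.

    # Hypothetical example: force-delete every PXC pod at once to trigger a
    # full cluster crash and the subsequent recovery path.
    kubectl delete pods -n pxc -l app.kubernetes.io/component=pxc --force --grace-period=0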
Mihail Vukadinoff, September 14, 2023 at 3:02 PM
By the way, the Percona containers themselves are pretty new:
percona/percona-xtradb-cluster:8.0.32-24.2-debug
Mihail Vukadinoff, September 14, 2023 at 2:56 PM
Thanks for the quick response, Slava.
The cluster was running for more than 30 days, and I don't imagine a recovery was running all that time; it looked like all nodes were healthy. Even after we restarted the whole cluster it went into "full cluster crash", and we revived it with a USR1 signal to the most advanced pod. The file still remained there even after we saw in the logs that the recovery had finished on all pods.
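For illustration, signalling the most advanced pod to proceed with full cluster crash recovery is typically done along these lines. This is a sketch only, with placeholder pod, container, and namespace names; the exact procedure for a given operator version should be taken from the Percona documentation.

    # Hypothetical example: send SIGUSR1 to mysqld (PID 1) in the most advanced
    # PXC pod so it continues with full cluster crash recovery.
    kubectl exec cluster1-pxc-0 -c pxc -n pxc -- sh -c 'kill -s USR1 1'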
I was just reviewing the code and comparing against version 1.9 to see whether something is different in the handling.
We'll plan an upgrade for sure.
Sorry, I didn't understand the question: what is STR?
We noticed that a Percona cluster was not reachable and the HAProxy instances in front were restarting, marking 2 out of 3 backends as down. After a thorough investigation it turned out that the mysqld-ps process was not running and there were no listeners on ports 3306 and 33062 on 2 of the 3 pods.
On 1 of the pods it was still listening, but a single node couldn't form a quorum.
After looking more closely at the liveness script, it seems it was not really checking whether the mysql process listens and responds, because a file called recovery was still present.
This file seems to be created by the PXC scripts during the recovery process, but nothing appears to remove it afterwards, which makes the liveness probe useless: even if the mysql process crashes, the probe will not recognize it, as it exits with 0 when the file is present.
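To illustrate the reported behavior, a liveness check that short-circuits on a leftover recovery marker would look roughly like the sketch below. This is not the operator's actual script, and the marker file path is an assumption used only for illustration.

    #!/bin/bash
    # Simplified sketch of the failure mode described above.
    RECOVERY_FILE=/var/lib/mysql/recovery   # assumed marker created during crash recovery

    # While the marker is present, the probe exits 0 and reports the pod healthy,
    # even if mysqld is not running or not listening on 3306/33062.
    if [ -f "$RECOVERY_FILE" ]; then
        exit 0
    fi

    # Only when the marker is absent does the probe actually check the server.
    mysqladmin --connect-timeout=10 ping >/dev/null 2>&1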