Issues
- peer-list is not restarting HAProxy if PXC pod is re-created during DNS TTL (K8SPXC-1339, Resolved, ege.gunes)
- Make an option to use TCP for peer-list DNS lookups (K8SPXC-1284, George Kechagias)
- Operator cannot clean replication's failover sources if replications have been stopped (K8SPXC-1256, Resolved, ege.gunes)
- Operator is vulnerable to misoperations for multiple properties in CR and drives the cluster to a broken state (K8SPXC-1231)
- Upgrading cluster fails when dataset has a large number of tables (K8SPXC-1222, Resolved, ege.gunes)
- HAProxy flooding with monitoring logs generated via health check script (K8SPXC-1211, Resolved, Slava Sarzhan)
- Operator recreates secrets on proxysql -> haproxy switch if delete-proxysql-pvc is used (K8SPXC-1164, Resolved, Andrii Dema)
- Cluster stuck initializing if one pod is unready and passwords are changed (K8SPXC-1160, inel.pandzic)
- SmartUpdate potentially leads to irrecoverable PXC cluster (K8SPXC-1155, Resolved, Julio Pasinatto)
- Add innodb_log_file_size to auto-tuning (K8SPXC-1154)
- PVC finalizer doesn't delete secrets if the default secret name is used (K8SPXC-1149, Resolved, Andrii Dema)
- Unable to add, delete and update service labels/annotations for ProxySQL (K8SPXC-1137, Resolved, natalia.marukovich)
- PXC Operator goes to CrashLoopBackOff in cluster with 700+ CRDs (K8SPXC-1130, Resolved)
- Cluster fails to start if you change to Streaming replication (K8SPXC-1119, Resolved, ege.gunes)
- PITR collector gaps hard to monitor (K8SPXC-1118, Resolved, ege.gunes)
- spec.backup.storages.<label>.s3.credentialsSecret is not required but the default is invalid (K8SPXC-1114)
- xtrabackup user password change triggers restart (K8SPXC-1103)
- Support for audit logs (K8SPXC-1102)
- mysqld_exporter is not restarted after monitor mysql password change (K8SPXC-1101, Resolved)
- Can't use slash in a password for monitor user (K8SPXC-1100, Resolved)
- CrashLoopBackOff after password change with password_history or password validation (K8SPXC-1099, Resolved, inel.pandzic)
- Operator uses "insecure" passwords not passing validation_plugin policies and password_history (K8SPXC-1097, Resolved)
- Process the SIGTERM signal (K8SPXC-1095, Resolved)
- Incorrect cluster name mentioned in get secrets for GKE (K8SPXC-1092, Resolved, dmitriy.kostiuk)
- Configurable PXC restore job CPU resources (K8SPXC-1088, Resolved, Andrii Dema)
- PXC fails to start with "cat: /var/run/secrets/kubernetes.io/serviceaccount/namespace: No such file or directory" (K8SPXC-1074, Resolved, natalia.marukovich)
- Some fields are ineffective due to the too generic PodSpec struct (K8SPXC-1069)
- Users are unable to delete haproxy labels using the field cr.spec.haproxy.labels (K8SPXC-1068)
- pxc-monit and proxysql-monit containers print passwords (K8SPXC-1059, Resolved, Tomislav Plavcic)
- Provide name and digest values for Percona Certified Images for non-latest releases (K8SPXC-1058, Resolved, dmitriy.kostiuk)
- Helm: enable support for pmmserverkey in the secret (K8SPXC-1054, Resolved, Tomislav Plavcic)
- Defining a sidecarVolume yields a broken StatefulSet (K8SPXC-1048, Resolved, Tomislav Plavcic)
- Helm chart upgrade issue (K8SPXC-1047, Resolved, Tomislav Plavcic)
- Operator error messages when enabling require_secure_transport (K8SPXC-1041, Resolved, dmitriy.kostiuk)
- pmm-client can't connect to pmm-server using password (K8SPXC-1034, Andrii Dema)
- Replica cluster (cross-site) doesn't work with ProxySQL (K8SPXC-1029)
- Update k8s libraries (K8SPXC-967, Resolved, ege.gunes)
peer-list is not restarting HAProxy if PXC pod is re-created during DNS TTL (K8SPXC-1339)
Activity
Slava Sarzhan, January 15, 2024 at 8:35 AM
@Nickolay Ihalainen good news. Thank you for your testing. We have included this fix in the next PXCO release.
Nickolay Ihalainen, January 11, 2024 at 2:53 PM
@ege.gunes I can confirm that having resolvers kubernetes in .spec.haproxy.configuration and HA_SERVER_OPTIONS: cmVzb2x2ZXJzIGt1YmVybmV0ZXMgY2hlY2sgaW50ZXIgMzAwMDAgcmlzZSAxIGZhbGwgNSB3ZWlnaHQgMQ== (base64 for "resolvers kubernetes check inter 30000 rise 1 fall 5 weight 1") solves the issue.
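A minimal sketch of that setup, assuming a cluster named cluster1 in namespace pxc as elsewhere in this ticket; the secret name my-env-var-secrets and the envVarsSecret wiring are assumptions based on the operator's environment-variable convention, and only the relevant fragment of the custom HAProxy configuration is shown:

    # Hypothetical secret carrying HA_SERVER_OPTIONS (data value is base64, as above)
    apiVersion: v1
    kind: Secret
    metadata:
      name: my-env-var-secrets
      namespace: pxc
    data:
      HA_SERVER_OPTIONS: cmVzb2x2ZXJzIGt1YmVybmV0ZXMgY2hlY2sgaW50ZXIgMzAwMDAgcmlzZSAxIGZhbGwgNSB3ZWlnaHQgMQ==
    ---
    # PerconaXtraDBCluster excerpt: custom resolver in the HAProxy config
    spec:
      haproxy:
        envVarsSecret: my-env-var-secrets
        configuration: |
          resolvers kubernetes
            parse-resolv-conf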
To verify:
1. Set up a PXC cluster with the resolver enabled.
2. Freeze peer-list on haproxy-0.
3. Connect from haproxy-0 to localhost (connected via haproxy-0 to cluster1-haproxy-0).
4. Delete pxc-0.
New queries hang; after pxc-0 restarts, new queries work as expected, even though peer-list is not reloading the configuration.
ege.gunes, January 9, 2024 at 8:29 AM
I added a custom resolver to the HAProxy config:

    resolvers kubernetes
      parse-resolv-conf

and configured the servers to use this resolver:
    backend galera-nodes
      mode tcp
      option srvtcpka
      balance roundrobin
      option external-check
      external-check command /usr/local/bin/check_pxc.sh
      server cluster1-pxc-0 cluster1-pxc-0.cluster1-pxc.pxc.svc.cluster.local:3306 resolvers kubernetes check inter 10000 rise 1 fall 2 weight 1 on-marked-up shutdown-backup-sessions
      server cluster1-pxc-2 cluster1-pxc-2.cluster1-pxc.pxc.svc.cluster.local:3306 resolvers kubernetes check inter 10000 rise 1 fall 2 weight 1 backup
      server cluster1-pxc-1 cluster1-pxc-1.cluster1-pxc.pxc.svc.cluster.local:3306 resolvers kubernetes check inter 10000 rise 1 fall 2 weight 1 backup

    backend galera-admin-nodes
      mode tcp
      option srvtcpka
      balance roundrobin
      option external-check
      external-check command /usr/local/bin/check_pxc.sh
      server cluster1-pxc-0 cluster1-pxc-0.cluster1-pxc.pxc.svc.cluster.local:33062 resolvers kubernetes check inter 10000 rise 1 fall 2 weight 1 on-marked-up shutdown-backup-sessions
      server cluster1-pxc-2 cluster1-pxc-2.cluster1-pxc.pxc.svc.cluster.local:33062 resolvers kubernetes check inter 10000 rise 1 fall 2 weight 1 backup
      server cluster1-pxc-1 cluster1-pxc-1.cluster1-pxc.pxc.svc.cluster.local:33062 resolvers kubernetes check inter 10000 rise 1 fall 2 weight 1 backup

    backend galera-replica-nodes
      mode tcp
      option srvtcpka
      balance roundrobin
      option external-check
      external-check command /usr/local/bin/check_pxc.sh
      server cluster1-pxc-0 cluster1-pxc-0.cluster1-pxc.pxc.svc.cluster.local:3306 resolvers kubernetes check inter 10000 rise 1 fall 2 weight 1
      server cluster1-pxc-1 cluster1-pxc-1.cluster1-pxc.pxc.svc.cluster.local:3306 resolvers kubernetes check inter 10000 rise 1 fall 2 weight 1
      server cluster1-pxc-2 cluster1-pxc-2.cluster1-pxc.pxc.svc.cluster.local:3306 resolvers kubernetes check inter 10000 rise 1 fall 2 weight 1

    backend galera-mysqlx-nodes
      mode tcp
      option srvtcpka
      balance roundrobin
      option external-check
      external-check command /usr/local/bin/check_pxc.sh
      server cluster1-pxc-0 cluster1-pxc-0.cluster1-pxc.pxc.svc.cluster.local:33060 resolvers kubernetes check inter 10000 rise 1 fall 2 weight 1 on-marked-up shutdown-backup-sessions
      server cluster1-pxc-2 cluster1-pxc-2.cluster1-pxc.pxc.svc.cluster.local:33060 resolvers kubernetes check inter 10000 rise 1 fall 2 weight 1 backup
      server cluster1-pxc-1 cluster1-pxc-1.cluster1-pxc.pxc.svc.cluster.local:33060 resolvers kubernetes check inter 10000 rise 1 fall 2 weight 1 backup
To simulate the same scenario I'm doing the following:
1. Attach gdb to the peer-list processes in each HAProxy pod (via `kubectl debug -it --profile general pod/cluster1-haproxy-0 --image debian --target pxc-monit -- bash`); see the sketch after this list.
2. Delete cluster1-pxc-1 and cluster1-pxc-2.
3. Wait until ready, then wait an additional 30 seconds (DNS TTL).
4. Continue peer-list execution.
peer-list does not detect that the two servers restarted; there are no messages in the log after continuing.
After these steps I observe that svc/cluster1-haproxy-replicas sends queries to cluster1-pxc-1 and cluster1-pxc-2. So the problem seems to be fixed by the custom resolver, but I'm not fully confident in this fix.
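A sketch of the freeze/continue step under the assumptions above (the debian debug image and a shared process namespace via --target are from the command in step 1; gdb and pgrep are not in that image by default, and the pid lookup is illustrative):

    # inside the ephemeral debug container targeting pxc-monit
    apt-get update && apt-get install -y gdb procps
    PID=$(pgrep -f peer-list)   # peer-list pid, visible via the shared pid namespace
    gdb -p "$PID"               # attaching stops (freezes) the process
    # ... delete the PXC pods, wait for ready + DNS TTL ...
    (gdb) continue              # resume peer-list execution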
The HAProxy docs say that it creates a resolver named default:

At startup, HAProxy tries to generate a resolvers section named "default", if no section was named this way in the configuration. This section is used by default by the httpclient and uses the parse-resolv-conf keyword. If HAProxy failed to generate automatically this section, no error or warning are emitted.

If there's a resolver called default that uses parse-resolv-conf, it should already behave like the above, because I'm not doing anything fancy in my resolver; it also just uses parse-resolv-conf. @Nickolay Ihalainen I'm a bit confused. You can also check it yourself using perconalab/percona-xtradb-cluster-operator:k8spxc-1339-haproxy-4 as the HAProxy image.
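One way to switch to that image for testing, assuming the custom resource is named cluster1 in namespace pxc as elsewhere in this ticket:

    kubectl -n pxc patch pxc cluster1 --type=merge \
      -p '{"spec":{"haproxy":{"image":"perconalab/percona-xtradb-cluster-operator:k8spxc-1339-haproxy-4"}}}'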
Nickolay Ihalainen, January 3, 2024 at 10:42 AM
During RCA for such cases we should know how long peer-list spent in a loop. The current period is 1 second, but the real loop time could be significantly longer with slow DNS, a slow network, or a slow pxc-monit pod. It would be great to have an info/warning message (with possible throttling, "occurred N times") for loops longer than 10-20 seconds.
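A minimal Go sketch of such instrumentation (a hypothetical helper, not the actual peer-list code; the 15-second threshold and one-minute throttle window are illustrative):

    package main

    import (
        "log"
        "time"
    )

    // slowLoopWarner emits a warning when a loop iteration exceeds a
    // threshold, throttling repeats and counting suppressed occurrences
    // so they can be reported as "occurred N times".
    type slowLoopWarner struct {
        threshold  time.Duration
        window     time.Duration // minimum gap between warnings
        lastWarn   time.Time
        suppressed int
    }

    func (w *slowLoopWarner) observe(elapsed time.Duration) {
        if elapsed < w.threshold {
            return
        }
        w.suppressed++
        if !w.lastWarn.IsZero() && time.Since(w.lastWarn) < w.window {
            return // throttled: just count this occurrence
        }
        log.Printf("loop iteration took %s (threshold %s), occurred %d time(s) since last warning",
            elapsed, w.threshold, w.suppressed)
        w.lastWarn = time.Now()
        w.suppressed = 0
    }

    func main() {
        w := &slowLoopWarner{threshold: 15 * time.Second, window: time.Minute}
        for {
            start := time.Now()
            // ... resolve peer DNS names and reload haproxy on changes ...
            time.Sleep(time.Second) // nominal 1-second loop period, as in peer-list
            w.observe(time.Since(start))
        }
    }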
Nickolay Ihalainen, January 3, 2024 at 8:52 AM
Steps to simulate the problem (it's hard to get exactly the same behavior as on the production system).

Situation A:
1. Freeze the peer-list process (I'm attaching gdb on haproxy-0).
2. kubectl delete -n pxc cluster1-pxc-1 cluster1-pxc-2
3. Wait until ready, then wait an additional 30 seconds (DNS TTL).
4. Continue peer-list execution.
Result: peer-list does not detect that the two servers restarted (no messages in the log after continuing); haproxy-0 does not send any queries to the secondaries (pxc-1 and pxc-2); this lasts forever if PXC is stable.
Situation B:
1. Freeze the peer-list process (I'm attaching gdb on haproxy-0).
2. kubectl delete -n pxc cluster1-pxc-1 cluster1-pxc-2
3. Wait until ready.
4. kubectl delete -n pxc cluster1-pxc-0
5. HAProxy is now unavailable.
6. Continue peer-list execution.
Result: HAProxy is still not ready; the haproxy container is restarted by the liveness check (https://github.com/percona/percona-xtradb-cluster-operator/blob/1.11.0-CUSTOM-142/cmd/peer-list/main.go#L141).
What happened in production (Kubernetes v1.26.7-gke.500):
- The pxc-2 pod was deleted and re-created at 2023-12-20T20:46:27Z; the restart and readiness happened while the DNS cache was still valid.
- The pxc-monit container of the haproxy pod did not initiate a haproxy reload/restart, because a cached DNS value was used while the PXC pod restarted.
- The haproxy container marked pxc-2 as down, because it uses the IP address for checks.
- The process repeated with the other two pods: pxc-1 at 2023-12-20T20:47:13Z, pxc-0 at 2023-12-20T20:46:27Z.
- The haproxy servers marked all PXC servers as down; pxc-monit did not reload the haproxy containers to flush the DNS cache.
- Full cluster outage.
A similar issue is described in the HAProxy bug tracker: https://github.com/haproxy/haproxy/issues/1278. The developers suggested tuning the resolver properly: https://docs.haproxy.org/2.6/configuration.html#5.3
Possible solutions:
a) change the HAProxy configuration to refresh PXC IP addresses automatically (see the sketch below);
b) modify pxc-monit to collect IP address + DNS name pairs and trigger a reload when the IP addresses change.
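A sketch of option (a), following the resolvers documentation linked above; the section name, timing values, and the init-addr choice are illustrative, not a tested configuration:

    resolvers kubernetes
      parse-resolv-conf
      resolve_retries 3
      timeout resolve 1s
      timeout retry   1s
      hold valid      10s
      hold obsolete   30s

    # each server line then opts into the resolver, e.g.:
    server cluster1-pxc-0 cluster1-pxc-0.cluster1-pxc.pxc.svc.cluster.local:3306 resolvers kubernetes init-addr last,libc,none check inter 10000 rise 1 fall 2 weight 1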