
peer-list is not restarting haproxy if a PXC pod is re-created during the DNS TTL

Done

Description

https://github.com/percona/percona-xtradb-cluster-operator/blob/1.11.0-CUSTOM-142/cmd/peer-list/main.go#L141

Kubernetes version: v1.26.7-gke.500

  1. pxc-2 pod deleted & re-created at 2023-12-20T20:46:27Z

  2. the restart + ready transition happened while the DNS cache was still valid

  3. the pxc-monit container of the haproxy pod does not initiate a haproxy reload/restart, because a cached DNS value was used while the PXC pod restarted (see the sketch after this list)

  4. the haproxy container marks pxc-2 as down, because it uses the IP address for checks

  5. the process repeats with the other two pods: pxc-1 at 2023-12-20T20:47:13Z, pxc-0 at 2023-12-20T20:46:27Z

  6. the haproxy servers mark all pxc servers as down; pxc-monit has not reloaded the haproxy containers to flush the DNS cache

  7. full cluster outage
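
For context, the change-detection pattern in peer-list looks roughly like the sketch below. This is a simplified illustration, not the actual operator code; the SRV lookup and the service name are assumptions. The point is that the reload hook fires only when the resolved peer set differs from the previous iteration, so an answer served from a still-valid DNS cache across the pod re-creation means no change is ever observed and haproxy is never reloaded:

package main

import (
	"log"
	"net"
	"sort"
	"strings"
	"time"
)

// resolvePeers returns the sorted peer hostnames behind the headless service.
func resolvePeers(svc string) (string, error) {
	_, srvs, err := net.LookupSRV("", "", svc) // may be answered from a still-valid DNS cache
	if err != nil {
		return "", err
	}
	names := make([]string, 0, len(srvs))
	for _, s := range srvs {
		names = append(names, strings.TrimSuffix(s.Target, "."))
	}
	sort.Strings(names)
	return strings.Join(names, ","), nil
}

func main() {
	const svc = "cluster1-pxc.pxc.svc.cluster.local" // hypothetical headless service name
	var last string
	for {
		peers, err := resolvePeers(svc)
		if err == nil && peers != last {
			// the real tool would run its on-change hook here, which reloads haproxy
			log.Println("peer set changed:", peers)
			last = peers
		}
		time.Sleep(1 * time.Second) // the 1-second loop period discussed in the comments below
	}
}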

A similar issue is described in the haproxy bug tracker:

https://github.com/haproxy/haproxy/issues/1278

The developers suggested tuning the resolver properly:

https://docs.haproxy.org/2.6/configuration.html#5.3

 

Possible solutions:

a) change the haproxy configuration to refresh the PXC IP addresses automatically (sketched below)

b) modify pxc-monit to track IP address + DNS name pairs and do a reload if the IP addresses change
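
A minimal sketch of option (a), using the resolvers syntax from the haproxy 2.6 docs linked above; the section name and parameters are illustrative, and the comments below converge on essentially this configuration:

resolvers kubernetes
  parse-resolv-conf

backend galera-nodes
  server cluster1-pxc-0 cluster1-pxc-0.cluster1-pxc.pxc.svc.cluster.local:3306 resolvers kubernetes check inter 10000 rise 1 fall 2 weight 1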

Environment

None

AFFECTED CS IDs

CS0042115

Details

Needs QA

Yes

Needs Doc

Yes

Created January 2, 2024 at 3:00 PM
Updated June 11, 2024 at 7:57 AM
Resolved January 15, 2024 at 8:35 AM

Activity

Slava Sarzhan, January 15, 2024 at 8:35 AM

Good news. Thank you for your testing. We have included this fix in the next PXCO release.

Nickolay Ihalainen, January 11, 2024 at 2:53 PM

I can confirm that having resolvers kubernetes in .spec.haproxy.configuration and HA_SERVER_OPTIONS: cmVzb2x2ZXJzIGt1YmVybmV0ZXMgY2hlY2sgaW50ZXIgMzAwMDAgcmlzZSAxIGZhbGwgNSB3ZWlnaHQgMQ== (decoded below) solves the issue.
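
For reference, the HA_SERVER_OPTIONS value above is just a base64-encoded server-options string:

$ echo 'cmVzb2x2ZXJzIGt1YmVybmV0ZXMgY2hlY2sgaW50ZXIgMzAwMDAgcmlzZSAxIGZhbGwgNSB3ZWlnaHQgMQ==' | base64 -d
resolvers kubernetes check inter 30000 rise 1 fall 5 weight 1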

  1. Set up a pxc cluster with the resolver enabled

  2. Add cluster1-env-vars-haproxy secret

  3. Freeze peer-list on haproxy-0

  4. Connect from haproxy-0 to localhost (i.e., connected through haproxy on cluster1-haproxy-0)

  5. Delete pxc-0

  6. new queries hang; after the pxc-0 restart, new queries work as expected, even though peer-list does not reload the configuration

ege.gunes, January 9, 2024 at 8:29 AM

I added a custom resolver to the HAProxy config:

resolvers kubernetes
  parse-resolv-conf

and configured servers to use this resolver:

backend galera-nodes
  mode tcp
  option srvtcpka
  balance roundrobin
  option external-check
  external-check command /usr/local/bin/check_pxc.sh
  server cluster1-pxc-0 cluster1-pxc-0.cluster1-pxc.pxc.svc.cluster.local:3306 resolvers kubernetes check inter 10000 rise 1 fall 2 weight 1 on-marked-up shutdown-backup-sessions
  server cluster1-pxc-2 cluster1-pxc-2.cluster1-pxc.pxc.svc.cluster.local:3306 resolvers kubernetes check inter 10000 rise 1 fall 2 weight 1 backup
  server cluster1-pxc-1 cluster1-pxc-1.cluster1-pxc.pxc.svc.cluster.local:3306 resolvers kubernetes check inter 10000 rise 1 fall 2 weight 1 backup

backend galera-admin-nodes
  mode tcp
  option srvtcpka
  balance roundrobin
  option external-check
  external-check command /usr/local/bin/check_pxc.sh
  server cluster1-pxc-0 cluster1-pxc-0.cluster1-pxc.pxc.svc.cluster.local:33062 resolvers kubernetes check inter 10000 rise 1 fall 2 weight 1 on-marked-up shutdown-backup-sessions
  server cluster1-pxc-2 cluster1-pxc-2.cluster1-pxc.pxc.svc.cluster.local:33062 resolvers kubernetes check inter 10000 rise 1 fall 2 weight 1 backup
  server cluster1-pxc-1 cluster1-pxc-1.cluster1-pxc.pxc.svc.cluster.local:33062 resolvers kubernetes check inter 10000 rise 1 fall 2 weight 1 backup

backend galera-replica-nodes
  mode tcp
  option srvtcpka
  balance roundrobin
  option external-check
  external-check command /usr/local/bin/check_pxc.sh
  server cluster1-pxc-0 cluster1-pxc-0.cluster1-pxc.pxc.svc.cluster.local:3306 resolvers kubernetes check inter 10000 rise 1 fall 2 weight 1
  server cluster1-pxc-1 cluster1-pxc-1.cluster1-pxc.pxc.svc.cluster.local:3306 resolvers kubernetes check inter 10000 rise 1 fall 2 weight 1
  server cluster1-pxc-2 cluster1-pxc-2.cluster1-pxc.pxc.svc.cluster.local:3306 resolvers kubernetes check inter 10000 rise 1 fall 2 weight 1

backend galera-mysqlx-nodes
  mode tcp
  option srvtcpka
  balance roundrobin
  option external-check
  external-check command /usr/local/bin/check_pxc.sh
  server cluster1-pxc-0 cluster1-pxc-0.cluster1-pxc.pxc.svc.cluster.local:33060 resolvers kubernetes check inter 10000 rise 1 fall 2 weight 1 on-marked-up shutdown-backup-sessions
  server cluster1-pxc-2 cluster1-pxc-2.cluster1-pxc.pxc.svc.cluster.local:33060 resolvers kubernetes check inter 10000 rise 1 fall 2 weight 1 backup
  server cluster1-pxc-1 cluster1-pxc-1.cluster1-pxc.pxc.svc.cluster.local:33060 resolvers kubernetes check inter 10000 rise 1 fall 2 weight 1 backup

To simulate the same scenario I’m doing the following:

  1. attach gdb to peer-list processes in each HAProxy pod (via `kubectl debug -it --profile general pod/cluster1-haproxy-0 --image debian --target pxc-monit -- bash`)

  2. delete cluster1-pxc-1 and cluster1-pxc-2

  3. wait until ready, then wait an additional 30 seconds (DNS TTL)

  4. continue peer-list execution

  5. peer-list does not detect that the two servers restarted; no messages in the log after continuing

After these steps I observe that svc/cluster1-haproxy-replicas sends queries to cluster1-pxc-1 and cluster1-pxc-2. So the problem seems to be fixed by the custom resolver, but I’m not fully confident in this fix.

The HAProxy docs say that it creates a resolvers section named "default":

At startup, HAProxy tries to generate a resolvers section named "default", if no section was named this way in the configuration. This section is used by default by the httpclient and uses the parse-resolv-conf keyword. If HAProxy failed to generate automatically this section, no error or warning are emitted.

If there’s a resolver called "default" that uses parse-resolv-conf, it should already behave like the above, because I’m not doing anything fancy in my resolver; it also just uses parse-resolv-conf. I’m a bit confused. You can also check it yourself using perconalab/percona-xtradb-cluster-operator:k8spxc-1339-haproxy-4 as the HAProxy image.
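
One way to check which resolvers sections HAProxy actually loaded (including an auto-generated "default" one) is the runtime API's "show resolvers" command, which dumps each resolvers section with its nameservers and query counters. This assumes the stats socket is enabled in the haproxy container and that socat is available; the socket path below is a hypothetical example:

$ echo "show resolvers" | socat stdio unix-connect:/var/run/haproxy.sock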

Nickolay Ihalainen, January 3, 2024 at 10:42 AM

During RCA for such cases we should know how long peer-list spent in a loop iteration. The current period is 1 second, but the real loop time could be significantly longer with slow DNS/network or a slow pxc-monit pod. It would be great to have an info/warning message (with possible throttling, e.g. “occurred N times”) for loops longer than 10-20 seconds.
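
A hypothetical sketch of that instrumentation (not operator code; the threshold and throttling scheme are just examples):

package main

import (
	"log"
	"time"
)

func main() {
	const warnAfter = 15 * time.Second // within the "10-20 seconds" range suggested above
	slow := 0
	for {
		start := time.Now()
		// ... resolve peers, run checks, call the on-change hook ...
		if d := time.Since(start); d > warnAfter {
			slow++
			if slow == 1 || slow%10 == 0 { // simple throttling: report "occurred N times"
				log.Printf("warning: loop iteration took %s (occurred %d times)", d, slow)
			}
		}
		time.Sleep(1 * time.Second) // current loop period
	}
}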

Nickolay Ihalainen, January 3, 2024 at 8:52 AM

Steps to simulate the problem (it’s hard to get exactly the same behavior as on the production system):

Situation A:

  1. Freeze the peer-list process (I’m attaching gdb on haproxy-0; see the sketch after this list)

  2. kubectl delete pod -n pxc cluster1-pxc-1 cluster1-pxc-2

  3. wait until ready, then wait an additional 30 seconds (DNS TTL)

  4. continue peer-list execution

  5. peer-list does not detect that the two servers restarted; no messages in the log after continuing

  6. haproxy-0 does not send any queries to the secondaries (pxc-1 and pxc-2)

  7. this persists indefinitely if pxc remains stable
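
The freeze/continue steps can be reproduced roughly as follows, assuming an ephemeral debug container that shares the pxc-monit process namespace (as in the kubectl debug command from the January 9 comment) and that gdb/procps can be installed in it; the process name match is an assumption:

# inside the debug container attached to the pxc-monit target
apt-get update && apt-get install -y gdb procps
gdb -p "$(pgrep -o peer-list)"   # attaching suspends (freezes) peer-list
# ... delete the PXC pods, wait for ready + the DNS TTL ...
# at the (gdb) prompt: "continue" resumes peer-list, "quit" detaches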

 

Situation B:

  1. Freeze peer-list process (I’m attaching gdb on haproxy-0)

  2. kubectl delete pod -n pxc cluster1-pxc-1 cluster1-pxc-2

  3. wait until ready

  4. kubectl delete pod -n pxc cluster1-pxc-0

  5. haproxy is not available

  6. continue peer-list execution

  7. haproxy is still not ready

  8. the haproxy container is restarted due to the liveness check