Issues
- DNS resolution problem forces haproxy to remove all pxc nodes, including alive ones (K8SPXC-1220, resolved; ege.gunes)
- k8s worker fault causes endless Terminating status for pxc pod and huge messages in logs (K8SPXC-1216; ege.gunes)
- mc: <ERROR> Unable to initialize new alias from the provided credentials. The secret key required to complete authentication could not be found. The region must be specified if this is not the home region for the tenancy. (K8SPXC-1128, resolved; Slava Sarzhan)
- PiTR support for self-signed S3 certificates (K8SPXC-1105, resolved; natalia.marukovich)
- mysqld_exporter is not restarted after monitor mysql password change (K8SPXC-1101, resolved)
- CrashLoopBackOff after password change with password_history or password validation (K8SPXC-1099, resolved; inel.pandzic)
- Operator uses "insecure" passwords not passing validation_plugin policies and password_history (K8SPXC-1097, resolved)
- Defining a sidecarVolume yields a broken StatefulSet (K8SPXC-1048, resolved; Tomislav Plavcic)
- Add labels/annotations to services (K8SPXC-1046, resolved)
- Provide mysqld_exporter image (K8SPXC-1045, resolved)
- Liveness check fails when XtraBackup is running and wsrep_sync_wait is set (K8SPXC-1036, resolved; Slava Sarzhan)
- cert-manager certificate renewal is not working after delete+apply (K8SPXC-1030, resolved; dmitriy.kostiuk)
- Can't backup 20k+ tables database; xbcloud: Failed to upload object. Error: Couldn't connect to server on amazonaws.com (K8SPXC-1024, resolved; Slava Sarzhan)
- Enable super_read_only on replicas (K8SPXC-1009)
- Full backups fail with socat error (K8SPXC-1004, resolved; Tomislav Plavcic)
- Log container starts failing with invalid stream_id, could not append content to multiline context (K8SPXC-1002, resolved)
- Misleading backup finished message (K8SPXC-1000, resolved; Slava Sarzhan)
- PODs are running out of memory (K8SPXC-995, resolved; Slava Sarzhan)
- get-pxc-state uses root connection (K8SPXC-994, resolved; Slava Sarzhan)
- Restore is failing: PXC Cluster and XtraBackup versions are not in sync (K8SPXC-993)
- Creating (SST) backups seems to fail on example configuration (K8SPXC-989, resolved; Slava Sarzhan)
- PITR fails due to incorrect binlog filtering logic (K8SPXC-985, resolved; ege.gunes)
- Port 3307 is missing from services (K8SPXC-980, resolved)
- [BUG] xtradb-operator fails to delete the PVCs and secrets if it crashes and restarts in the middle of deleteStatefulSet() (K8SPXC-979, resolved)
- typo `xtrabcupUser` (K8SPXC-975, resolved; Slava Sarzhan)
- Cannot apply annotations, labels, or resource limitations to backup pods (K8SPXC-965, resolved; Tomislav Plavcic)
- Document that both full backup and binlogs should be on S3 (K8SPXC-960, resolved; dmitriy.kostiuk)
- replicasServiceType set in helm chart not passed through to operator (K8SPXC-957, resolved; Tomislav Plavcic)
- HAProxy doesn't allow connections after minimum size of pxc cluster is formed (K8SPXC-951, resolved)
- MySQL broken after adding a sidecar to smart-updated cluster (K8SPXC-950; Slava Sarzhan)
- Resume doesn't work for pxc cluster (K8SPXC-938, resolved)
- Create secret for system users even if 'secretsName' option is commented in CR (K8SPXC-934, resolved; Slava Sarzhan)
- xtradb operator doesn't apply kube-api-access (volume mount) to pxc statefulset (K8SPXC-930, resolved)
- Failed smart update for one cluster makes the operator unusable for other clusters (K8SPXC-926; Andrii Dema)
- Updating the Percona Operator to 1.9.0 or 1.10.0 does not delete existing backup cronjobs (K8SPXC-925, resolved; dmitriy.kostiuk)
- [BUG] Operator always configures validation webhook with namespace percona (K8SPXC-923, resolved)
- CRDs not deployed by helm chart on createCRD=true (K8SPXC-922)
- Pods are not cleaned up when deleting failed backup resources (K8SPXC-921, resolved)
- Backup jobs fail intermittently (K8SPXC-920, resolved; Slava Sarzhan)
- Error after upgrading to v1.10.0 when configured without a proxy enabled (K8SPXC-919, resolved)
- [BUG] xtradb operator does not delete PVC after scaling down, leading to resource leak (K8SPXC-918)
- XtraBackup fails on primary node, causing SST failure (K8SPXC-912, resolved)
- Operator gets into crashloop on OpenShift (K8SPXC-911, resolved; Tomislav Plavcic)
- Operator constantly prints error msg in the logs (K8SPXC-910, resolved; Mykola Marzhan)
- PITR test issues (K8SPXC-905, resolved; Tomislav Plavcic)
- Operator tries to add SYSTEM_USER privilege on 5.7 for monitor user (K8SPXC-890, resolved; Dmytro Zghoba)
- '/var/lib/mysql/pxc-entrypoint.sh': Permission denied error (K8SPXC-879)
DNS resolution problem forces haproxy to remove all pxc nodes, including alive ones
Activity
Nickolay Ihalainen, June 14, 2023 at 4:25 PM
Hi @Slava Sarzhan, thank you for the suggestion, it was helpful for isolating the issue.
The network policy makes it possible to isolate only the haproxy servers from DNS, and existing connections are still able to access the database.
Scaling the coredns deployment to zero pods disables DNS everywhere, including both the haproxy and PXC pods.
PXC pods do not require DNS for the Galera communication, but every new connection is checked against a reverse DNS lookup. This feature is not useful in Kubernetes due to the random nature of the real application's domain names, and it creates redundant load on the DNS servers.
The permanent solution for unstable DNS setups is to disable this reverse lookup with the skip-name-resolve mysql option:
https://dev.mysql.com/doc/refman/8.0/en/host-cache.html
pxc:
  affinity:
    antiAffinityTopologyKey: kubernetes.io/hostname
  autoRecovery: true
  configuration: |
    [mysqld]
    skip-name-resolve
I think we should mention this in the documentation and close the bug without the peer-list changes being applied to the main tree.
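For illustration, the per-connection reverse lookup that mysqld performs when skip-name-resolve is not set can be mimicked with getent. The IP below is a hypothetical one from the TEST-NET-3 documentation range, chosen because it is guaranteed to have no PTR record; a random pod IP under an unstable DNS setup fails the same way:

```shell
# Mimic the reverse lookup mysqld does for each new connection when
# skip-name-resolve is NOT set. 203.0.113.7 is a documentation-only
# (TEST-NET-3) address, so the PTR lookup is expected to fail, just
# like lookups for ephemeral pod IPs when DNS is down or unstable.
client_ip=203.0.113.7
if getent hosts "$client_ip" >/dev/null 2>&1; then
  lookup_result="resolved"
else
  lookup_result="failed"
fi
echo "reverse lookup for $client_ip: $lookup_result"
```

With skip-name-resolve enabled, mysqld skips this lookup entirely and matches grants by IP, which is why hostname-based account entries must be reviewed before enabling it.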
Slava Sarzhan, June 12, 2023 at 3:05 PM
@Nickolay Ihalainen in my test I scaled kube-dns down to 0 to check it.
Nickolay Ihalainen, June 12, 2023 at 2:19 PM
The previous policy was incorrect (only DNS was allowed instead of allowing mysql):
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: deny-dns
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/component: haproxy
  policyTypes:
    - Egress
  egress:
#    - to:
#        - namespaceSelector:
#            matchLabels:
#              kubernetes.io/metadata.name: kube-system
#          podSelector:
#            matchLabels:
#              k8s-app: kube-dns
#      ports:
#        - port: 53
#          protocol: UDP
#        - port: 53
#          protocol: TCP
    - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: pxc
          podSelector:
            matchLabels:
              app.kubernetes.io/part-of: percona-xtradb-cluster
Slava Sarzhan, June 8, 2023 at 3:38 PM
@Nickolay Ihalainen I have tried to improve it, but without any results. As far as I can see, the root of the issue is the liveness probe. As soon as I disable coredns, the probe restarts the pod, and the connection to the DB is interrupted. I have played with the haproxy config (using IPs instead of domain names, experimenting with different options), but I did not find a configuration that can work without DNS.
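A minimal sketch of why this happens (a hypothetical probe, not the operator's actual check script): any liveness check that resolves a pod's DNS name and exits non-zero on lookup failure fails closed during a DNS outage, so kubelet restarts an otherwise healthy container:

```shell
# Hypothetical probe sketch: the check resolves a backend's DNS name
# and exits non-zero on lookup failure. When coredns is down, kubelet
# sees an "unhealthy" container and restarts the pod, killing the
# live client connections along with it.
check_backend() {
  getent hosts "$1" >/dev/null 2>&1
}
if check_backend cluster1-pxc-0.cluster1-pxc.pxc.svc.cluster.local; then
  probe_status="healthy"
else
  probe_status="restart"   # DNS outage => probe failure => pod restart
fi
echo "probe verdict: $probe_status"
```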
Nickolay Ihalainen, June 6, 2023 at 6:37 PM
Hi @Slava Sarzhan,
I've made the same test with custom builds; now peer-list seems to work fine, but the haproxy backend checks are failing.
The test case is the same as before:
1. Create a connection to mysql via haproxy
2. Execute queries in a loop using this connection (do not open new mysql connections)
3. Stop coredns or filter port 53 traffic with a NetworkPolicy for the haproxy pod:
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: deny-dns
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/component: haproxy
  policyTypes:
    - Egress
  egress:
    - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: kube-system
          podSelector:
            matchLabels:
              k8s-app: kube-dns
      ports:
        - port: 53
          protocol: UDP
        - port: 53
          protocol: TCP
bash-4.4$ while true ; do echo 'SELECT NOW();' ; sleep 1 ; done|mysql -N -uroot -p$MYSQL_ROOT_PASSWORD -h cluster1-haproxy
...
2023-06-06 17:57:32
2023-06-06 17:57:33
2023-06-06 17:57:34
ERROR 2013 (HY000) at line 153: Lost connection to MySQL server during query
PXC node 10.42.2.15 for backend galera-mysqlx-nodes is ok
[WARNING] (1) : Process 924 exited with code 0 (Exit)
ERROR 2003 (HY000): Can't connect to MySQL server on '10.42.0.7:33062' (111)
The following values are used for PXC node 10.42.0.7 in backend galera-nodes:
wsrep_local_state is ; pxc_maint_mod is ; wsrep_cluster_status is ; 3 nodes are available
PXC node 10.42.0.7 for backend galera-nodes is not ok
ERROR 2003 (HY000): Can't connect to MySQL server on '10.42.1.7:33062' (111)
The following values are used for PXC node 10.42.1.7 in backend galera-nodes:
wsrep_local_state is ; pxc_maint_mod is ; wsrep_cluster_status is ; 3 nodes are available
PXC node 10.42.1.7 for backend galera-nodes is not ok
ERROR 2003 (HY000): Can't connect to MySQL server on '10.42.2.15:33062' (111)
The following values are used for PXC node 10.42.2.15 in backend galera-nodes:
wsrep_local_state is ; pxc_maint_mod is ; wsrep_cluster_status is ; 3 nodes are available
PXC node 10.42.2.15 for backend galera-nodes is not ok
[WARNING] (1) : Process 990 exited with code 0 (Exit)
ERROR 2003 (HY000): Can't connect to MySQL server on '10.42.0.7:33062' (111)
The following values are used for PXC node 10.42.0.7 in backend galera-replica-nodes:
wsrep_local_state is ; pxc_maint_mod is ; wsrep_cluster_status is ; 3 nodes are available
PXC node 10.42.0.7 for backend galera-replica-nodes is not ok
ERROR 2003 (HY000): Can't connect to MySQL server on '10.42.2.15:33062' (111)
The following values are used for PXC node 10.42.2.15 in backend galera-admin-nodes:
wsrep_local_state is ; pxc_maint_mod is ; wsrep_cluster_status is ; 3 nodes are available
PXC node 10.42.2.15 for backend galera-admin-nodes is not ok
[WARNING] (1) : Process 1056 exited with code 0 (Exit)
[WARNING] (1) : Process 1079 exited with code 0 (Exit)
ERROR 2003 (HY000): Can't connect to MySQL server on '10.42.0.7:33062' (111)
The following values are used for PXC node 10.42.0.7 in backend galera-nodes:
wsrep_local_state is ; pxc_maint_mod is ; wsrep_cluster_status is ; 3 nodes are available
PXC node 10.42.0.7 for backend galera-nodes is not ok
[pod/cluster1-haproxy-0/haproxy] wsrep_local_state is 4; pxc_maint_mod is DISABLED; wsrep_cluster_status is Primary; 3 nodes are available
[pod/cluster1-haproxy-0/haproxy] PXC node 10.42.0.7 for backend galera-replica-nodes is ok
[pod/cluster1-haproxy-0/haproxy] [WARNING] (1) : Process 349 exited with code 0 (Exit)
[pod/cluster1-haproxy-0/haproxy] [WARNING] (1) : Process 415 exited with code 0 (Exit)
[pod/cluster1-haproxy-0/haproxy] [WARNING] (19) : kill 332
[pod/cluster1-haproxy-0/haproxy] [WARNING] (1) : Process 335 exited with code 0 (Exit)
[pod/cluster1-haproxy-0/haproxy] [WARNING] (19) : kill 354
[pod/cluster1-haproxy-0/haproxy] [WARNING] (1) : Process 357 exited with code 0 (Exit)
[pod/cluster1-haproxy-0/haproxy] [WARNING] (1) : Process 472 exited with code 0 (Exit)
[pod/cluster1-haproxy-0/haproxy] [WARNING] (19) : kill 363
[pod/cluster1-haproxy-0/haproxy] [WARNING] (1) : Process 366 exited with code 0 (Exit)
[pod/cluster1-haproxy-0/haproxy] [WARNING] (19) : kill 372
[pod/cluster1-haproxy-0/haproxy] [WARNING] (1) : Process 375 exited with code 0 (Exit)
[pod/cluster1-haproxy-0/haproxy] [WARNING] (1) : Process 484 exited with code 0 (Exit)
[pod/cluster1-haproxy-0/haproxy] [WARNING] (19) : kill 381
[pod/cluster1-haproxy-0/haproxy] [WARNING] (1) : Process 384 exited with code 0 (Exit)
[pod/cluster1-haproxy-0/haproxy] [WARNING] (19) : kill 390
[pod/cluster1-haproxy-0/haproxy] [WARNING] (1) : Process 393 exited with code 0 (Exit)
[pod/cluster1-haproxy-0/haproxy] [WARNING] (19) : kill 399
[pod/cluster1-haproxy-0/haproxy] [WARNING] (1) : Process 402 exited with code 0 (Exit)
[pod/cluster1-haproxy-0/haproxy] [WARNING] (19) : kill 420
[pod/cluster1-haproxy-0/haproxy] [WARNING] (1) : Process 423 exited with code 0 (Exit)
[pod/cluster1-haproxy-0/haproxy] [WARNING] (1) : Process 496 exited with code 0 (Exit)
[pod/cluster1-haproxy-0/haproxy] [WARNING] (19) : kill 429
[pod/cluster1-haproxy-0/haproxy] [WARNING] (1) : Process 432 exited with code 0 (Exit)
[pod/cluster1-haproxy-0/haproxy] [WARNING] (19) : kill 438
[pod/cluster1-haproxy-0/haproxy] [WARNING] (1) : Process 441 exited with code 0 (Exit)
[pod/cluster1-haproxy-0/haproxy] [WARNING] (19) : kill 447
[pod/cluster1-haproxy-0/haproxy] [WARNING] (1) : Process 450 exited with code 0 (Exit)
[pod/cluster1-haproxy-0/haproxy] [WARNING] (19) : kill 456
[pod/cluster1-haproxy-0/haproxy] [WARNING] (1) : Process 459 exited with code 0 (Exit)
[pod/cluster1-haproxy-0/haproxy] [WARNING] (1) : Process 517 exited with code 0 (Exit)
[pod/cluster1-haproxy-0/haproxy] [WARNING] (1) : Process 583 exited with code 0 (Exit)
[pod/cluster1-haproxy-0/haproxy] [WARNING] (19) : kill 501
[pod/cluster1-haproxy-0/haproxy] [WARNING] (19) : Server galera-replica-nodes/cluster1-pxc-1 is DOWN, reason: External check timeout, code: 0, check duration: 10002ms. 2 active and 0 backup servers left. 0 sess>
[pod/cluster1-haproxy-0/haproxy] [WARNING] (1) : Process 504 exited with code 0 (Exit)
[pod/cluster1-haproxy-0/haproxy] [WARNING] (19) : kill 522
[pod/cluster1-haproxy-0/haproxy] [WARNING] (19) : Server galera-replica-nodes/cluster1-pxc-2 is DOWN, reason: External check timeout, code: 0, check duration: 10001ms. 1 active and 0 backup servers left. 0 sess>
[pod/cluster1-haproxy-0/haproxy] [WARNING] (1) : Process 525 exited with code 0 (Exit)
[pod/cluster1-haproxy-0/pxc-monit] PXC node cluster1-pxc-0.cluster1-pxc.pxc.svc.cluster.local for backend is ok
[pod/cluster1-haproxy-0/pxc-monit] + for backup_server in ${NODE_LIST_BACKUP[@]}
[pod/cluster1-haproxy-0/pxc-monit] + echo 'shutdown sessions server galera-nodes/cluster1-pxc-1'
[pod/cluster1-haproxy-0/pxc-monit] + socat stdio /etc/haproxy/pxc/haproxy.sock
[pod/cluster1-haproxy-0/pxc-monit] No such server.
[pod/cluster1-haproxy-0/pxc-monit]
[pod/cluster1-haproxy-0/pxc-monit] + for backup_server in ${NODE_LIST_BACKUP[@]}
[pod/cluster1-haproxy-0/pxc-monit] + echo 'shutdown sessions server galera-admin-nodes/cluster1-pxc-1'
[pod/cluster1-haproxy-0/pxc-monit] + socat stdio /etc/haproxy/pxc/haproxy.sock
[pod/cluster1-haproxy-0/pxc-monit] No such server.
[pod/cluster1-haproxy-0/pxc-monit]
[pod/cluster1-haproxy-0/pxc-monit] + for backup_server in ${NODE_LIST_BACKUP[@]}
[pod/cluster1-haproxy-0/pxc-monit] + echo 'shutdown sessions server galera-nodes/cluster1-pxc-2'
[pod/cluster1-haproxy-0/pxc-monit] + socat stdio /etc/haproxy/pxc/haproxy.sock
[pod/cluster1-haproxy-0/pxc-monit] No such server.
[pod/cluster1-haproxy-0/pxc-monit]
[pod/cluster1-haproxy-0/pxc-monit] + for backup_server in ${NODE_LIST_BACKUP[@]}
[pod/cluster1-haproxy-0/pxc-monit] + echo 'shutdown sessions server galera-admin-nodes/cluster1-pxc-2'
[pod/cluster1-haproxy-0/pxc-monit] + socat stdio /etc/haproxy/pxc/haproxy.sock
[pod/cluster1-haproxy-0/pxc-monit] No such server.
[pod/cluster1-haproxy-0/pxc-monit]
[pod/cluster1-haproxy-0/pxc-monit] + '[' -S /etc/haproxy/pxc/haproxy-main.sock ']'
[pod/cluster1-haproxy-0/pxc-monit] + echo reload
[pod/cluster1-haproxy-0/pxc-monit] + socat stdio /etc/haproxy/pxc/haproxy-main.sock
[pod/cluster1-haproxy-0/pxc-monit] + exit 0
[pod/cluster1-haproxy-0/pxc-monit] 2023/06/06 17:55:55 lookup cluster1-pxc on 10.43.0.10:53: read udp 10.42.1.8:44126->10.43.0.10:53: i/o timeout
[pod/cluster1-haproxy-0/pxc-monit] 2023/06/06 17:56:36 lookup cluster1-pxc on 10.43.0.10:53: read udp 10.42.1.8:40623->10.43.0.10:53: i/o timeout
[pod/cluster1-haproxy-0/pxc-monit] 2023/06/06 17:57:17 lookup cluster1-pxc on 10.43.0.10:53: read udp 10.42.1.8:37546->10.43.0.10:53: i/o timeout
[pod/cluster1-haproxy-0/pxc-monit] 2023/06/06 17:57:58 lookup cluster1-pxc on 10.43.0.10:53: read udp 10.42.1.8:38746->10.43.0.10:53: i/o timeout
[pod/cluster1-haproxy-0/pxc-monit] 2023/06/06 17:58:34 lookup cluster1-pxc on 10.43.0.10:53: no such host
[pod/cluster1-haproxy-0/pxc-monit] 2023/06/06 17:58:35 Peer list updated
[pod/cluster1-haproxy-0/pxc-monit] was []
[pod/cluster1-haproxy-0/pxc-monit] now [cluster1-pxc-0.cluster1-pxc.pxc.svc.cluster.local cluster1-pxc-2.cluster1-pxc.pxc.svc.cluster.local]
[pod/cluster1-haproxy-0/pxc-monit] 2023/06/06 17:58:35 execing: /usr/bin/add_pxc_nodes.sh with stdin: cluster1-pxc-0.cluster1-pxc.pxc.svc.cluster.local
[pod/cluster1-haproxy-0/pxc-monit] cluster1-pxc-2.cluster1-pxc.pxc.svc.cluster.local
[pod/cluster1-haproxy-0/pxc-monit] 2023/06/06 17:58:35 Failed to execute /usr/bin/add_pxc_nodes.sh: + main
[pod/cluster1-haproxy-0/pxc-monit] + echo 'Running /usr/bin/add_pxc_nodes.sh'
[pod/cluster1-haproxy-0/pxc-monit] Running /usr/bin/add_pxc_nodes.sh
[pod/cluster1-haproxy-0/pxc-monit] + NODE_LIST=()
[pod/cluster1-haproxy-0/pxc-monit] + NODE_LIST_REPL=()
[pod/cluster1-haproxy-0/pxc-monit] + NODE_LIST_MYSQLX=()
[pod/cluster1-haproxy-0/pxc-monit] + NODE_LIST_ADMIN=()
[pod/cluster1-haproxy-0/pxc-monit] + NODE_LIST_BACKUP=()
[pod/cluster1-haproxy-0/pxc-monit] + firs_node=
[pod/cluster1-haproxy-0/pxc-monit] + firs_node_admin=
[pod/cluster1-haproxy-0/pxc-monit] + main_node=
[pod/cluster1-haproxy-0/pxc-monit] + SERVER_OPTIONS='check inter 10000 rise 1 fall 2 weight 1'
[pod/cluster1-haproxy-0/pxc-monit] + send_proxy=
[pod/cluster1-haproxy-0/pxc-monit] + path_to_haproxy_cfg=/etc/haproxy/pxc
[pod/cluster1-haproxy-0/pxc-monit] + [[ '' = \y\e\s ]]
[pod/cluster1-haproxy-0/pxc-monit] + read pxc_host
[pod/cluster1-haproxy-0/pxc-monit] + '[' -z cluster1-pxc-0.cluster1-pxc.pxc.svc.cluster.local ']'
[pod/cluster1-haproxy-0/pxc-monit] ++ echo cluster1-pxc-0.cluster1-pxc.pxc.svc.cluster.local
[pod/cluster1-haproxy-0/pxc-monit] ++ cut -d . -f -1
[pod/cluster1-haproxy-0/pxc-monit] + node_name=cluster1-pxc-0
[pod/cluster1-haproxy-0/pxc-monit] ++ echo cluster1-pxc-0
[pod/cluster1-haproxy-0/pxc-monit] ++ awk F '{print $NF}'
[pod/cluster1-haproxy-0/pxc-monit] + node_id=0
[pod/cluster1-haproxy-0/pxc-monit] + NODE_LIST_REPL+=("server $node_name $pxc_host:3306 $send_proxy $SERVER_OPTIONS")
[pod/cluster1-haproxy-0/pxc-monit] + '[' x0 == x0 ']'
[pod/cluster1-haproxy-0/pxc-monit] + main_node=cluster1-pxc-0.cluster1-pxc.pxc.svc.cluster.local
[pod/cluster1-haproxy-0/pxc-monit] + firs_node='server cluster1-pxc-0 cluster1-pxc-0.cluster1-pxc.pxc.svc.cluster.local:3306 check inter 10000 rise 1 fall 2 weight 1 on-marked-up shutdown-backup-sessions'
[pod/cluster1-haproxy-0/pxc-monit] + firs_node_admin='server cluster1-pxc-0 cluster1-pxc-0.cluster1-pxc.pxc.svc.cluster.local:33062 check inter 10000 rise 1 fall 2 weight 1 on-marked-up shutdown-backup-sessions'
[pod/cluster1-haproxy-0/pxc-monit] + firs_node_mysqlx='server cluster1-pxc-0 cluster1-pxc-0.cluster1-pxc.pxc.svc.cluster.local:33060 check inter 10000 rise 1 fall 2 weight 1 on-marked-up shutdown-backup-sessions'
[pod/cluster1-haproxy-0/pxc-monit] + continue
[pod/cluster1-haproxy-0/pxc-monit] + read pxc_host
[pod/cluster1-haproxy-0/pxc-monit] + '[' -z cluster1-pxc-2.cluster1-pxc.pxc.svc.cluster.local ']'
[pod/cluster1-haproxy-0/pxc-monit] ++ echo cluster1-pxc-2.cluster1-pxc.pxc.svc.cluster.local
The test case is the same as in:
https://jira.percona.com/browse/K8SPXC-1216
If the dead node runs coredns, it causes DNS timeouts:
kube-system   coredns-7796b77cd4-nz9f9   1/1   Terminating   0   3h6m   10.42.1.2   k3d-ihanick-cluster1-agent-1   <none>   <none>
pxc           cluster1-pxc-2             3/3   Terminating   0   139m   10.42.1.8   k3d-ihanick-cluster1-agent-1   <none>   <none>
2023/03/10 14:58:34 lookup cluster1-pxc on 10.43.0.10:53: read udp 10.42.0.6:55441->10.43.0.10:53: i/o timeout
2023/03/10 14:58:34 Peer list updated
was [cluster1-pxc-0.cluster1-pxc.pxc.svc.cluster.local cluster1-pxc-1.cluster1-pxc.pxc.svc.cluster.local cluster1-pxc-2.cluster1-pxc.pxc.svc.cluster.local]
now []
And all haproxy nodes go down, while the cluster is still alive (it has two pxc members in the ready state):
kubectl -n pxc get pods
NAME                                               READY   STATUS    RESTARTS        AGE
percona-xtradb-cluster-operator-566848cf48-s6lm4   1/1     Running   0               3h4m
cluster1-pxc-1                                     3/3     Running   0               136m
cluster1-pxc-0                                     3/3     Running   0               136m
cluster1-pxc-2                                     3/3     Running   0               138m
cluster1-haproxy-0                                 1/2     Running   1 (2m50s ago)   143m
cluster1-haproxy-2                                 1/2     Running   1 (2m50s ago)   144m
cluster1-haproxy-1                                 1/2     Running   1 (2m50s ago)   144m
There is a similar issue, https://jira.percona.com/browse/K8SPXC-953, but it shouldn't be related to this problem, because here the fault is simulated by killing pxc+haproxy pods, not by putting k8s workers down.
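The failure mode above can be sketched in plain bash (a hypothetical simplification of the Go peer watcher, which rebuilds the backend list from a DNS lookup of the headless service): a failed or timed-out lookup yields an empty peer list, and every backend is removed even though the PXC pods are still alive.

```shell
# Hypothetical sketch of the peer-list logic: the node list is rebuilt
# from a DNS lookup of the headless service name. If the lookup fails
# or times out, the list comes back empty and all backends are removed,
# even though the PXC pods themselves are still running.
resolve_peers() {
  getent hosts "$1" 2>/dev/null | awk '{print $2}'
}
peers=$(resolve_peers cluster1-pxc.pxc.svc.cluster.local)
if [ -z "$peers" ]; then
  action="remove all backends"
else
  action="keep $(echo "$peers" | wc -l) backends"
fi
echo "peer list: [${peers}] -> ${action}"
```

A more defensive design would treat a lookup error differently from an authoritative empty answer, keeping the last known peer list until DNS recovers.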