
DNS resolution problem forces haproxy to remove all pxc nodes, including alive ones

Done

Description

The test case is the same as in:
https://jira.percona.com/browse/K8SPXC-1216

If the dead node was running CoreDNS, it causes DNS timeouts:

kube-system   coredns-7796b77cd4-nz9f9   1/1   Terminating   0   3h6m   10.42.1.2   k3d-ihanick-cluster1-agent-1   <none>   <none>
pxc           cluster1-pxc-2             3/3   Terminating   0   139m   10.42.1.8   k3d-ihanick-cluster1-agent-1   <none>   <none>

2023/03/10 14:58:34 lookup cluster1-pxc on 10.43.0.10:53: read udp 10.42.0.6:55441->10.43.0.10:53: i/o timeout
2023/03/10 14:58:34 Peer list updated was [cluster1-pxc-0.cluster1-pxc.pxc.svc.cluster.local cluster1-pxc-1.cluster1-pxc.pxc.svc.cluster.local cluster1-pxc-2.cluster1-pxc.pxc.svc.cluster.local] now []

All haproxy nodes go down while the cluster is still alive (it has two pxc members in the Ready state):

kubectl -n pxc get pods
NAME                                               READY   STATUS    RESTARTS        AGE
percona-xtradb-cluster-operator-566848cf48-s6lm4   1/1     Running   0               3h4m
cluster1-pxc-1                                     3/3     Running   0               136m
cluster1-pxc-0                                     3/3     Running   0               136m
cluster1-pxc-2                                     3/3     Running   0               138m
cluster1-haproxy-0                                 1/2     Running   1 (2m50s ago)   143m
cluster1-haproxy-2                                 1/2     Running   1 (2m50s ago)   144m
cluster1-haproxy-1                                 1/2     Running   1 (2m50s ago)   144m

There is a similar issue, https://jira.percona.com/browse/K8SPXC-953,
but it shouldn't be related to this problem, because here the fault is simulated by killing pxc+haproxy pods, not by shutting down k8s workers.

Environment

None

AFFECTED CS IDs

CS0034556

Details

Assignee

Reporter

Needs QA

Yes

Needs Doc

Yes

Fix versions

Affects versions

Priority

Smart Checklist

Created March 10, 2023 at 3:20 PM
Updated March 5, 2024 at 5:27 PM
Resolved July 11, 2023 at 6:37 PM

Activity

Nickolay IhalainenJune 14, 2023 at 4:25 PM

Hi, thank you for the suggestion; it was helpful for isolating the issue.

The network policy makes it possible to isolate only the haproxy servers from DNS, and existing connections are still able to access the database.
Scaling the coredns deployment to zero pods disables DNS everywhere, for both haproxy and PXC pods.
PXC pods do not require DNS for Galera communication, but every new connection is checked against a reverse DNS lookup. This feature is not useful on Kubernetes, because the domain names of the real application are effectively random, and it creates redundant load on the DNS servers.

The permanent solution for unstable DNS setups is to disable this reverse lookup with the skip-name-resolve mysql option:
https://dev.mysql.com/doc/refman/8.0/en/host-cache.html

pxc:
  affinity:
    antiAffinityTopologyKey: kubernetes.io/hostname
  autoRecovery: true
  configuration: |
    [mysqld]
    skip-name-resolve
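Once the configuration above has been applied and the pods restarted, the effect can be verified from any client session; skip_name_resolve is a standard MySQL system variable:

```sql
-- Should report ON when reverse DNS lookups are disabled
SHOW GLOBAL VARIABLES LIKE 'skip_name_resolve';
```

With the variable ON, the server stops resolving client addresses on new connections, so the DNS outage no longer affects connection handling.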

I think we should mention this in the documentation and close the bug without applying the peer-list changes to the main tree.

Slava SarzhanJune 12, 2023 at 3:05 PM

In my test I scaled kube-dns down to 0 to check it.

Nickolay IhalainenJune 12, 2023 at 2:19 PM

The previous policy was incorrect (it only allowed DNS instead of allowing mysql):

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: deny-dns
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/component: haproxy
  policyTypes:
    - Egress
  egress:
  # - to:
  #     - namespaceSelector:
  #         matchLabels:
  #           kubernetes.io/metadata.name: kube-system
  #       podSelector:
  #         matchLabels:
  #           k8s-app: kube-dns
  #   ports:
  #     - port: 53
  #       protocol: UDP
  #     - port: 53
  #       protocol: TCP
    - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: pxc
          podSelector:
            matchLabels:
              app.kubernetes.io/part-of: percona-xtradb-cluster

Slava SarzhanJune 8, 2023 at 3:38 PM

I have tried to improve it, but without any results. As far as I can see, the root of the issue is the liveness probe: as soon as I disable coredns, the probe restarts the pod and the connection to the DB is interrupted. I have played with the haproxy config (using IPs instead of domain names, experimenting with different options), but I did not find a configuration that can work without DNS.
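For context, haproxy's own mechanism for tolerating DNS outages is a runtime resolvers section combined with init-addr, which lets a server keep its last known address when resolution fails. A sketch of that direction, using the CoreDNS service IP from the logs above; this is not the operator's shipped configuration and was not confirmed to solve this case:

```
resolvers kubedns
    nameserver dns1 10.43.0.10:53
    resolve_retries 3
    timeout resolve 1s
    timeout retry   1s
    hold valid      10s
    hold obsolete   30s

backend galera-nodes
    # keep the last resolved address instead of failing when DNS is down
    default-server init-addr last,libc,none
    server cluster1-pxc-0 cluster1-pxc-0.cluster1-pxc.pxc.svc.cluster.local:3306 check resolvers kubedns
```

Even with runtime resolution, the external check scripts and the liveness probe still perform their own lookups, which may be why no haproxy-only configuration survived the outage.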

Nickolay IhalainenJune 6, 2023 at 6:37 PM

Hi

I've made the same test with custom builds; peer-list now seems to work fine, but the haproxy backend checks are failing.

The test case is the same as before:
1. Create a connection to mysql via haproxy
2. execute queries in a loop using this connection (do not open new mysql connections)
3. Stop coredns or filter port 53 traffic with a NetworkPolicy for the haproxy pod

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: deny-dns
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/component: haproxy
  policyTypes:
    - Egress
  egress:
    - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: kube-system
          podSelector:
            matchLabels:
              k8s-app: kube-dns
      ports:
        - port: 53
          protocol: UDP
        - port: 53
          protocol: TCP

bash-4.4$ while true ; do echo 'SELECT NOW();' ; sleep 1 ; done | mysql -N -uroot -p$MYSQL_ROOT_PASSWORD -h cluster1-haproxy
...
2023-06-06 17:57:32
2023-06-06 17:57:33
2023-06-06 17:57:34
ERROR 2013 (HY000) at line 153: Lost connection to MySQL server during query

PXC node 10.42.2.15 for backend galera-mysqlx-nodes is ok
[WARNING] (1) : Process 924 exited with code 0 (Exit)
ERROR 2003 (HY000): Can't connect to MySQL server on '10.42.0.7:33062' (111)
The following values are used for PXC node 10.42.0.7 in backend galera-nodes:
wsrep_local_state is ; pxc_maint_mod is ; wsrep_cluster_status is ; 3 nodes are available
PXC node 10.42.0.7 for backend galera-nodes is not ok
ERROR 2003 (HY000): Can't connect to MySQL server on '10.42.1.7:33062' (111)
The following values are used for PXC node 10.42.1.7 in backend galera-nodes:
wsrep_local_state is ; pxc_maint_mod is ; wsrep_cluster_status is ; 3 nodes are available
PXC node 10.42.1.7 for backend galera-nodes is not ok
ERROR 2003 (HY000): Can't connect to MySQL server on '10.42.2.15:33062' (111)
The following values are used for PXC node 10.42.2.15 in backend galera-nodes:
wsrep_local_state is ; pxc_maint_mod is ; wsrep_cluster_status is ; 3 nodes are available
PXC node 10.42.2.15 for backend galera-nodes is not ok
[WARNING] (1) : Process 990 exited with code 0 (Exit)
ERROR 2003 (HY000): Can't connect to MySQL server on '10.42.0.7:33062' (111)
The following values are used for PXC node 10.42.0.7 in backend galera-replica-nodes:
wsrep_local_state is ; pxc_maint_mod is ; wsrep_cluster_status is ; 3 nodes are available
PXC node 10.42.0.7 for backend galera-replica-nodes is not ok
ERROR 2003 (HY000): Can't connect to MySQL server on '10.42.2.15:33062' (111)
The following values are used for PXC node 10.42.2.15 in backend galera-admin-nodes:
wsrep_local_state is ; pxc_maint_mod is ; wsrep_cluster_status is ; 3 nodes are available
PXC node 10.42.2.15 for backend galera-admin-nodes is not ok
[WARNING] (1) : Process 1056 exited with code 0 (Exit)
[WARNING] (1) : Process 1079 exited with code 0 (Exit)
ERROR 2003 (HY000): Can't connect to MySQL server on '10.42.0.7:33062' (111)
The following values are used for PXC node 10.42.0.7 in backend galera-nodes:
wsrep_local_state is ; pxc_maint_mod is ; wsrep_cluster_status is ; 3 nodes are available
PXC node 10.42.0.7 for backend galera-nodes is not ok

[pod/cluster1-haproxy-0/haproxy] wsrep_local_state is 4; pxc_maint_mod is DISABLED; wsrep_cluster_status is Primary; 3 nodes are available
[pod/cluster1-haproxy-0/haproxy] PXC node 10.42.0.7 for backend galera-replica-nodes is ok
[pod/cluster1-haproxy-0/haproxy] [WARNING] (1) : Process 349 exited with code 0 (Exit)
[pod/cluster1-haproxy-0/haproxy] [WARNING] (1) : Process 415 exited with code 0 (Exit)
[pod/cluster1-haproxy-0/haproxy] [WARNING] (19) : kill 332
[pod/cluster1-haproxy-0/haproxy] [WARNING] (1) : Process 335 exited with code 0 (Exit)
[pod/cluster1-haproxy-0/haproxy] [WARNING] (19) : kill 354
[pod/cluster1-haproxy-0/haproxy] [WARNING] (1) : Process 357 exited with code 0 (Exit)
[pod/cluster1-haproxy-0/haproxy] [WARNING] (1) : Process 472 exited with code 0 (Exit)
[pod/cluster1-haproxy-0/haproxy] [WARNING] (19) : kill 363
[pod/cluster1-haproxy-0/haproxy] [WARNING] (1) : Process 366 exited with code 0 (Exit)
[pod/cluster1-haproxy-0/haproxy] [WARNING] (19) : kill 372
[pod/cluster1-haproxy-0/haproxy] [WARNING] (1) : Process 375 exited with code 0 (Exit)
[pod/cluster1-haproxy-0/haproxy] [WARNING] (1) : Process 484 exited with code 0 (Exit)
[pod/cluster1-haproxy-0/haproxy] [WARNING] (19) : kill 381
[pod/cluster1-haproxy-0/haproxy] [WARNING] (1) : Process 384 exited with code 0 (Exit)
[pod/cluster1-haproxy-0/haproxy] [WARNING] (19) : kill 390
[pod/cluster1-haproxy-0/haproxy] [WARNING] (1) : Process 393 exited with code 0 (Exit)
[pod/cluster1-haproxy-0/haproxy] [WARNING] (19) : kill 399
[pod/cluster1-haproxy-0/haproxy] [WARNING] (1) : Process 402 exited with code 0 (Exit)
[pod/cluster1-haproxy-0/haproxy] [WARNING] (19) : kill 420
[pod/cluster1-haproxy-0/haproxy] [WARNING] (1) : Process 423 exited with code 0 (Exit)
[pod/cluster1-haproxy-0/haproxy] [WARNING] (1) : Process 496 exited with code 0 (Exit)
[pod/cluster1-haproxy-0/haproxy] [WARNING] (19) : kill 429
[pod/cluster1-haproxy-0/haproxy] [WARNING] (1) : Process 432 exited with code 0 (Exit)
[pod/cluster1-haproxy-0/haproxy] [WARNING] (19) : kill 438
[pod/cluster1-haproxy-0/haproxy] [WARNING] (1) : Process 441 exited with code 0 (Exit)
[pod/cluster1-haproxy-0/haproxy] [WARNING] (19) : kill 447
[pod/cluster1-haproxy-0/haproxy] [WARNING] (1) : Process 450 exited with code 0 (Exit)
[pod/cluster1-haproxy-0/haproxy] [WARNING] (19) : kill 456
[pod/cluster1-haproxy-0/haproxy] [WARNING] (1) : Process 459 exited with code 0 (Exit)
[pod/cluster1-haproxy-0/haproxy] [WARNING] (1) : Process 517 exited with code 0 (Exit)
[pod/cluster1-haproxy-0/haproxy] [WARNING] (1) : Process 583 exited with code 0 (Exit)
[pod/cluster1-haproxy-0/haproxy] [WARNING] (19) : kill 501
[pod/cluster1-haproxy-0/haproxy] [WARNING] (19) : Server galera-replica-nodes/cluster1-pxc-1 is DOWN, reason: External check timeout, code: 0, check duration: 10002ms. 2 active and 0 backup servers left. 0 sess>
[pod/cluster1-haproxy-0/haproxy] [WARNING] (1) : Process 504 exited with code 0 (Exit)
[pod/cluster1-haproxy-0/haproxy] [WARNING] (19) : kill 522
[pod/cluster1-haproxy-0/haproxy] [WARNING] (19) : Server galera-replica-nodes/cluster1-pxc-2 is DOWN, reason: External check timeout, code: 0, check duration: 10001ms. 1 active and 0 backup servers left. 0 sess>
[pod/cluster1-haproxy-0/haproxy] [WARNING] (1) : Process 525 exited with code 0 (Exit)

[pod/cluster1-haproxy-0/pxc-monit] PXC node cluster1-pxc-0.cluster1-pxc.pxc.svc.cluster.local for backend is ok
[pod/cluster1-haproxy-0/pxc-monit] + for backup_server in ${NODE_LIST_BACKUP[@]}
[pod/cluster1-haproxy-0/pxc-monit] + echo 'shutdown sessions server galera-nodes/cluster1-pxc-1'
[pod/cluster1-haproxy-0/pxc-monit] + socat stdio /etc/haproxy/pxc/haproxy.sock
[pod/cluster1-haproxy-0/pxc-monit] No such server.
[pod/cluster1-haproxy-0/pxc-monit]
[pod/cluster1-haproxy-0/pxc-monit] + for backup_server in ${NODE_LIST_BACKUP[@]}
[pod/cluster1-haproxy-0/pxc-monit] + echo 'shutdown sessions server galera-admin-nodes/cluster1-pxc-1'
[pod/cluster1-haproxy-0/pxc-monit] + socat stdio /etc/haproxy/pxc/haproxy.sock
[pod/cluster1-haproxy-0/pxc-monit] No such server.
[pod/cluster1-haproxy-0/pxc-monit]
[pod/cluster1-haproxy-0/pxc-monit] + for backup_server in ${NODE_LIST_BACKUP[@]}
[pod/cluster1-haproxy-0/pxc-monit] + echo 'shutdown sessions server galera-nodes/cluster1-pxc-2'
[pod/cluster1-haproxy-0/pxc-monit] + socat stdio /etc/haproxy/pxc/haproxy.sock
[pod/cluster1-haproxy-0/pxc-monit] No such server.
[pod/cluster1-haproxy-0/pxc-monit]
[pod/cluster1-haproxy-0/pxc-monit] + for backup_server in ${NODE_LIST_BACKUP[@]}
[pod/cluster1-haproxy-0/pxc-monit] + echo 'shutdown sessions server galera-admin-nodes/cluster1-pxc-2'
[pod/cluster1-haproxy-0/pxc-monit] + socat stdio /etc/haproxy/pxc/haproxy.sock
[pod/cluster1-haproxy-0/pxc-monit] No such server.
[pod/cluster1-haproxy-0/pxc-monit]
[pod/cluster1-haproxy-0/pxc-monit] + '[' -S /etc/haproxy/pxc/haproxy-main.sock ']'
[pod/cluster1-haproxy-0/pxc-monit] + echo reload
[pod/cluster1-haproxy-0/pxc-monit] + socat stdio /etc/haproxy/pxc/haproxy-main.sock
[pod/cluster1-haproxy-0/pxc-monit] + exit 0
[pod/cluster1-haproxy-0/pxc-monit] 2023/06/06 17:55:55 lookup cluster1-pxc on 10.43.0.10:53: read udp 10.42.1.8:44126->10.43.0.10:53: i/o timeout
[pod/cluster1-haproxy-0/pxc-monit] 2023/06/06 17:56:36 lookup cluster1-pxc on 10.43.0.10:53: read udp 10.42.1.8:40623->10.43.0.10:53: i/o timeout
[pod/cluster1-haproxy-0/pxc-monit] 2023/06/06 17:57:17 lookup cluster1-pxc on 10.43.0.10:53: read udp 10.42.1.8:37546->10.43.0.10:53: i/o timeout
[pod/cluster1-haproxy-0/pxc-monit] 2023/06/06 17:57:58 lookup cluster1-pxc on 10.43.0.10:53: read udp 10.42.1.8:38746->10.43.0.10:53: i/o timeout
[pod/cluster1-haproxy-0/pxc-monit] 2023/06/06 17:58:34 lookup cluster1-pxc on 10.43.0.10:53: no such host
[pod/cluster1-haproxy-0/pxc-monit] 2023/06/06 17:58:35 Peer list updated
[pod/cluster1-haproxy-0/pxc-monit] was []
[pod/cluster1-haproxy-0/pxc-monit] now [cluster1-pxc-0.cluster1-pxc.pxc.svc.cluster.local cluster1-pxc-2.cluster1-pxc.pxc.svc.cluster.local]
[pod/cluster1-haproxy-0/pxc-monit] 2023/06/06 17:58:35 execing: /usr/bin/add_pxc_nodes.sh with stdin: cluster1-pxc-0.cluster1-pxc.pxc.svc.cluster.local
[pod/cluster1-haproxy-0/pxc-monit] cluster1-pxc-2.cluster1-pxc.pxc.svc.cluster.local
[pod/cluster1-haproxy-0/pxc-monit] 2023/06/06 17:58:35 Failed to execute /usr/bin/add_pxc_nodes.sh: + main
[pod/cluster1-haproxy-0/pxc-monit] + echo 'Running /usr/bin/add_pxc_nodes.sh'
[pod/cluster1-haproxy-0/pxc-monit] Running /usr/bin/add_pxc_nodes.sh
[pod/cluster1-haproxy-0/pxc-monit] + NODE_LIST=()
[pod/cluster1-haproxy-0/pxc-monit] + NODE_LIST_REPL=()
[pod/cluster1-haproxy-0/pxc-monit] + NODE_LIST_MYSQLX=()
[pod/cluster1-haproxy-0/pxc-monit] + NODE_LIST_ADMIN=()
[pod/cluster1-haproxy-0/pxc-monit] + NODE_LIST_BACKUP=()
[pod/cluster1-haproxy-0/pxc-monit] + firs_node=
[pod/cluster1-haproxy-0/pxc-monit] + firs_node_admin=
[pod/cluster1-haproxy-0/pxc-monit] + main_node=
[pod/cluster1-haproxy-0/pxc-monit] + SERVER_OPTIONS='check inter 10000 rise 1 fall 2 weight 1'
[pod/cluster1-haproxy-0/pxc-monit] + send_proxy=
[pod/cluster1-haproxy-0/pxc-monit] + path_to_haproxy_cfg=/etc/haproxy/pxc
[pod/cluster1-haproxy-0/pxc-monit] + [[ '' = \y\e\s ]]
[pod/cluster1-haproxy-0/pxc-monit] + read pxc_host
[pod/cluster1-haproxy-0/pxc-monit] + '[' -z cluster1-pxc-0.cluster1-pxc.pxc.svc.cluster.local ']'
[pod/cluster1-haproxy-0/pxc-monit] ++ echo cluster1-pxc-0.cluster1-pxc.pxc.svc.cluster.local
[pod/cluster1-haproxy-0/pxc-monit] ++ cut -d . -f -1
[pod/cluster1-haproxy-0/pxc-monit] + node_name=cluster1-pxc-0
[pod/cluster1-haproxy-0/pxc-monit] ++ echo cluster1-pxc-0
[pod/cluster1-haproxy-0/pxc-monit] ++ awk F '{print $NF}'
[pod/cluster1-haproxy-0/pxc-monit] + node_id=0
[pod/cluster1-haproxy-0/pxc-monit] + NODE_LIST_REPL+=("server $node_name $pxc_host:3306 $send_proxy $SERVER_OPTIONS")
[pod/cluster1-haproxy-0/pxc-monit] + '[' x0 == x0 ']'
[pod/cluster1-haproxy-0/pxc-monit] + main_node=cluster1-pxc-0.cluster1-pxc.pxc.svc.cluster.local
[pod/cluster1-haproxy-0/pxc-monit] + firs_node='server cluster1-pxc-0 cluster1-pxc-0.cluster1-pxc.pxc.svc.cluster.local:3306 check inter 10000 rise 1 fall 2 weight 1 on-marked-up shutdown-backup-sessions'
[pod/cluster1-haproxy-0/pxc-monit] + firs_node_admin='server cluster1-pxc-0 cluster1-pxc-0.cluster1-pxc.pxc.svc.cluster.local:33062 check inter 10000 rise 1 fall 2 weight 1 on-marked-up shutdown-backup-sessions'
[pod/cluster1-haproxy-0/pxc-monit] + firs_node_mysqlx='server cluster1-pxc-0 cluster1-pxc-0.cluster1-pxc.pxc.svc.cluster.local:33060 check inter 10000 rise 1 fall 2 weight 1 on-marked-up shutdown-backup-sessions'
[pod/cluster1-haproxy-0/pxc-monit] + continue
[pod/cluster1-haproxy-0/pxc-monit] + read pxc_host
[pod/cluster1-haproxy-0/pxc-monit] + '[' -z cluster1-pxc-2.cluster1-pxc.pxc.svc.cluster.local ']'
[pod/cluster1-haproxy-0/pxc-monit] ++ echo cluster1-pxc-2.cluster1-pxc.pxc.svc.cluster.local