DNS resolution problems force haproxy to remove all PXC nodes, including live ones

Description

The test case is the same as in:
https://jira.percona.com/browse/K8SPXC-1216

If the dead node runs CoreDNS, this causes DNS timeouts.

All haproxy nodes then go down, while the cluster is still alive (it has two PXC members in the Ready state).

There is a similar issue: https://jira.percona.com/browse/K8SPXC-953
But it shouldn't be related to this problem, because here the fault is simulated by killing the pxc+haproxy pods, not by shutting down the k8s worker nodes.

Environment

None

AFFECTED CS IDs

CS0034556

Activity

Nickolay Ihalainen June 14, 2023 at 4:25 PM

Hi, thank you for the suggestion, it was helpful for isolating the issue.

The network policy makes it possible to isolate only the haproxy servers from DNS, while existing connections are still able to access the database (see the sketch below).
Scaling the coredns deployment to zero pods disables DNS everywhere, including both haproxy and PXC pods.
PXC pods do not require DNS for Galera communication, but every new client connection triggers a reverse DNS lookup. This feature is not useful in Kubernetes, because the application's domain names are effectively random, and it creates redundant load on the DNS servers.
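For reference, a minimal sketch of such a policy. The namespace (pxc) and the pod label (app.kubernetes.io/component: haproxy, the operator's usual default) are assumptions to adjust for your deployment. Because NetworkPolicy egress rules are allow-lists, allowing only the MySQL ports implicitly blocks port 53:

kubectl apply -f - <<'EOF'
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: haproxy-no-dns
  namespace: pxc                               # assumption: cluster namespace
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/component: haproxy     # assumption: default operator label
  policyTypes:
    - Egress
  egress:
    # Allow only MySQL traffic (client, admin, mysqlx); DNS (53) is implicitly denied.
    - ports:
        - { protocol: TCP, port: 3306 }
        - { protocol: TCP, port: 33062 }
        - { protocol: TCP, port: 33060 }
EOF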

The permanent solution for unstable DNS setups is to disable this reverse lookup with the skip-name-resolve MySQL option:
https://dev.mysql.com/doc/refman/8.0/en/host-cache.html
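With the operator, the option can be delivered through the custom resource's MySQL configuration, for example (a sketch; the cluster name cluster1 and namespace pxc are assumptions):

# Merge skip-name-resolve into the PXC my.cnf via spec.pxc.configuration.
# Note: a merge patch replaces any existing custom configuration string.
kubectl -n pxc patch pxc cluster1 --type=merge \
  -p '{"spec":{"pxc":{"configuration":"[mysqld]\nskip-name-resolve\n"}}}'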

I think we should mention this in the documentation and close the bug without applying the peer-list changes to the main tree.

Slava Sarzhan June 12, 2023 at 3:05 PM

In my test I scaled kube-dns down to 0 to check it.
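For example (assuming the DNS deployment is named coredns and lives in kube-system; the name varies by distribution):

# Disable cluster DNS entirely by scaling the deployment to zero.
kubectl -n kube-system scale deployment coredns --replicas=0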

Nickolay Ihalainen June 12, 2023 at 2:19 PM

The previous policy was incorrect (it allowed only DNS instead of allowing MySQL).

Slava Sarzhan June 8, 2023 at 3:38 PM

I have tried to improve it, but without any results. As far as I can see, the root of the issue is the liveness probe: as soon as I disable coredns, the probe restarts the pod, and the connection to the DB is interrupted. I have played with the haproxy config (using IPs instead of domain names, experimenting with different options), but I did not find a configuration that can work without DNS.

Nickolay Ihalainen June 6, 2023 at 6:37 PM

Hi

I've run the same test with custom builds; peer-list now seems to work fine, but the haproxy backend checks are failing.

The test case is the same as before:
1. Create a connection to MySQL via haproxy.
2. Execute queries in a loop using this connection, without opening new MySQL connections (see the sketch after this list).
3. Stop coredns, or filter port 53 traffic with a NetworkPolicy for the haproxy pod.
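Steps 1 and 2 can be reproduced with a single long-lived client process, for example (a sketch; the service name cluster1-haproxy and the credentials are placeholders):

# Pipe a query loop into one mysql client so only one connection is ever opened.
while true; do echo 'SELECT NOW();'; sleep 1; done \
  | mysql -h cluster1-haproxy -uroot -p"$ROOT_PASSWORD"

With port 53 blocked, the haproxy external checks then start failing: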

PXC node 10.42.2.15 for backend galera-mysqlx-nodes is ok
[WARNING] (1) : Process 924 exited with code 0 (Exit)
ERROR 2003 (HY000): Can't connect to MySQL server on '10.42.0.7:33062' (111)
The following values are used for PXC node 10.42.0.7 in backend galera-nodes:
wsrep_local_state is ; pxc_maint_mod is ; wsrep_cluster_status is ; 3 nodes are available
PXC node 10.42.0.7 for backend galera-nodes is not ok
ERROR 2003 (HY000): Can't connect to MySQL server on '10.42.1.7:33062' (111)
The following values are used for PXC node 10.42.1.7 in backend galera-nodes:
wsrep_local_state is ; pxc_maint_mod is ; wsrep_cluster_status is ; 3 nodes are available
PXC node 10.42.1.7 for backend galera-nodes is not ok
ERROR 2003 (HY000): Can't connect to MySQL server on '10.42.2.15:33062' (111)
The following values are used for PXC node 10.42.2.15 in backend galera-nodes:
wsrep_local_state is ; pxc_maint_mod is ; wsrep_cluster_status is ; 3 nodes are available
PXC node 10.42.2.15 for backend galera-nodes is not ok
[WARNING] (1) : Process 990 exited with code 0 (Exit)
ERROR 2003 (HY000): Can't connect to MySQL server on '10.42.0.7:33062' (111)
The following values are used for PXC node 10.42.0.7 in backend galera-replica-nodes:
wsrep_local_state is ; pxc_maint_mod is ; wsrep_cluster_status is ; 3 nodes are available
PXC node 10.42.0.7 for backend galera-replica-nodes is not ok
ERROR 2003 (HY000): Can't connect to MySQL server on '10.42.2.15:33062' (111)
The following values are used for PXC node 10.42.2.15 in backend galera-admin-nodes:
wsrep_local_state is ; pxc_maint_mod is ; wsrep_cluster_status is ; 3 nodes are available
PXC node 10.42.2.15 for backend galera-admin-nodes is not ok
[WARNING] (1) : Process 1056 exited with code 0 (Exit)
[WARNING] (1) : Process 1079 exited with code 0 (Exit)
ERROR 2003 (HY000): Can't connect to MySQL server on '10.42.0.7:33062' (111)
The following values are used for PXC node 10.42.0.7 in backend galera-nodes:
wsrep_local_state is ; pxc_maint_mod is ; wsrep_cluster_status is ; 3 nodes are available
PXC node 10.42.0.7 for backend galera-nodes is not ok

[pod/cluster1-haproxy-0/haproxy] wsrep_local_state is 4; pxc_maint_mod is DISABLED; wsrep_cluster_status is Primary; 3 nodes are available
[pod/cluster1-haproxy-0/haproxy] PXC node 10.42.0.7 for backend galera-replica-nodes is ok
[pod/cluster1-haproxy-0/haproxy] [WARNING] (1) : Process 349 exited with code 0 (Exit)
[pod/cluster1-haproxy-0/haproxy] [WARNING] (1) : Process 415 exited with code 0 (Exit)
[pod/cluster1-haproxy-0/haproxy] [WARNING] (19) : kill 332
[pod/cluster1-haproxy-0/haproxy] [WARNING] (1) : Process 335 exited with code 0 (Exit)
[pod/cluster1-haproxy-0/haproxy] [WARNING] (19) : kill 354
[pod/cluster1-haproxy-0/haproxy] [WARNING] (1) : Process 357 exited with code 0 (Exit)
[pod/cluster1-haproxy-0/haproxy] [WARNING] (1) : Process 472 exited with code 0 (Exit)
[pod/cluster1-haproxy-0/haproxy] [WARNING] (19) : kill 363
[pod/cluster1-haproxy-0/haproxy] [WARNING] (1) : Process 366 exited with code 0 (Exit)
[pod/cluster1-haproxy-0/haproxy] [WARNING] (19) : kill 372
[pod/cluster1-haproxy-0/haproxy] [WARNING] (1) : Process 375 exited with code 0 (Exit)
[pod/cluster1-haproxy-0/haproxy] [WARNING] (1) : Process 484 exited with code 0 (Exit)
[pod/cluster1-haproxy-0/haproxy] [WARNING] (19) : kill 381
[pod/cluster1-haproxy-0/haproxy] [WARNING] (1) : Process 384 exited with code 0 (Exit)
[pod/cluster1-haproxy-0/haproxy] [WARNING] (19) : kill 390
[pod/cluster1-haproxy-0/haproxy] [WARNING] (1) : Process 393 exited with code 0 (Exit)
[pod/cluster1-haproxy-0/haproxy] [WARNING] (19) : kill 399
[pod/cluster1-haproxy-0/haproxy] [WARNING] (1) : Process 402 exited with code 0 (Exit)
[pod/cluster1-haproxy-0/haproxy] [WARNING] (19) : kill 420
[pod/cluster1-haproxy-0/haproxy] [WARNING] (1) : Process 423 exited with code 0 (Exit)
[pod/cluster1-haproxy-0/haproxy] [WARNING] (1) : Process 496 exited with code 0 (Exit)
[pod/cluster1-haproxy-0/haproxy] [WARNING] (19) : kill 429
[pod/cluster1-haproxy-0/haproxy] [WARNING] (1) : Process 432 exited with code 0 (Exit)
[pod/cluster1-haproxy-0/haproxy] [WARNING] (19) : kill 438
[pod/cluster1-haproxy-0/haproxy] [WARNING] (1) : Process 441 exited with code 0 (Exit)
[pod/cluster1-haproxy-0/haproxy] [WARNING] (19) : kill 447
[pod/cluster1-haproxy-0/haproxy] [WARNING] (1) : Process 450 exited with code 0 (Exit)
[pod/cluster1-haproxy-0/haproxy] [WARNING] (19) : kill 456
[pod/cluster1-haproxy-0/haproxy] [WARNING] (1) : Process 459 exited with code 0 (Exit)
[pod/cluster1-haproxy-0/haproxy] [WARNING] (1) : Process 517 exited with code 0 (Exit)
[pod/cluster1-haproxy-0/haproxy] [WARNING] (1) : Process 583 exited with code 0 (Exit)
[pod/cluster1-haproxy-0/haproxy] [WARNING] (19) : kill 501
[pod/cluster1-haproxy-0/haproxy] [WARNING] (19) : Server galera-replica-nodes/cluster1-pxc-1 is DOWN, reason: External check timeout, code: 0, check duration: 10002ms. 2 active and 0 backup servers left. 0 sess>
[pod/cluster1-haproxy-0/haproxy] [WARNING] (1) : Process 504 exited with code 0 (Exit)
[pod/cluster1-haproxy-0/haproxy] [WARNING] (19) : kill 522
[pod/cluster1-haproxy-0/haproxy] [WARNING] (19) : Server galera-replica-nodes/cluster1-pxc-2 is DOWN, reason: External check timeout, code: 0, check duration: 10001ms. 1 active and 0 backup servers left. 0 sess>
[pod/cluster1-haproxy-0/haproxy] [WARNING] (1) : Process 525 exited with code 0 (Exit)

[pod/cluster1-haproxy-0/pxc-monit] PXC node cluster1-pxc-0.cluster1-pxc.pxc.svc.cluster.local for backend is ok
[pod/cluster1-haproxy-0/pxc-monit] + for backup_server in ${NODE_LIST_BACKUP[@]}
[pod/cluster1-haproxy-0/pxc-monit] + echo 'shutdown sessions server galera-nodes/cluster1-pxc-1'
[pod/cluster1-haproxy-0/pxc-monit] + socat stdio /etc/haproxy/pxc/haproxy.sock
[pod/cluster1-haproxy-0/pxc-monit] No such server.
[pod/cluster1-haproxy-0/pxc-monit]
[pod/cluster1-haproxy-0/pxc-monit] + for backup_server in ${NODE_LIST_BACKUP[@]}
[pod/cluster1-haproxy-0/pxc-monit] + echo 'shutdown sessions server galera-admin-nodes/cluster1-pxc-1'
[pod/cluster1-haproxy-0/pxc-monit] + socat stdio /etc/haproxy/pxc/haproxy.sock
[pod/cluster1-haproxy-0/pxc-monit] No such server.
[pod/cluster1-haproxy-0/pxc-monit]
[pod/cluster1-haproxy-0/pxc-monit] + for backup_server in ${NODE_LIST_BACKUP[@]}
[pod/cluster1-haproxy-0/pxc-monit] + echo 'shutdown sessions server galera-nodes/cluster1-pxc-2'
[pod/cluster1-haproxy-0/pxc-monit] + socat stdio /etc/haproxy/pxc/haproxy.sock
[pod/cluster1-haproxy-0/pxc-monit] No such server.
[pod/cluster1-haproxy-0/pxc-monit]
[pod/cluster1-haproxy-0/pxc-monit] + for backup_server in ${NODE_LIST_BACKUP[@]}
[pod/cluster1-haproxy-0/pxc-monit] + echo 'shutdown sessions server galera-admin-nodes/cluster1-pxc-2'
[pod/cluster1-haproxy-0/pxc-monit] + socat stdio /etc/haproxy/pxc/haproxy.sock
[pod/cluster1-haproxy-0/pxc-monit] No such server.
[pod/cluster1-haproxy-0/pxc-monit]
[pod/cluster1-haproxy-0/pxc-monit] + '[' -S /etc/haproxy/pxc/haproxy-main.sock ']'
[pod/cluster1-haproxy-0/pxc-monit] + echo reload
[pod/cluster1-haproxy-0/pxc-monit] + socat stdio /etc/haproxy/pxc/haproxy-main.sock
[pod/cluster1-haproxy-0/pxc-monit] + exit 0
[pod/cluster1-haproxy-0/pxc-monit] 2023/06/06 17:55:55 lookup cluster1-pxc on 10.43.0.10:53: read udp 10.42.1.8:44126->10.43.0.10:53: i/o timeout
[pod/cluster1-haproxy-0/pxc-monit] 2023/06/06 17:56:36 lookup cluster1-pxc on 10.43.0.10:53: read udp 10.42.1.8:40623->10.43.0.10:53: i/o timeout
[pod/cluster1-haproxy-0/pxc-monit] 2023/06/06 17:57:17 lookup cluster1-pxc on 10.43.0.10:53: read udp 10.42.1.8:37546->10.43.0.10:53: i/o timeout
[pod/cluster1-haproxy-0/pxc-monit] 2023/06/06 17:57:58 lookup cluster1-pxc on 10.43.0.10:53: read udp 10.42.1.8:38746->10.43.0.10:53: i/o timeout
[pod/cluster1-haproxy-0/pxc-monit] 2023/06/06 17:58:34 lookup cluster1-pxc on 10.43.0.10:53: no such host
[pod/cluster1-haproxy-0/pxc-monit] 2023/06/06 17:58:35 Peer list updated
[pod/cluster1-haproxy-0/pxc-monit] was []
[pod/cluster1-haproxy-0/pxc-monit] now [cluster1-pxc-0.cluster1-pxc.pxc.svc.cluster.local cluster1-pxc-2.cluster1-pxc.pxc.svc.cluster.local]
[pod/cluster1-haproxy-0/pxc-monit] 2023/06/06 17:58:35 execing: /usr/bin/add_pxc_nodes.sh with stdin: cluster1-pxc-0.cluster1-pxc.pxc.svc.cluster.local
[pod/cluster1-haproxy-0/pxc-monit] cluster1-pxc-2.cluster1-pxc.pxc.svc.cluster.local
[pod/cluster1-haproxy-0/pxc-monit] 2023/06/06 17:58:35 Failed to execute /usr/bin/add_pxc_nodes.sh: + main
[pod/cluster1-haproxy-0/pxc-monit] + echo 'Running /usr/bin/add_pxc_nodes.sh'
[pod/cluster1-haproxy-0/pxc-monit] Running /usr/bin/add_pxc_nodes.sh
[pod/cluster1-haproxy-0/pxc-monit] + NODE_LIST=()
[pod/cluster1-haproxy-0/pxc-monit] + NODE_LIST_REPL=()
[pod/cluster1-haproxy-0/pxc-monit] + NODE_LIST_MYSQLX=()
[pod/cluster1-haproxy-0/pxc-monit] + NODE_LIST_ADMIN=()
[pod/cluster1-haproxy-0/pxc-monit] + NODE_LIST_BACKUP=()
[pod/cluster1-haproxy-0/pxc-monit] + firs_node=
[pod/cluster1-haproxy-0/pxc-monit] + firs_node_admin=
[pod/cluster1-haproxy-0/pxc-monit] + main_node=
[pod/cluster1-haproxy-0/pxc-monit] + SERVER_OPTIONS='check inter 10000 rise 1 fall 2 weight 1'
[pod/cluster1-haproxy-0/pxc-monit] + send_proxy=
[pod/cluster1-haproxy-0/pxc-monit] + path_to_haproxy_cfg=/etc/haproxy/pxc
[pod/cluster1-haproxy-0/pxc-monit] + [[ '' = \y\e\s ]]
[pod/cluster1-haproxy-0/pxc-monit] + read pxc_host
[pod/cluster1-haproxy-0/pxc-monit] + '[' -z cluster1-pxc-0.cluster1-pxc.pxc.svc.cluster.local ']'
[pod/cluster1-haproxy-0/pxc-monit] ++ echo cluster1-pxc-0.cluster1-pxc.pxc.svc.cluster.local
[pod/cluster1-haproxy-0/pxc-monit] ++ cut -d . -f -1
[pod/cluster1-haproxy-0/pxc-monit] + node_name=cluster1-pxc-0
[pod/cluster1-haproxy-0/pxc-monit] ++ echo cluster1-pxc-0
[pod/cluster1-haproxy-0/pxc-monit] ++ awk -F- '{print $NF}'
[pod/cluster1-haproxy-0/pxc-monit] + node_id=0
[pod/cluster1-haproxy-0/pxc-monit] + NODE_LIST_REPL+=("server $node_name $pxc_host:3306 $send_proxy $SERVER_OPTIONS")
[pod/cluster1-haproxy-0/pxc-monit] + '[' x0 == x0 ']'
[pod/cluster1-haproxy-0/pxc-monit] + main_node=cluster1-pxc-0.cluster1-pxc.pxc.svc.cluster.local
[pod/cluster1-haproxy-0/pxc-monit] + firs_node='server cluster1-pxc-0 cluster1-pxc-0.cluster1-pxc.pxc.svc.cluster.local:3306 check inter 10000 rise 1 fall 2 weight 1 on-marked-up shutdown-backup-sessions'
[pod/cluster1-haproxy-0/pxc-monit] + firs_node_admin='server cluster1-pxc-0 cluster1-pxc-0.cluster1-pxc.pxc.svc.cluster.local:33062 check inter 10000 rise 1 fall 2 weight 1 on-marked-up shutdown-backup-sessions'
[pod/cluster1-haproxy-0/pxc-monit] + firs_node_mysqlx='server cluster1-pxc-0 cluster1-pxc-0.cluster1-pxc.pxc.svc.cluster.local:33060 check inter 10000 rise 1 fall 2 weight 1 on-marked-up shutdown-backup-sessions'
[pod/cluster1-haproxy-0/pxc-monit] + continue
[pod/cluster1-haproxy-0/pxc-monit] + read pxc_host
[pod/cluster1-haproxy-0/pxc-monit] + '[' -z cluster1-pxc-2.cluster1-pxc.pxc.svc.cluster.local ']'
[pod/cluster1-haproxy-0/pxc-monit] ++ echo cluster1-pxc-2.cluster1-pxc.pxc.svc.cluster.local

Done

Details

Needs QA

Yes

Needs Doc

Yes

Created March 10, 2023 at 3:20 PM
Updated March 5, 2024 at 5:27 PM
Resolved July 11, 2023 at 6:37 PM