Pod becomes unavailable but is not restarted automatically

Description

After a specific failure (see the log below), the pod becomes unavailable. This is the output of kubectl get pods:

 

kubectl get pods -n pxc -o wide
NAME            READY   STATUS    RESTARTS   AGE   IP               NODE                 NOMINATED NODE   READINESS GATES
cl2-haproxy-0   2/2     Running   3          84m   192.168.61.133   beast-node7-ubuntu   <none>           <none>
cl2-haproxy-1   2/2     Running   3          82m   192.168.71.196   beast-node8-ubuntu   <none>           <none>
cl2-haproxy-2   2/2     Running   3          82m   192.168.66.6     beast-node6-ubuntu   <none>           <none>
cl2-pxc-0       1/1     Running   0          79m   192.168.66.7     beast-node6-ubuntu   <none>           <none>
cl2-pxc-1       0/1     Running   0          83m   192.168.61.134   beast-node7-ubuntu   <none>           <none>
cl2-pxc-2       1/1     Running   3          81m   192.168.71.197   beast-node8-ubuntu   <none>           <none>

 

We can see that pod cl2-pxc-1 is not ready, but it is not being restarted.
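For background: Kubernetes only restarts a container when its liveness probe fails; a failing readiness probe merely marks the pod NotReady and removes it from Service endpoints. The probes configured on this pod (per the describe output below) are equivalent to a spec like the following sketch (not the operator's actual manifest):

```yaml
# Sketch of the pxc container's probe configuration, reconstructed from
# the kubectl describe output. Only the livenessProbe can trigger a
# container restart; the readinessProbe only gates Service endpoints.
livenessProbe:                 # failure here -> kubelet restarts the container
  exec:
    command: ["/var/lib/mysql/liveness-check.sh"]
  initialDelaySeconds: 300
  timeoutSeconds: 5
  periodSeconds: 10
  failureThreshold: 3
readinessProbe:                # failure here -> pod NotReady, no restart
  exec:
    command: ["/var/lib/mysql/readiness-check.sh"]
  initialDelaySeconds: 15
  timeoutSeconds: 15
  periodSeconds: 30
  failureThreshold: 5
```

So a pod stuck at 0/1 Running with RESTARTS 0, as seen here, means the readiness probe is failing while the liveness probe keeps passing.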

kubectl describe:

 

k describe po/cl2-pxc-1
Name:         cl2-pxc-1
Namespace:    pxc
Priority:     0
Node:         beast-node7-ubuntu/172.16.0.15
Start Time:   Wed, 07 Oct 2020 11:14:08 -0400
Labels:       app.kubernetes.io/component=pxc
              app.kubernetes.io/instance=cl2
              app.kubernetes.io/managed-by=percona-xtradb-cluster-operator
              app.kubernetes.io/name=percona-xtradb-cluster
              app.kubernetes.io/part-of=percona-xtradb-cluster
              controller-revision-hash=cl2-pxc-69cfd8579f
              statefulset.kubernetes.io/pod-name=cl2-pxc-1
Annotations:  cni.projectcalico.org/podIP: 192.168.61.134/32
              percona.com/configuration-hash: d41d8cd98f00b204e9800998ecf8427e
              percona.com/ssl-hash: ee931e5aedf277184d31dcce4214d637
              percona.com/ssl-internal-hash: c9712c413646a7be0b49b92f55996b97
Status:       Running
IP:           192.168.61.134
IPs:
  IP:  192.168.61.134
Controlled By:  StatefulSet/cl2-pxc
Init Containers:
  pxc-init:
    Container ID:  docker://e135793ff2fae5a0e8c796a2501c2fe0ad0ef9dcfad4e0deedf38e599700c1a7
    Image:         percona/percona-xtradb-cluster-operator:1.6.0
    Image ID:      docker-pullable://percona/percona-xtradb-cluster-operator@sha256:4ce6c8a55d8ed3a60c96c406ee103f70d303ebd97237e53d0e38fde75f848683
    Port:          <none>
    Host Port:     <none>
    Command:
      /pxc-init-entrypoint.sh
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Wed, 07 Oct 2020 11:14:10 -0400
      Finished:     Wed, 07 Oct 2020 11:14:10 -0400
    Ready:          True
    Restart Count:  0
    Requests:
      cpu:     600m
      memory:  1G
    Environment:  <none>
    Mounts:
      /var/lib/mysql from datadir (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-l7m9h (ro)
  pxc-init-unsafe:
    Container ID:  docker://dadb5bb3f034a6b9c081d128e8c52aa9591d42b88ec4d14ff3a3b9e4fdaadda3
    Image:         percona/percona-xtradb-cluster:8.0.20-11.1
    Image ID:      docker-pullable://percona/percona-xtradb-cluster@sha256:54b1b2f5153b78b05d651034d4603a13e685cbb9b45bfa09a39864fa3f169349
    Ports:         3306/TCP, 4444/TCP, 4567/TCP, 4568/TCP, 33062/TCP
    Host Ports:    0/TCP, 0/TCP, 0/TCP, 0/TCP, 0/TCP
    Command:
      /var/lib/mysql/unsafe-bootstrap.sh
    Args:
      mysqld
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Wed, 07 Oct 2020 11:14:11 -0400
      Finished:     Wed, 07 Oct 2020 11:14:11 -0400
    Ready:          True
    Restart Count:  0
    Requests:
      cpu:     600m
      memory:  1G
    Environment:
      PXC_SERVICE:              cl2-pxc-unready
      MONITOR_HOST:             %
      MYSQL_ROOT_PASSWORD:      <set to the key 'root' in secret 'internal-cl2'>  Optional: false
      XTRABACKUP_PASSWORD:      <set to the key 'xtrabackup' in secret 'internal-cl2'>  Optional: false
      MONITOR_PASSWORD:         <set to the key 'monitor' in secret 'internal-cl2'>  Optional: false
      CLUSTERCHECK_PASSWORD:    <set to the key 'clustercheck' in secret 'internal-cl2'>  Optional: false
      OPERATOR_ADMIN_PASSWORD:  <set to the key 'operator' in secret 'internal-cl2'>  Optional: false
    Mounts:
      /etc/my.cnf.d from auto-config (rw)
      /etc/mysql/ssl from ssl (rw)
      /etc/mysql/ssl-internal from ssl-internal (rw)
      /etc/mysql/vault-keyring-secret from vault-keyring-secret (rw)
      /etc/percona-xtradb-cluster.conf.d from config (rw)
      /tmp from tmp (rw)
      /var/lib/mysql from datadir (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-l7m9h (ro)
Containers:
  pxc:
    Container ID:  docker://58f0ae31d9308214cad4934fd7b66acbfa916f377e9ad4b3a5a748eed794306c
    Image:         percona/percona-xtradb-cluster:8.0.20-11.1
    Image ID:      docker-pullable://percona/percona-xtradb-cluster@sha256:54b1b2f5153b78b05d651034d4603a13e685cbb9b45bfa09a39864fa3f169349
    Ports:         3306/TCP, 4444/TCP, 4567/TCP, 4568/TCP, 33062/TCP
    Host Ports:    0/TCP, 0/TCP, 0/TCP, 0/TCP, 0/TCP
    Command:
      /var/lib/mysql/pxc-entrypoint.sh
    Args:
      mysqld
    State:          Running
      Started:      Wed, 07 Oct 2020 11:14:12 -0400
    Ready:          False
    Restart Count:  0
    Requests:
      cpu:     600m
      memory:  1G
    Liveness:   exec [/var/lib/mysql/liveness-check.sh] delay=300s timeout=5s period=10s #success=1 #failure=3
    Readiness:  exec [/var/lib/mysql/readiness-check.sh] delay=15s timeout=15s period=30s #success=1 #failure=5
    Environment:
      PXC_SERVICE:              cl2-pxc-unready
      MONITOR_HOST:             %
      MYSQL_ROOT_PASSWORD:      <set to the key 'root' in secret 'internal-cl2'>  Optional: false
      XTRABACKUP_PASSWORD:      <set to the key 'xtrabackup' in secret 'internal-cl2'>  Optional: false
      MONITOR_PASSWORD:         <set to the key 'monitor' in secret 'internal-cl2'>  Optional: false
      CLUSTERCHECK_PASSWORD:    <set to the key 'clustercheck' in secret 'internal-cl2'>  Optional: false
      OPERATOR_ADMIN_PASSWORD:  <set to the key 'operator' in secret 'internal-cl2'>  Optional: false
    Mounts:
      /etc/my.cnf.d from auto-config (rw)
      /etc/mysql/ssl from ssl (rw)
      /etc/mysql/ssl-internal from ssl-internal (rw)
      /etc/mysql/vault-keyring-secret from vault-keyring-secret (rw)
      /etc/percona-xtradb-cluster.conf.d from config (rw)
      /tmp from tmp (rw)
      /var/lib/mysql from datadir (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-l7m9h (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  datadir:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  datadir-cl2-pxc-1
    ReadOnly:   false
  tmp:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
    SizeLimit:  <unset>
  config:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      cl2-pxc
    Optional:  true
  ssl-internal:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  my-cluster-ssl-internal
    Optional:    true
  ssl:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  my-cluster-ssl
    Optional:    false
  auto-config:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      auto-cl2-pxc
    Optional:  true
  vault-keyring-secret:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  keyring-secret-vault
    Optional:    true
  default-token-l7m9h:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-l7m9h
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason     Age                   From                         Message
  ----     ------     ----                  ----                         -------
  Warning  Unhealthy  53m                   kubelet, beast-node7-ubuntu  Readiness probe failed: + [[ Primary == \P\r\i\m\a\r\y ]] + [[ 5 -eq 4 ]] + [[ 5 -eq 2 ]] + exit 1
  Warning  Unhealthy  110s (x103 over 52m)  kubelet, beast-node7-ubuntu  Readiness probe failed: + [[ Disconnected == \P\r\i\m\a\r\y ]] + exit 1
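The traced conditions in the readiness-probe events above suggest the check requires wsrep_cluster_status to be Primary and wsrep_local_state to be 4 (Synced) or 2 (Donor/Desynced). A hypothetical reconstruction of that logic (not the actual readiness-check.sh) would be:

```shell
# Hypothetical sketch of the readiness logic implied by the probe trace
# above; the real script would obtain these values via SHOW STATUS queries.
is_ready() {
  local cluster_status=$1   # stand-in for wsrep_cluster_status
  local local_state=$2      # stand-in for wsrep_local_state
  [[ $cluster_status == Primary ]] || return 1
  [[ $local_state -eq 4 || $local_state -eq 2 ]] || return 1
  return 0
}

is_ready Primary 4 && echo "ready: Synced node in Primary component"
is_ready Primary 5 || echo "not ready: state 5 (matches the first event)"
is_ready Disconnected 4 || echo "not ready: non-Primary (matches the repeated events)"
```

This matches what the events show: the node first failed while in wsrep_local_state 5, and then kept failing once it reported Disconnected.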

 

Log of the node failure:

 

2020-10-07T15:18:14.093934Z 0 [Note] [MY-000000] [Galera] async IST sender served
2020-10-07T15:18:14.098880Z 0 [Note] [MY-000000] [Galera] 1.0 (cl2-pxc-0): State transfer from 2.0 (cl2-pxc-1) complete.
2020-10-07T15:18:14.099847Z 0 [Note] [MY-000000] [Galera] Member 1.0 (cl2-pxc-0) synced with group.
2020-10-07T15:42:51.252716Z 0 [Note] [MY-000000] [Galera] declaring 4d930040 at ssl://192.168.66.7:4567 stable
2020-10-07T15:42:51.252813Z 0 [Note] [MY-000000] [Galera] forgetting 0161f386 (ssl://192.168.71.197:4567)
2020-10-07T15:42:51.253609Z 0 [Note] [MY-000000] [Galera] Node 4d930040 state primary
2020-10-07T15:42:51.257245Z 2 [ERROR] [MY-010584] [Repl] Slave SQL: Could not execute Update_rows event on table sbtest.warehouse9; Can't find record in 'warehouse9', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log FIRST, end_log_pos 0, Error_code: MY-001032
2020-10-07T15:42:51.257277Z 2 [Warning] [MY-000000] [WSREP] Event 3 Update_rows apply failed: 120, seqno 76599
2020-10-07T15:42:51.257968Z 2 [Note] [MY-000000] [Galera] Failed to apply write set: gtid: 9eb2648f-08af-11eb-ab37-7fa7bf32cdd9:76599 server_id: 4d930040-08b0-11eb-b309-5af603144c68 client_id: 2349 trx_id: 86802 flags: 3
2020-10-07T15:42:51.258561Z 2 [Note] [MY-000000] [Galera] Closing send monitor...
2020-10-07T15:42:51.258580Z 2 [Note] [MY-000000] [Galera] Closed send monitor.
2020-10-07T15:42:51.258594Z 2 [Note] [MY-000000] [Galera] gcomm: terminating thread
2020-10-07T15:42:51.258611Z 2 [Note] [MY-000000] [Galera] gcomm: joining thread
2020-10-07T15:42:51.258724Z 2 [Note] [MY-000000] [Galera] gcomm: closing backend
2020-10-07T15:42:51.290386Z 10 [ERROR] [MY-010584] [Repl] Slave SQL: Could not execute Update_rows event on table sbtest.warehouse5; Can't find record in 'warehouse5', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log FIRST, end_log_pos 0, Error_code: MY-001032
2020-10-07T15:42:51.290412Z 10 [Warning] [MY-000000] [WSREP] Event 3 Update_rows apply failed: 120, seqno 76600
2020-10-07T15:42:51.290898Z 10 [Note] [MY-000000] [Galera] Failed to apply write set: gtid: 9eb2648f-08af-11eb-ab37-7fa7bf32cdd9:76600 server_id: 4d930040-08b0-11eb-b309-5af603144c68 client_id: 2388 trx_id: 86791 flags: 3
2020-10-07T15:42:51.324996Z 2 [Note] [MY-000000] [Galera] Current view of cluster as seen by this node view (view_id(PRIM,4d930040,6) memb { 4d930040,0 c2b193c8,0 } joined { } left { } partitioned { 0161f386,0 } )
2020-10-07T15:42:51.325043Z 2 [Note] [MY-000000] [Galera] Save the discovered primary-component to disk
2020-10-07T15:42:51.325822Z 2 [Note] [MY-000000] [Galera] forgetting 0161f386 (ssl://192.168.71.197:4567)
2020-10-07T15:42:52.327208Z 2 [Note] [MY-000000] [Galera] Current view of cluster as seen by this node view (view_id(NON_PRIM,4d930040,6) memb { c2b193c8,0 } joined { } left { } partitioned { 4d930040,0 } )
2020-10-07T15:42:52.327236Z 2 [Note] [MY-000000] [Galera] PC protocol downgrade 1 -> 0
2020-10-07T15:42:52.327260Z 2 [Note] [MY-000000] [Galera] Current view of cluster as seen by this node view ((empty))
2020-10-07T15:42:52.327430Z 2 [Note] [MY-000000] [Galera] gcomm: closed
2020-10-07T15:42:52.327552Z 0 [Note] [MY-000000] [Galera] New COMPONENT: primary = yes, bootstrap = no, my_idx = 1, memb_num = 2
2020-10-07T15:42:52.327630Z 0 [Note] [MY-000000] [Galera] STATE EXCHANGE: Waiting for state UUID.
2020-10-07T15:42:52.327723Z 0 [Note] [MY-000000] [Galera] New COMPONENT: primary = no, bootstrap = no, my_idx = 0, memb_num = 1
2020-10-07T15:42:52.327821Z 0 [Note] [MY-000000] [Galera] Flow-control interval: [100, 100]
2020-10-07T15:42:52.327847Z 0 [Note] [MY-000000] [Galera] Received NON-PRIMARY.
2020-10-07T15:42:52.327866Z 0 [Note] [MY-000000] [Galera] Shifting SYNCED -> OPEN (TO: 76659)
2020-10-07T15:42:52.327902Z 0 [Note] [MY-000000] [Galera] New SELF-LEAVE.
2020-10-07T15:42:52.327952Z 0 [Note] [MY-000000] [Galera] Flow-control interval: [0, 0]
2020-10-07T15:42:52.327980Z 0 [Note] [MY-000000] [Galera] Received SELF-LEAVE. Closing connection.
2020-10-07T15:42:52.328003Z 0 [Note] [MY-000000] [Galera] Shifting OPEN -> CLOSED (TO: -1)
2020-10-07T15:42:52.328025Z 0 [Note] [MY-000000] [Galera] RECV thread exiting 0: Success
2020-10-07T15:42:52.328048Z 10 [Note] [MY-000000] [Galera] ####### processing CC -1, local, ordered
2020-10-07T15:42:52.328114Z 10 [Note] [MY-000000] [Galera] ####### My UUID: c2b193c8-08af-11eb-b3f1-6a3d4c9d4080
2020-10-07T15:42:52.328155Z 2 [Note] [MY-000000] [Galera] recv_thread() joined.
2020-10-07T15:42:52.328156Z 10 [Note] [MY-000000] [Galera] ####### ST not required
2020-10-07T15:42:52.328165Z 2 [Note] [MY-000000] [Galera] Closing replication queue.
2020-10-07T15:42:52.328187Z 2 [Note] [MY-000000] [Galera] Closing slave action queue.
2020-10-07T15:42:52.328242Z 10 [Note] [MY-000000] [Galera] ================================================
View:
  id: 9eb2648f-08af-11eb-ab37-7fa7bf32cdd9:-1
  status: non-primary
  protocol_version: 4
  capabilities: MULTI-MASTER, CERTIFICATION, PARALLEL_APPLYING, REPLAY, ISOLATION, PAUSE, CAUSAL_READ, INCREMENTAL_WS, UNORDERED, PREORDERED, STREAMING, NBO
  final: no
  own_index: 0
  members(1):
    0: c2b193c8-08af-11eb-b3f1-6a3d4c9d4080, cl2-pxc-1
=================================================
2020-10-07T15:42:52.328271Z 10 [Note] [MY-000000] [Galera] Non-primary view
2020-10-07T15:42:52.328291Z 10 [Note] [MY-000000] [WSREP] Server status change synced -> connected
2020-10-07T15:42:52.329196Z 10 [Note] [MY-000000] [WSREP] wsrep_notify_cmd is not defined, skipping notification.
2020-10-07T15:42:52.331451Z 10 [Note] [MY-000000] [WSREP] wsrep_notify_cmd is not defined, skipping notification.
2020-10-07T15:42:52.331537Z 10 [Note] [MY-000000] [Galera] ####### processing CC -1, local, ordered
2020-10-07T15:42:52.331573Z 10 [Note] [MY-000000] [Galera] ####### My UUID: c2b193c8-08af-11eb-b3f1-6a3d4c9d4080
2020-10-07T15:42:52.331595Z 10 [Note] [MY-000000] [Galera] ####### ST not required
2020-10-07T15:42:52.331632Z 10 [Note] [MY-000000] [Galera] ================================================
View:
  id: 9eb2648f-08af-11eb-ab37-7fa7bf32cdd9:-1
  status: non-primary
  protocol_version: 4
  capabilities: MULTI-MASTER, CERTIFICATION, PARALLEL_APPLYING, REPLAY, ISOLATION, PAUSE, CAUSAL_READ, INCREMENTAL_WS, UNORDERED, PREORDERED, STREAMING, NBO
  final: yes
  own_index: -1
  members(0):
=================================================
2020-10-07T15:42:52.331658Z 10 [Note] [MY-000000] [Galera] Non-primary view
2020-10-07T15:42:52.331677Z 10 [Note] [MY-000000] [WSREP] Server status change connected -> disconnected
2020-10-07T15:42:52.331699Z 10 [Note] [MY-000000] [WSREP] wsrep_notify_cmd is not defined, skipping notification.
2020-10-07T15:42:52.331721Z 10 [Note] [MY-000000] [WSREP] wsrep_notify_cmd is not defined, skipping notification.
2020-10-07T15:42:52.338267Z 2 [Note] [MY-000000] [WSREP] Applier thread exiting ret: 6 thd: 2
2020-10-07T15:42:52.342107Z 0 [Note] [MY-000000] [Galera] Service thread queue flushed.
2020-10-07T15:42:52.342187Z 10 [Note] [MY-000000] [Galera] ####### Assign initial position for certification: 00000000-0000-0000-0000-000000000000:-1, protocol version: 5
2020-10-07T15:42:52.342234Z 10 [Note] [MY-000000] [WSREP] Applier thread exiting ret: 0 thd: 10
2020-10-07T15:43:59.501050Z 1579 [Warning] [MY-010056] [Server] Host name '192-168-71-196.cl2-haproxy-replicas.pxc.svc.cluster.local' could not be resolved: Name or service not known
2020-10-07T15:44:08.665507Z 1586 [Warning] [MY-010056] [Server] Host name '192-168-71-196.cl2-haproxy-replicas.pxc.svc.cluster.local' could not be resolved: Name or service not known
2020-10-07T15:44:18.710482Z 1595 [Warning] [MY-010056] [Server] Host name '192-168-71-196.cl2-haproxy-replicas.pxc.svc.cluster.local' could not be resolved: Name or service not known
2020-10-07T15:44:22.049759Z 1599 [Warning] [MY-010056] [Server] Host name '192-168-71-196.cl2-haproxy-replicas.pxc.svc.cluster.local' could not be resolved: Name or service not known
2020-10-07T15:44:25.430895Z 1602 [Warning] [MY-010056] [Server] Host name '192-168-71-196.cl2-haproxy-replicas.pxc.svc.cluster.local' could not be resolved: Name or service not known
2020-10-07T15:44:28.761998Z 1605 [Warning] [MY-010056] [Server] Host name '192-168-71-196.cl2-haproxy-replicas.pxc.svc.cluster.local' could not be resolved: Name or service not known

Environment

None

Smart Checklist

Activity

Slava Sarzhan December 18, 2020 at 3:12 PM

P.S. I can try to reproduce this specific case, but I need to know how. Please ping me if needed.

Slava Sarzhan December 18, 2020 at 3:09 PM

Hi  , I have found under the task https://perconadev.atlassian.net/browse/K8SPXC-564#icft=K8SPXC-564 that our liveness probe works incorrectly: if the 'sst_in_progress' file exists, we do not restart the pod because an SST is presumed to be in progress. However, we had the "progress=$DATADIR/sst_in_progress" option in node.cnf, and this file is created on the donor node but never deleted. This issue will be fixed under https://perconadev.atlassian.net/browse/K8SPXC-564#icft=K8SPXC-564
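The failure mode described in this comment can be sketched as follows (a hypothetical reconstruction, not the actual liveness-check.sh; the marker path and the function name are illustrative):

```shell
# While the sst_in_progress marker exists, the probe reports success, so the
# kubelet never restarts the pod. Because node.cnf carried
# progress=$DATADIR/sst_in_progress, the donor created the file and never
# deleted it, leaving the probe permanently "healthy".
DATADIR="$(mktemp -d)"            # stand-in for /var/lib/mysql
touch "$DATADIR/sst_in_progress"  # stale marker left behind on the donor

liveness_check() {
  if [ -f "$DATADIR/sst_in_progress" ]; then
    return 0   # SST presumed in progress: always report healthy
  fi
  return 1     # here the real health check (e.g. a mysqladmin ping) would run
}

liveness_check && echo "probe keeps passing while the stale marker exists"
rm -f "$DATADIR/sst_in_progress"
liveness_check || echo "with the marker gone, the real check can fail the probe"
```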

Duplicate

Details

Assignee

Reporter

Time tracking

30m logged

Priority


Created October 7, 2020 at 4:38 PM
Updated March 5, 2024 at 6:05 PM
Resolved December 18, 2020 at 3:13 PM
