Pod becomes unavailable but is not restarted automatically

Description

After a specific failure (see the log below), the pod becomes unavailable. This is the output of kubectl get pods:

 

kubectl get pods -n pxc -o wide
NAME            READY   STATUS    RESTARTS   AGE   IP               NODE                 NOMINATED NODE   READINESS GATES
cl2-haproxy-0   2/2     Running   3          84m   192.168.61.133   beast-node7-ubuntu   <none>           <none>
cl2-haproxy-1   2/2     Running   3          82m   192.168.71.196   beast-node8-ubuntu   <none>           <none>
cl2-haproxy-2   2/2     Running   3          82m   192.168.66.6     beast-node6-ubuntu   <none>           <none>
cl2-pxc-0       1/1     Running   0          79m   192.168.66.7     beast-node6-ubuntu   <none>           <none>
cl2-pxc-1       0/1     Running   0          83m   192.168.61.134   beast-node7-ubuntu   <none>           <none>
cl2-pxc-2       1/1     Running   3          81m   192.168.71.197   beast-node8-ubuntu   <none>           <none>

 

We can see that pod cl2-pxc-1 is not ready, but it is not being restarted.
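For background: Kubernetes only restarts a container when its liveness probe fails; a failing readiness probe merely marks the pod NotReady and removes it from Service endpoints. The probes configured on this pod (per the describe output below) are equivalent to a spec like the following sketch (not the operator's actual manifest):

```yaml
# Sketch of the pxc container's probe configuration, reconstructed from
# the kubectl describe output. Only the livenessProbe can trigger a
# container restart; the readinessProbe only gates Service endpoints.
livenessProbe:                 # failure here -> kubelet restarts the container
  exec:
    command: ["/var/lib/mysql/liveness-check.sh"]
  initialDelaySeconds: 300
  timeoutSeconds: 5
  periodSeconds: 10
  failureThreshold: 3
readinessProbe:                # failure here -> pod NotReady, no restart
  exec:
    command: ["/var/lib/mysql/readiness-check.sh"]
  initialDelaySeconds: 15
  timeoutSeconds: 15
  periodSeconds: 30
  failureThreshold: 5
```

So a pod stuck at 0/1 Running with RESTARTS 0, as seen here, means the readiness probe is failing while the liveness probe keeps passing.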

kubectl describe:

 

k describe po/cl2-pxc-1
Name:         cl2-pxc-1
Namespace:    pxc
Priority:     0
Node:         beast-node7-ubuntu/172.16.0.15
Start Time:   Wed, 07 Oct 2020 11:14:08 -0400
Labels:       app.kubernetes.io/component=pxc
              app.kubernetes.io/instance=cl2
              app.kubernetes.io/managed-by=percona-xtradb-cluster-operator
              app.kubernetes.io/name=percona-xtradb-cluster
              app.kubernetes.io/part-of=percona-xtradb-cluster
              controller-revision-hash=cl2-pxc-69cfd8579f
              statefulset.kubernetes.io/pod-name=cl2-pxc-1
Annotations:  cni.projectcalico.org/podIP: 192.168.61.134/32
              percona.com/configuration-hash: d41d8cd98f00b204e9800998ecf8427e
              percona.com/ssl-hash: ee931e5aedf277184d31dcce4214d637
              percona.com/ssl-internal-hash: c9712c413646a7be0b49b92f55996b97
Status:       Running
IP:           192.168.61.134
IPs:
  IP:  192.168.61.134
Controlled By:  StatefulSet/cl2-pxc
Init Containers:
  pxc-init:
    Container ID:  docker://e135793ff2fae5a0e8c796a2501c2fe0ad0ef9dcfad4e0deedf38e599700c1a7
    Image:         percona/percona-xtradb-cluster-operator:1.6.0
    Image ID:      docker-pullable://percona/percona-xtradb-cluster-operator@sha256:4ce6c8a55d8ed3a60c96c406ee103f70d303ebd97237e53d0e38fde75f848683
    Port:          <none>
    Host Port:     <none>
    Command:
      /pxc-init-entrypoint.sh
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Wed, 07 Oct 2020 11:14:10 -0400
      Finished:     Wed, 07 Oct 2020 11:14:10 -0400
    Ready:          True
    Restart Count:  0
    Requests:
      cpu:     600m
      memory:  1G
    Environment:  <none>
    Mounts:
      /var/lib/mysql from datadir (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-l7m9h (ro)
  pxc-init-unsafe:
    Container ID:  docker://dadb5bb3f034a6b9c081d128e8c52aa9591d42b88ec4d14ff3a3b9e4fdaadda3
    Image:         percona/percona-xtradb-cluster:8.0.20-11.1
    Image ID:      docker-pullable://percona/percona-xtradb-cluster@sha256:54b1b2f5153b78b05d651034d4603a13e685cbb9b45bfa09a39864fa3f169349
    Ports:         3306/TCP, 4444/TCP, 4567/TCP, 4568/TCP, 33062/TCP
    Host Ports:    0/TCP, 0/TCP, 0/TCP, 0/TCP, 0/TCP
    Command:
      /var/lib/mysql/unsafe-bootstrap.sh
    Args:
      mysqld
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Wed, 07 Oct 2020 11:14:11 -0400
      Finished:     Wed, 07 Oct 2020 11:14:11 -0400
    Ready:          True
    Restart Count:  0
    Requests:
      cpu:     600m
      memory:  1G
    Environment:
      PXC_SERVICE:              cl2-pxc-unready
      MONITOR_HOST:             %
      MYSQL_ROOT_PASSWORD:      <set to the key 'root' in secret 'internal-cl2'>  Optional: false
      XTRABACKUP_PASSWORD:      <set to the key 'xtrabackup' in secret 'internal-cl2'>  Optional: false
      MONITOR_PASSWORD:         <set to the key 'monitor' in secret 'internal-cl2'>  Optional: false
      CLUSTERCHECK_PASSWORD:    <set to the key 'clustercheck' in secret 'internal-cl2'>  Optional: false
      OPERATOR_ADMIN_PASSWORD:  <set to the key 'operator' in secret 'internal-cl2'>  Optional: false
    Mounts:
      /etc/my.cnf.d from auto-config (rw)
      /etc/mysql/ssl from ssl (rw)
      /etc/mysql/ssl-internal from ssl-internal (rw)
      /etc/mysql/vault-keyring-secret from vault-keyring-secret (rw)
      /etc/percona-xtradb-cluster.conf.d from config (rw)
      /tmp from tmp (rw)
      /var/lib/mysql from datadir (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-l7m9h (ro)
Containers:
  pxc:
    Container ID:  docker://58f0ae31d9308214cad4934fd7b66acbfa916f377e9ad4b3a5a748eed794306c
    Image:         percona/percona-xtradb-cluster:8.0.20-11.1
    Image ID:      docker-pullable://percona/percona-xtradb-cluster@sha256:54b1b2f5153b78b05d651034d4603a13e685cbb9b45bfa09a39864fa3f169349
    Ports:         3306/TCP, 4444/TCP, 4567/TCP, 4568/TCP, 33062/TCP
    Host Ports:    0/TCP, 0/TCP, 0/TCP, 0/TCP, 0/TCP
    Command:
      /var/lib/mysql/pxc-entrypoint.sh
    Args:
      mysqld
    State:          Running
      Started:      Wed, 07 Oct 2020 11:14:12 -0400
    Ready:          False
    Restart Count:  0
    Requests:
      cpu:     600m
      memory:  1G
    Liveness:   exec [/var/lib/mysql/liveness-check.sh] delay=300s timeout=5s period=10s #success=1 #failure=3
    Readiness:  exec [/var/lib/mysql/readiness-check.sh] delay=15s timeout=15s period=30s #success=1 #failure=5
    Environment:
      PXC_SERVICE:              cl2-pxc-unready
      MONITOR_HOST:             %
      MYSQL_ROOT_PASSWORD:      <set to the key 'root' in secret 'internal-cl2'>  Optional: false
      XTRABACKUP_PASSWORD:      <set to the key 'xtrabackup' in secret 'internal-cl2'>  Optional: false
      MONITOR_PASSWORD:         <set to the key 'monitor' in secret 'internal-cl2'>  Optional: false
      CLUSTERCHECK_PASSWORD:    <set to the key 'clustercheck' in secret 'internal-cl2'>  Optional: false
      OPERATOR_ADMIN_PASSWORD:  <set to the key 'operator' in secret 'internal-cl2'>  Optional: false
    Mounts:
      /etc/my.cnf.d from auto-config (rw)
      /etc/mysql/ssl from ssl (rw)
      /etc/mysql/ssl-internal from ssl-internal (rw)
      /etc/mysql/vault-keyring-secret from vault-keyring-secret (rw)
      /etc/percona-xtradb-cluster.conf.d from config (rw)
      /tmp from tmp (rw)
      /var/lib/mysql from datadir (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-l7m9h (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  datadir:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  datadir-cl2-pxc-1
    ReadOnly:   false
  tmp:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
    SizeLimit:  <unset>
  config:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      cl2-pxc
    Optional:  true
  ssl-internal:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  my-cluster-ssl-internal
    Optional:    true
  ssl:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  my-cluster-ssl
    Optional:    false
  auto-config:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      auto-cl2-pxc
    Optional:  true
  vault-keyring-secret:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  keyring-secret-vault
    Optional:    true
  default-token-l7m9h:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-l7m9h
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason     Age                   From                         Message
  ----     ------     ----                  ----                         -------
  Warning  Unhealthy  53m                   kubelet, beast-node7-ubuntu  Readiness probe failed: + [[ Primary == \P\r\i\m\a\r\y ]] + [[ 5 -eq 4 ]] + [[ 5 -eq 2 ]] + exit 1
  Warning  Unhealthy  110s (x103 over 52m)  kubelet, beast-node7-ubuntu  Readiness probe failed: + [[ Disconnected == \P\r\i\m\a\r\y ]] + exit 1
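The traced conditions in the readiness-probe events above suggest the check requires wsrep_cluster_status to be Primary and wsrep_local_state to be 4 (Synced) or 2 (Donor/Desynced). A hypothetical reconstruction of that logic (not the actual readiness-check.sh) would be:

```shell
# Hypothetical sketch of the readiness logic implied by the probe trace
# above; the real script would obtain these values via SHOW STATUS queries.
is_ready() {
  local cluster_status=$1   # stand-in for wsrep_cluster_status
  local local_state=$2      # stand-in for wsrep_local_state
  [[ $cluster_status == Primary ]] || return 1
  [[ $local_state -eq 4 || $local_state -eq 2 ]] || return 1
  return 0
}

is_ready Primary 4 && echo "ready: Synced node in Primary component"
is_ready Primary 5 || echo "not ready: state 5 (matches the first event)"
is_ready Disconnected 4 || echo "not ready: non-Primary (matches the repeated events)"
```

This matches what the events show: the node first failed while in wsrep_local_state 5, and then kept failing once it reported Disconnected.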

 

Log of the node failure:

 

2020-10-07T15:18:14.093934Z 0 [Note] [MY-000000] [Galera] async IST sender served
2020-10-07T15:18:14.098880Z 0 [Note] [MY-000000] [Galera] 1.0 (cl2-pxc-0): State transfer from 2.0 (cl2-pxc-1) complete.
2020-10-07T15:18:14.099847Z 0 [Note] [MY-000000] [Galera] Member 1.0 (cl2-pxc-0) synced with group.
2020-10-07T15:42:51.252716Z 0 [Note] [MY-000000] [Galera] declaring 4d930040 at ssl://192.168.66.7:4567 stable
2020-10-07T15:42:51.252813Z 0 [Note] [MY-000000] [Galera] forgetting 0161f386 (ssl://192.168.71.197:4567)
2020-10-07T15:42:51.253609Z 0 [Note] [MY-000000] [Galera] Node 4d930040 state primary
2020-10-07T15:42:51.257245Z 2 [ERROR] [MY-010584] [Repl] Slave SQL: Could not execute Update_rows event on table sbtest.warehouse9; Can't find record in 'warehouse9', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log FIRST, end_log_pos 0, Error_code: MY-001032
2020-10-07T15:42:51.257277Z 2 [Warning] [MY-000000] [WSREP] Event 3 Update_rows apply failed: 120, seqno 76599
2020-10-07T15:42:51.257968Z 2 [Note] [MY-000000] [Galera] Failed to apply write set: gtid: 9eb2648f-08af-11eb-ab37-7fa7bf32cdd9:76599 server_id: 4d930040-08b0-11eb-b309-5af603144c68 client_id: 2349 trx_id: 86802 flags: 3
2020-10-07T15:42:51.258561Z 2 [Note] [MY-000000] [Galera] Closing send monitor...
2020-10-07T15:42:51.258580Z 2 [Note] [MY-000000] [Galera] Closed send monitor.
2020-10-07T15:42:51.258594Z 2 [Note] [MY-000000] [Galera] gcomm: terminating thread
2020-10-07T15:42:51.258611Z 2 [Note] [MY-000000] [Galera] gcomm: joining thread
2020-10-07T15:42:51.258724Z 2 [Note] [MY-000000] [Galera] gcomm: closing backend
2020-10-07T15:42:51.290386Z 10 [ERROR] [MY-010584] [Repl] Slave SQL: Could not execute Update_rows event on table sbtest.warehouse5; Can't find record in 'warehouse5', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log FIRST, end_log_pos 0, Error_code: MY-001032
2020-10-07T15:42:51.290412Z 10 [Warning] [MY-000000] [WSREP] Event 3 Update_rows apply failed: 120, seqno 76600
2020-10-07T15:42:51.290898Z 10 [Note] [MY-000000] [Galera] Failed to apply write set: gtid: 9eb2648f-08af-11eb-ab37-7fa7bf32cdd9:76600 server_id: 4d930040-08b0-11eb-b309-5af603144c68 client_id: 2388 trx_id: 86791 flags: 3
2020-10-07T15:42:51.324996Z 2 [Note] [MY-000000] [Galera] Current view of cluster as seen by this node view (view_id(PRIM,4d930040,6) memb { 4d930040,0 c2b193c8,0 } joined { } left { } partitioned { 0161f386,0 } )
2020-10-07T15:42:51.325043Z 2 [Note] [MY-000000] [Galera] Save the discovered primary-component to disk
2020-10-07T15:42:51.325822Z 2 [Note] [MY-000000] [Galera] forgetting 0161f386 (ssl://192.168.71.197:4567)
2020-10-07T15:42:52.327208Z 2 [Note] [MY-000000] [Galera] Current view of cluster as seen by this node view (view_id(NON_PRIM,4d930040,6) memb { c2b193c8,0 } joined { } left { } partitioned { 4d930040,0 } )
2020-10-07T15:42:52.327236Z 2 [Note] [MY-000000] [Galera] PC protocol downgrade 1 -> 0
2020-10-07T15:42:52.327260Z 2 [Note] [MY-000000] [Galera] Current view of cluster as seen by this node view ((empty))
2020-10-07T15:42:52.327430Z 2 [Note] [MY-000000] [Galera] gcomm: closed
2020-10-07T15:42:52.327552Z 0 [Note] [MY-000000] [Galera] New COMPONENT: primary = yes, bootstrap = no, my_idx = 1, memb_num = 2
2020-10-07T15:42:52.327630Z 0 [Note] [MY-000000] [Galera] STATE EXCHANGE: Waiting for state UUID.
2020-10-07T15:42:52.327723Z 0 [Note] [MY-000000] [Galera] New COMPONENT: primary = no, bootstrap = no, my_idx = 0, memb_num = 1
2020-10-07T15:42:52.327821Z 0 [Note] [MY-000000] [Galera] Flow-control interval: [100, 100]
2020-10-07T15:42:52.327847Z 0 [Note] [MY-000000] [Galera] Received NON-PRIMARY.
2020-10-07T15:42:52.327866Z 0 [Note] [MY-000000] [Galera] Shifting SYNCED -> OPEN (TO: 76659)
2020-10-07T15:42:52.327902Z 0 [Note] [MY-000000] [Galera] New SELF-LEAVE.
2020-10-07T15:42:52.327952Z 0 [Note] [MY-000000] [Galera] Flow-control interval: [0, 0]
2020-10-07T15:42:52.327980Z 0 [Note] [MY-000000] [Galera] Received SELF-LEAVE. Closing connection.
2020-10-07T15:42:52.328003Z 0 [Note] [MY-000000] [Galera] Shifting OPEN -> CLOSED (TO: -1)
2020-10-07T15:42:52.328025Z 0 [Note] [MY-000000] [Galera] RECV thread exiting 0: Success
2020-10-07T15:42:52.328048Z 10 [Note] [MY-000000] [Galera] ####### processing CC -1, local, ordered
2020-10-07T15:42:52.328114Z 10 [Note] [MY-000000] [Galera] ####### My UUID: c2b193c8-08af-11eb-b3f1-6a3d4c9d4080
2020-10-07T15:42:52.328155Z 2 [Note] [MY-000000] [Galera] recv_thread() joined.
2020-10-07T15:42:52.328156Z 10 [Note] [MY-000000] [Galera] ####### ST not required
2020-10-07T15:42:52.328165Z 2 [Note] [MY-000000] [Galera] Closing replication queue.
2020-10-07T15:42:52.328187Z 2 [Note] [MY-000000] [Galera] Closing slave action queue.
2020-10-07T15:42:52.328242Z 10 [Note] [MY-000000] [Galera] ================================================
View:
  id: 9eb2648f-08af-11eb-ab37-7fa7bf32cdd9:-1
  status: non-primary
  protocol_version: 4
  capabilities: MULTI-MASTER, CERTIFICATION, PARALLEL_APPLYING, REPLAY, ISOLATION, PAUSE, CAUSAL_READ, INCREMENTAL_WS, UNORDERED, PREORDERED, STREAMING, NBO
  final: no
  own_index: 0
  members(1):
    0: c2b193c8-08af-11eb-b3f1-6a3d4c9d4080, cl2-pxc-1
=================================================
2020-10-07T15:42:52.328271Z 10 [Note] [MY-000000] [Galera] Non-primary view
2020-10-07T15:42:52.328291Z 10 [Note] [MY-000000] [WSREP] Server status change synced -> connected
2020-10-07T15:42:52.329196Z 10 [Note] [MY-000000] [WSREP] wsrep_notify_cmd is not defined, skipping notification.
2020-10-07T15:42:52.331451Z 10 [Note] [MY-000000] [WSREP] wsrep_notify_cmd is not defined, skipping notification.
2020-10-07T15:42:52.331537Z 10 [Note] [MY-000000] [Galera] ####### processing CC -1, local, ordered
2020-10-07T15:42:52.331573Z 10 [Note] [MY-000000] [Galera] ####### My UUID: c2b193c8-08af-11eb-b3f1-6a3d4c9d4080
2020-10-07T15:42:52.331595Z 10 [Note] [MY-000000] [Galera] ####### ST not required
2020-10-07T15:42:52.331632Z 10 [Note] [MY-000000] [Galera] ================================================
View:
  id: 9eb2648f-08af-11eb-ab37-7fa7bf32cdd9:-1
  status: non-primary
  protocol_version: 4
  capabilities: MULTI-MASTER, CERTIFICATION, PARALLEL_APPLYING, REPLAY, ISOLATION, PAUSE, CAUSAL_READ, INCREMENTAL_WS, UNORDERED, PREORDERED, STREAMING, NBO
  final: yes
  own_index: -1
  members(0):
=================================================
2020-10-07T15:42:52.331658Z 10 [Note] [MY-000000] [Galera] Non-primary view
2020-10-07T15:42:52.331677Z 10 [Note] [MY-000000] [WSREP] Server status change connected -> disconnected
2020-10-07T15:42:52.331699Z 10 [Note] [MY-000000] [WSREP] wsrep_notify_cmd is not defined, skipping notification.
2020-10-07T15:42:52.331721Z 10 [Note] [MY-000000] [WSREP] wsrep_notify_cmd is not defined, skipping notification.
2020-10-07T15:42:52.338267Z 2 [Note] [MY-000000] [WSREP] Applier thread exiting ret: 6 thd: 2
2020-10-07T15:42:52.342107Z 0 [Note] [MY-000000] [Galera] Service thread queue flushed.
2020-10-07T15:42:52.342187Z 10 [Note] [MY-000000] [Galera] ####### Assign initial position for certification: 00000000-0000-0000-0000-000000000000:-1, protocol version: 5
2020-10-07T15:42:52.342234Z 10 [Note] [MY-000000] [WSREP] Applier thread exiting ret: 0 thd: 10
2020-10-07T15:43:59.501050Z 1579 [Warning] [MY-010056] [Server] Host name '192-168-71-196.cl2-haproxy-replicas.pxc.svc.cluster.local' could not be resolved: Name or service not known
2020-10-07T15:44:08.665507Z 1586 [Warning] [MY-010056] [Server] Host name '192-168-71-196.cl2-haproxy-replicas.pxc.svc.cluster.local' could not be resolved: Name or service not known
2020-10-07T15:44:18.710482Z 1595 [Warning] [MY-010056] [Server] Host name '192-168-71-196.cl2-haproxy-replicas.pxc.svc.cluster.local' could not be resolved: Name or service not known
2020-10-07T15:44:22.049759Z 1599 [Warning] [MY-010056] [Server] Host name '192-168-71-196.cl2-haproxy-replicas.pxc.svc.cluster.local' could not be resolved: Name or service not known
2020-10-07T15:44:25.430895Z 1602 [Warning] [MY-010056] [Server] Host name '192-168-71-196.cl2-haproxy-replicas.pxc.svc.cluster.local' could not be resolved: Name or service not known
2020-10-07T15:44:28.761998Z 1605 [Warning] [MY-010056] [Server] Host name '192-168-71-196.cl2-haproxy-replicas.pxc.svc.cluster.local' could not be resolved: Name or service not known

Environment

None

Smart Checklist

Activity

Slava Sarzhan December 18, 2020 at 3:12 PM

P.S. I can try to reproduce this specific case, but I need to know how. Please ping me if needed.

Slava Sarzhan December 18, 2020 at 3:09 PM

Hi  , I have found under the task https://perconadev.atlassian.net/browse/K8SPXC-564#icft=K8SPXC-564 that our liveness probe works incorrectly: if the 'sst_in_progress' file exists, we do not restart the pod because an SST is presumed to be in progress. However, we had the "progress=$DATADIR/sst_in_progress" option in node.cnf, and this file is created on the donor node but never deleted. This issue will be fixed under https://perconadev.atlassian.net/browse/K8SPXC-564#icft=K8SPXC-564
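The failure mode described in this comment can be sketched as follows (a hypothetical reconstruction, not the actual liveness-check.sh; the marker path and the function name are illustrative):

```shell
# While the sst_in_progress marker exists, the probe reports success, so the
# kubelet never restarts the pod. Because node.cnf carried
# progress=$DATADIR/sst_in_progress, the donor created the file and never
# deleted it, leaving the probe permanently "healthy".
DATADIR="$(mktemp -d)"            # stand-in for /var/lib/mysql
touch "$DATADIR/sst_in_progress"  # stale marker left behind on the donor

liveness_check() {
  if [ -f "$DATADIR/sst_in_progress" ]; then
    return 0   # SST presumed in progress: always report healthy
  fi
  return 1     # here the real health check (e.g. a mysqladmin ping) would run
}

liveness_check && echo "probe keeps passing while the stale marker exists"
rm -f "$DATADIR/sst_in_progress"
liveness_check || echo "with the marker gone, the real check can fail the probe"
```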

Duplicate

Details

Assignee

Reporter

Time tracking

30m logged

Priority


Created October 7, 2020 at 4:38 PM
Updated March 5, 2024 at 6:05 PM
Resolved December 18, 2020 at 3:13 PM
