[BUG] Mongos pod never gets ready and complains about network issues after enabling sharding
General
Escalation
Description
We find that after setting spec.sharding.enabled to true to enable sharding, the mongos pod sometimes never gets ready and the sharding feature is also blocked:
$ kubectl get pod
NAME                                              READY   STATUS    RESTARTS   AGE
mongodb-cluster-cfg-0                             1/1     Running   0          41m
mongodb-cluster-cfg-1                             1/1     Running   0          41m
mongodb-cluster-cfg-2                             1/1     Running   0          41m
mongodb-cluster-mongos-9c6576fb6-z8h7h            0/1     Running   10         41m
mongodb-cluster-rs0-0                             1/1     Running   0          43m
mongodb-cluster-rs0-1                             1/1     Running   0          41m
mongodb-cluster-rs0-2                             1/1     Running   0          41m
percona-server-mongodb-operator-9cfc677df-kq5rw   1/1     Running   1          43m
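For context, sharding was enabled by setting spec.sharding.enabled to true in the PerconaServerMongoDB custom resource. A rough sketch of how that change can be applied, assuming the CR is named mongodb-cluster like the pods above (the psmdb short name and the merge patch are just one way to do it):

$ kubectl patch psmdb mongodb-cluster --type=merge -p '{"spec":{"sharding":{"enabled":true}}}'

which corresponds to this fragment of the cr.yaml:

spec:
  sharding:
    enabled: true
    configsvrReplSet:
      size: 3   # matches the 3 cfg pods above
    mongos:
      size: 1   # matches the single mongos pod above

The configsvrReplSet and mongos sizes here only mirror what the pod listing shows; they are not required values.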
The mongos pod keeps complaining about network issues:
Events:
Type     Reason     Age                   From               Message
----     ------     ----                  ----               -------
Normal   Pulling    44m                   kubelet            Pulling image "ghcr.io/ourproject/action/mongodb-operator:latest"
Normal   Scheduled  44m                   default-scheduler  Successfully assigned default/mongodb-cluster-mongos-9c6576fb6-z8h7h to kind-worker2
Normal   Pulled     44m                   kubelet            Successfully pulled image "ghcr.io/ourproject/action/mongodb-operator:latest"
Normal   Created    44m                   kubelet            Created container mongo-init
Normal   Started    44m                   kubelet            Started container mongo-init
Normal   Pulled     44m                   kubelet            Successfully pulled image "percona/percona-server-mongodb:4.4.3-5"
Normal   Created    44m                   kubelet            Created container mongos
Normal   Started    44m                   kubelet            Started container mongos
Warning  Unhealthy  44m                   kubelet            Readiness probe failed: {"level":"error","msg":"could not connect to localhost:27017. got: dial tcp [::1]:27017: connect: connection refused","time":"2021-10-21T19:50:50Z"}
{"level":"fatal","msg":"connection error: no reachable servers","time":"2021-10-21T19:50:50Z"}
Warning  Unhealthy  44m                   kubelet            Readiness probe failed: {"level":"error","msg":"could not connect to localhost:27017. got: dial tcp [::1]:27017: connect: connection refused","time":"2021-10-21T19:50:51Z"}
{"level":"fatal","msg":"connection error: no reachable servers","time":"2021-10-21T19:50:51Z"}
Warning  Unhealthy  44m                   kubelet            Readiness probe failed: {"level":"error","msg":"could not connect to localhost:27017. got: dial tcp [::1]:27017: connect: connection refused","time":"2021-10-21T19:50:52Z"}
{"level":"fatal","msg":"connection error: no reachable servers","time":"2021-10-21T19:50:52Z"}
Warning  Unhealthy  44m                   kubelet            Readiness probe failed: {"level":"error","msg":"could not connect to localhost:27017. got: dial tcp [::1]:27017: connect: connection refused","time":"2021-10-21T19:50:53Z"}
{"level":"fatal","msg":"connection error: no reachable servers","time":"2021-10-21T19:50:53Z"}
Warning  Unhealthy  44m                   kubelet            Readiness probe failed: {"level":"error","msg":"could not connect to localhost:27017. got: dial tcp [::1]:27017: connect: connection refused","time":"2021-10-21T19:50:54Z"}
{"level":"fatal","msg":"connection error: no reachable servers","time":"2021-10-21T19:50:54Z"}
Warning  Unhealthy  44m                   kubelet            Readiness probe failed: {"level":"error","msg":"could not connect to localhost:27017. got: dial tcp [::1]:27017: connect: connection refused","time":"2021-10-21T19:50:55Z"}
{"level":"fatal","msg":"connection error: no reachable servers","time":"2021-10-21T19:50:55Z"}
Warning  Unhealthy  44m                   kubelet            Readiness probe failed: {"level":"error","msg":"could not connect to localhost:27017. got: dial tcp [::1]:27017: connect: connection refused","time":"2021-10-21T19:50:56Z"}
{"level":"fatal","msg":"connection error: no reachable servers","time":"2021-10-21T19:50:56Z"}
Warning  Unhealthy  44m                   kubelet            Readiness probe failed: {"level":"error","msg":"could not connect to localhost:27017. got: dial tcp [::1]:27017: connect: connection refused","time":"2021-10-21T19:50:57Z"}
{"level":"fatal","msg":"connection error: no reachable servers","time":"2021-10-21T19:50:57Z"}
Warning  Unhealthy  44m                   kubelet            Readiness probe failed: {"level":"error","msg":"could not connect to localhost:27017. got: dial tcp [::1]:27017: connect: connection refused","time":"2021-10-21T19:50:58Z"}
{"level":"fatal","msg":"connection error: no reachable servers","time":"2021-10-21T19:50:58Z"}
Warning  Unhealthy  14m (x1435 over 44m)  kubelet            (combined from similar events): Readiness probe failed: {"level":"error","msg":"could not connect to localhost:27017. got: dial tcp [::1]:27017: connect: connection refused","time":"2021-10-21T20:20:39Z"}
{"level":"fatal","msg":"connection error: no reachable servers","time":"2021-10-21T20:20:39Z"}
Warning  BackOff    9m35s (x28 over 20m)  kubelet            Back-off restarting failed container
Normal   Pulling    4m36s (x11 over 44m)  kubelet            Pulling image "percona/percona-server-mongodb:4.4.3-5"
And in the mongos pod log we can see network-interface-related errors happening repeatedly:
{"t":{"$date":"2021-10-21T20:25:31.048+00:00"},"s":"I", "c":"SHARDING", "id":22740, "ctx":"monitoring-keys-for-HMAC","msg":"Operation timed out","attr":{"error":{"code":202,"codeName":"NetworkInterfaceExceededTimeLimit","errmsg":"Request 68 timed out, deadline was 2021-10-21T20:25:31.048+00:00, op was RemoteCommand 68 -- target:[mongodb-cluster-cfg-0.mongodb-cluster-cfg.default.svc.cluster.local:27017] db:admin expDate:2021-10-21T20:25:31.048+00:00 cmd:{ find: \"system.keys\", filter: { purpose: \"HMAC\", expiresAt: { $gt: Timestamp(0, 0) } }, sort: { expiresAt: 1 }, readConcern: { level: \"majority\", afterOpTime: { ts: Timestamp(0, 0), t: -1 } }, maxTimeMS: 30000 }"}}}
So the kubelet fails to connect to port 27017 (which mongos is supposed to listen on), and that leads to readiness probe failures. Because the readiness probe never succeeds, the pod keeps getting restarted.
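A quick way to confirm that nothing is listening on 27017 inside the pod (rather than, say, the probe only reaching the IPv6 loopback ::1 shown in the error) is to exec into the mongos container. A sketch, assuming the pod and container names from the events above and that the percona-server-mongodb image ships the mongo shell (netstat/ss may or may not be present in the image):

$ kubectl exec -it mongodb-cluster-mongos-9c6576fb6-z8h7h -c mongos -- sh -c 'netstat -tlnp || ss -tlnp'
$ kubectl exec -it mongodb-cluster-mongos-9c6576fb6-z8h7h -c mongos -- mongo --host 127.0.0.1 --port 27017 --eval 'db.adminCommand({ ping: 1 })'

If both commands fail the same way the probe does, mongos itself is not serving on 27017, rather than the probe merely picking the wrong address family.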
Additional information:
$ kubectl get svc
NAME                     TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)     AGE
kubernetes               ClusterIP   10.96.0.1       <none>        443/TCP     52m
mongodb-cluster-cfg      ClusterIP   None            <none>        27017/TCP   48m
mongodb-cluster-mongos   ClusterIP   10.96.113.167   <none>        27017/TCP   47m
mongodb-cluster-rs0      ClusterIP   None            <none>        27017/TCP   50m
$ kubectl get endpoints
NAME                     ENDPOINTS                                              AGE
kubernetes               172.18.0.4:6443                                        52m
mongodb-cluster-cfg      10.244.1.6:27017,10.244.2.9:27017,10.244.3.6:27017     49m
mongodb-cluster-mongos                                                          48m
mongodb-cluster-rs0      10.244.1.7:27017,10.244.2.10:27017,10.244.3.3:27017    50m
Note that the mongodb-cluster-mongos Service above has no endpoints, which is consistent with the mongos pod never becoming Ready. We are trying to figure out why the kubelet sometimes cannot connect to mongos on port 27017, but this is really difficult as the problem happens in a nondeterministic manner and we have not found a way to reliably reproduce it.
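One way to narrow this down while it is happening is to watch the Service endpoints and probe the ports from a throwaway pod. A rough sketch (the nettest pod name and busybox image are placeholders, not part of our setup; if the busybox build lacks nc -z, any image with a TCP client works):

$ kubectl get endpoints mongodb-cluster-mongos -w
$ kubectl run nettest --rm -it --restart=Never --image=busybox -- nc -zv -w 2 mongodb-cluster-cfg-0.mongodb-cluster-cfg.default.svc.cluster.local 27017
$ kubectl run nettest --rm -it --restart=Never --image=busybox -- nc -zv -w 2 mongodb-cluster-mongos.default.svc.cluster.local 27017

The second check targets the config server host that the mongos log above reports timeouts against, and the third will only succeed once the mongos Service gets endpoints, i.e. once the pod becomes Ready.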
Environment
None
Activity
Aaditya Dubey January 19, 2023 at 9:49 AM
Hi @sieveteam,
We still haven't heard back from you, so I assume the reported issue no longer persists and will close the ticket. If you disagree, just reply and create a follow-up.
Aaditya Dubey June 2, 2022 at 8:09 AM
Hi @sieveteam ,
Thank you for the updates. Should we keep this ticket active until the issue becomes visible again?
sieveteam May 30, 2022 at 6:27 PM
Hi Aaditya,
We are not sure whether the problem still exists or not because it is very hard to reproduce and we didn't figure out the root cause.
Aaditya Dubey May 25, 2022 at 9:41 AM
Hi @sieveteam ,
Thank you for the report. Please let me know if the issue still persists.
sieveteam November 27, 2021 at 9:12 PM
Hello Sergey, the problem is not resolved automatically as the probe never passes.
It lasts for hours until we tear down the MongoDB cluster.