Certificate Rotation brought the Sharded MongoDB cluster down
General
Escalation
Description
Regular certificate rotation brings the Sharded MongoDB cluster down.
Chain of events:
1. Cert-Manager detects that the Mongo certificates will expire soon and triggers certificate rotation.
2. Percona Operator for MongoDB detects the changed certificates and triggers SmartUpdate.
3. SmartUpdate fails and leaves the mongos pods in the CrashLoopBackOff state. All connections to MongoDB are closed.
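To confirm the state described in the last step, the output of `kubectl get pods -o json` can be scanned for containers waiting with reason `CrashLoopBackOff`. The following is a minimal sketch, not part of the original report: the function name and the sample pod names (`my-cluster-...`) are hypothetical, and the sample dictionary only mimics the shape of a Kubernetes pod list.

```python
def crashlooping_pods(pod_list: dict) -> list[str]:
    """Return names of pods that have a container waiting in CrashLoopBackOff.

    `pod_list` is the parsed JSON from `kubectl get pods -o json`.
    """
    names = []
    for pod in pod_list.get("items", []):
        statuses = pod.get("status", {}).get("containerStatuses", [])
        for container in statuses:
            # `state.waiting` is only present while the container is not running.
            waiting = container.get("state", {}).get("waiting") or {}
            if waiting.get("reason") == "CrashLoopBackOff":
                names.append(pod["metadata"]["name"])
                break  # one crashing container is enough to flag the pod
    return names


# Hypothetical sample data, shaped like a Kubernetes pod list.
sample = {
    "items": [
        {
            "metadata": {"name": "my-cluster-mongos-0"},
            "status": {"containerStatuses": [
                {"state": {"waiting": {"reason": "CrashLoopBackOff"}}}
            ]},
        },
        {
            "metadata": {"name": "my-cluster-rs0-0"},
            "status": {"containerStatuses": [
                {"state": {"running": {}}}
            ]},
        },
    ]
}

affected = crashlooping_pods(sample)
```

In practice the input would come from `json.load` on the real `kubectl` output for the cluster's namespace rather than from the sample above.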
Attached are the complete PSMDB manifest, with a few values redacted, and the logs from Cert-Manager and Percona Operator for MongoDB. We did not preserve the logs from mongos or mongod.
This has already happened in two different environments. The first time, we killed all RS0 pods at once to restore the MongoDB cluster. The second time, we spent a bit more time trying different options. Restarting individual pods from the mongos or RS0 statefulsets did not help. What did help was restarting the last RS0 pod. It looks like the RS0 statefulset gets stuck somehow, and restarting the last pod allows the rollout or SmartUpdate to proceed.
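The workaround above (restarting the last RS0 pod) can be sketched as follows. This is an illustration under assumptions, not a fix for the underlying problem: it assumes "last" means the pod with the highest StatefulSet ordinal, and the statefulset name `my-cluster-rs0`, the namespace `mongodb`, and both helper functions are hypothetical. The sketch only builds the `kubectl` command; it does not run it.

```python
import re


def last_pod(pod_names: list[str], statefulset: str) -> str:
    """Return the pod with the highest ordinal in the given statefulset.

    StatefulSet pods are named `<statefulset>-<ordinal>`, so ordinals are
    compared numerically (rs0-10 comes after rs0-2).
    """
    pattern = re.compile(rf"^{re.escape(statefulset)}-(\d+)$")
    candidates = []
    for name in pod_names:
        match = pattern.match(name)
        if match:
            candidates.append((int(match.group(1)), name))
    if not candidates:
        raise ValueError(f"no pods found for statefulset {statefulset!r}")
    return max(candidates)[1]


def delete_command(pod: str, namespace: str) -> list[str]:
    """Build the kubectl invocation that deletes (and so restarts) one pod."""
    return ["kubectl", "delete", "pod", pod, "--namespace", namespace]


# Hypothetical pod names and namespace, for illustration only.
pods = [
    "my-cluster-rs0-0",
    "my-cluster-rs0-1",
    "my-cluster-rs0-2",
    "my-cluster-mongos-0",
]
target = last_pod(pods, "my-cluster-rs0")
command = delete_command(target, "mongodb")
```

The resulting list can be passed to `subprocess.run(command, check=True)` once the target pod has been verified by hand.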
Environment
None
Attachments
3
Activity
Olivier Doucet September 15, 2023 at 7:57 AM
It may be difficult to test, as the source code references the release version directly: