Certificate Rotation brought the Sharded MongoDB cluster down

Description

Routine certificate rotation brings the sharded MongoDB cluster down.

Chain of events:

Cert-Manager detects that the Mongo certificates will expire soon and triggers certificate rotation.
The Percona Operator for MongoDB detects the certificate changes and triggers SmartUpdate.
SmartUpdate fails and leaves the Mongos pods in the CrashLoopBackOff state. All connections to MongoDB are closed.
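
To confirm the last step of the chain on a live cluster, one option is to scan the pod list for containers waiting in CrashLoopBackOff. The following is a minimal sketch, not part of the original report: it assumes the input is the parsed JSON from `kubectl get pods -o json`, and that the affected pods contain "mongos" in their names. The function name and the sample pod names are hypothetical.

```python
def crashlooping_pods(pod_list, name_fragment="mongos"):
    """Return names of matching pods that have a container in CrashLoopBackOff.

    pod_list is the parsed output of `kubectl get pods -o json` (a PodList).
    name_fragment is an assumed naming convention, not taken from the report.
    """
    stuck = []
    for pod in pod_list.get("items", []):
        name = pod.get("metadata", {}).get("name", "")
        if name_fragment not in name:
            continue
        # containerStatuses may be missing while a pod is still being scheduled.
        statuses = pod.get("status", {}).get("containerStatuses") or []
        for status in statuses:
            waiting = (status.get("state") or {}).get("waiting") or {}
            if waiting.get("reason") == "CrashLoopBackOff":
                stuck.append(name)
                break  # one crashing container is enough to flag the pod
    return stuck


# Hypothetical sample with one crashing mongos pod and one healthy RS0 pod.
sample = {
    "items": [
        {
            "metadata": {"name": "my-cluster-mongos-0"},
            "status": {"containerStatuses": [
                {"state": {"waiting": {"reason": "CrashLoopBackOff"}}}
            ]},
        },
        {
            "metadata": {"name": "my-cluster-rs0-0"},
            "status": {"containerStatuses": [{"state": {"running": {}}}]},
        },
    ]
}
print(crashlooping_pods(sample))
```

In practice the dictionary would come from `json.loads` applied to the `kubectl` output for the namespace where the cluster runs.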

Attached are the complete PSMDB manifest, with a few values redacted, and logs from Cert-Manager and the Percona Operator for MongoDB. We did not preserve the logs from Mongos or Mongod.

This has already happened in 2 different environments. The first time, we killed all RS0 pods at once to restore the MongoDB cluster. The second time, we spent a bit more time trying different options. Restarting single pods from the Mongos or RS0 statefulsets did not help. What helped was restarting the last RS0 pod. It looks like the RS0 statefulset gets stuck somehow, and restarting the last pod allows the rollout or SmartUpdate to proceed.
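
The workaround described above can be sketched as follows. This is an illustration under stated assumptions, not a recommended procedure: it assumes "the last RS0 pod" means the pod with the highest StatefulSet ordinal, and the statefulset name `my-cluster-rs0`, the namespace `psmdb`, and both function names are hypothetical. The sketch only builds the `kubectl delete pod` command; it does not run it.

```python
import re


def last_statefulset_pod(pod_names, statefulset):
    """Return the pod with the highest ordinal in the given StatefulSet.

    StatefulSet pods are named <statefulset>-<ordinal>, so the ordinal is
    compared numerically (rs0-10 sorts after rs0-2).
    """
    pattern = re.compile(rf"^{re.escape(statefulset)}-(\d+)$")
    candidates = []
    for name in pod_names:
        match = pattern.match(name)
        if match:
            candidates.append((int(match.group(1)), name))
    if not candidates:
        return None
    return max(candidates)[1]


def delete_command(pod, namespace):
    """Build (but do not execute) the kubectl command that restarts the pod."""
    return ["kubectl", "delete", "pod", pod, "--namespace", namespace]


# Hypothetical pod names for a cluster called "my-cluster".
pods = ["my-cluster-rs0-0", "my-cluster-rs0-1", "my-cluster-rs0-2",
        "my-cluster-mongos-0"]
target = last_statefulset_pod(pods, "my-cluster-rs0")
print(" ".join(delete_command(target, "psmdb")))
```

Deleting the pod lets the StatefulSet controller recreate it, which matches the restart the reporters performed by hand.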

Environment

None

Attachments

Activity

Olivier Doucet September 15, 2023 at 7:57 AM

It may be difficult to test, as the source code references the release version:

https://github.com/percona/percona-server-mongodb-operator/commit/a76717b725322a39fa377e78e3264aaf3488c413#diff-16a2be6f78b2b66122115cb7bdd8779db4f79481ef6b2e6ed6a140d540381c0fR59

We are eagerly waiting for the official release.

Slava Sarzhan September 11, 2023 at 10:54 AM

The issue was fixed in https://github.com/percona/percona-server-mongodb-operator/pull/1287 and will be available in the next PSMDB operator release, which we plan for Q3. Feel free to try it from the main branch (for testing purposes only).

Olivier Doucet September 8, 2023 at 10:03 AM

We have had the same issue for the past few months. When it happens, we also need to kill the rs0, then mongos, then cfg pods to make it work again.

This issue has happened around 20 times on 5 different MongoDB clusters, on 5 different Kubernetes clusters (versions from 1.20 to 1.25).

Done

Details
Assignee: Unassigned
Reporter: Stiliyan Stefanov
Needs QA: Yes
Needs Doc: Yes
Fix versions: 1.15.0
Affects versions: 1.14.0
Priority: Critical

Created August 8, 2023 at 2:24 PM

Updated March 5, 2024 at 4:27 PM

Resolved October 9, 2023 at 2:11 PM
