Certificate Rotation brought the Sharded MongoDB cluster down

Description

Routine certificate rotation brings the sharded MongoDB cluster down.

Chain of events:

  • Cert-Manager detects that the MongoDB certificates will expire soon and triggers certificate rotation.

  • Percona Operator for MongoDB detects the certificate changes and triggers SmartUpdate.

  • SmartUpdate fails and leaves the mongos pods in the CrashLoopBackOff state. All connections to MongoDB are closed.
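To confirm the last step of this chain during an incident, the affected pods can be picked out of the pod list. The following is a minimal sketch, not something from the original report: it assumes the input is the parsed JSON of `kubectl get pods -o json`, and the pod names in the sample are hypothetical. It relies only on the standard Kubernetes pod status layout, where a crash-looping container reports `state.waiting.reason` as `CrashLoopBackOff`.

```python
from typing import Any


def crashlooping_pods(pod_list: dict[str, Any]) -> list[str]:
    """Return names of pods with at least one container waiting in CrashLoopBackOff.

    `pod_list` is assumed to be the parsed output of `kubectl get pods -o json`.
    """
    names = []
    for pod in pod_list.get("items", []):
        statuses = pod.get("status", {}).get("containerStatuses", [])
        for container_status in statuses:
            # `waiting` is absent (or None) when the container is running or terminated.
            waiting = container_status.get("state", {}).get("waiting") or {}
            if waiting.get("reason") == "CrashLoopBackOff":
                names.append(pod["metadata"]["name"])
                break  # one crash-looping container is enough to flag the pod
    return names


# Hypothetical sample data; the pod names are illustrative only.
sample = {
    "items": [
        {
            "metadata": {"name": "my-cluster-mongos-0"},
            "status": {
                "containerStatuses": [
                    {"state": {"waiting": {"reason": "CrashLoopBackOff"}}}
                ]
            },
        },
        {
            "metadata": {"name": "my-cluster-rs0-0"},
            "status": {"containerStatuses": [{"state": {"running": {}}}]},
        },
    ]
}

print(crashlooping_pods(sample))
```

A check like this could feed an alert, so that a failed rotation is noticed before clients report closed connections.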

Attached are the complete PSMDB manifest, with a few values redacted, and the logs from Cert-Manager and Percona Operator for MongoDB. We did not preserve the logs from mongos or mongod.

This has already happened in 2 different environments. The first time, we killed all RS0 pods together to restore the MongoDB cluster. The second time, we spent a bit more time trying different options. Restarting single pods from the mongos or RS0 StatefulSets did not help. What helped was restarting the last RS0 pod. It looks like the RS0 StatefulSet gets stuck somehow, and restarting the last pod allows the rollout or SmartUpdate to proceed.
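The workaround of restarting the last RS0 pod can be sketched as follows. This is an illustration under stated assumptions, not the exact procedure we ran: it assumes the pod is restarted by deleting it with `kubectl delete pod`, and the StatefulSet name `my-cluster-rs0` and namespace `mongodb` are hypothetical. The only fact it relies on is that Kubernetes names StatefulSet pods `<statefulset>-<ordinal>`, with ordinals running from 0 to replicas - 1 by default.

```python
def last_pod_name(statefulset: str, replicas: int) -> str:
    """Return the name of the highest-ordinal pod of a StatefulSet.

    Assumes the default ordinal numbering, which starts at 0.
    """
    if replicas < 1:
        raise ValueError("StatefulSet has no pods")
    return f"{statefulset}-{replicas - 1}"


def delete_pod_command(pod: str, namespace: str) -> list[str]:
    """Build the kubectl command that deletes (and so restarts) one pod."""
    return ["kubectl", "delete", "pod", pod, "--namespace", namespace]


# Hypothetical names: a 3-member rs0 replica set in the "mongodb" namespace.
pod = last_pod_name("my-cluster-rs0", replicas=3)
command = delete_pod_command(pod, namespace="mongodb")
print(" ".join(command))
```

The command is only built and printed here, so it can be reviewed before anyone runs it against a production cluster.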

Environment

None

Attachments

3

Activity

Olivier Doucet September 15, 2023 at 7:57 AM

It may be difficult to test, as the source code references the release version:

https://github.com/percona/percona-server-mongodb-operator/commit/a76717b725322a39fa377e78e3264aaf3488c413#diff-16a2be6f78b2b66122115cb7bdd8779db4f79481ef6b2e6ed6a140d540381c0fR59

 

We are eagerly waiting for the official release.

Slava Sarzhan September 11, 2023 at 10:54 AM

The issue was fixed in https://github.com/percona/percona-server-mongodb-operator/pull/1287 and will be available in the next PSMDB operator release. We plan to release it in Q3. Feel free to test it from the main branch (but only for testing purposes).

Olivier Doucet September 8, 2023 at 10:03 AM

We have had the same issue for the past months. When it happens, we also need to kill the rs0, then the mongos, then the cfg pods to make it work again.

This issue has happened around 20 times on 5 different MongoDB clusters, on 5 different Kubernetes clusters (versions 1.20 to 1.25).

Done

Details

Needs QA

Yes

Needs Doc

Yes


Created August 8, 2023 at 2:24 PM
Updated March 5, 2024 at 4:27 PM
Resolved October 9, 2023 at 2:11 PM