Certificate Rotation brought the Sharded MongoDB cluster down

General

Escalation

Description

Regular certificate rotation brings the Sharded MongoDB cluster down.

Chain of events:

1. Cert-Manager detects that the Mongo certificates will expire soon and triggers certificate rotation.
2. Percona Operator for MongoDB discovers the changes in the certificates and triggers SmartUpdate.
3. SmartUpdate fails and leaves the Mongos pods in CrashLoopBackOff state. All connections to MongoDB are closed.

Attached are the complete PSMDB manifest, with a few values redacted, and the logs from Cert-Manager and Percona Operator for MongoDB. We did not preserve the logs from Mongos or Mongod.

This has already happened on 2 different environments. The first time, we killed all RS0 pods together to restore the MongoDB cluster. The second time, we spent a bit more time trying different options. Restarting single pods from the Mongos or RS0 statefulsets did not help. What helped was restarting the last RS0 pod. It looks like the RS0 statefulset gets stuck somehow, and restarting the last pod allows the rollout or SmartUpdate to proceed.

Environment

None

Attachments

Activity

Stiliyan Stefanov
December 1, 2023 at 3:05 PM

Hi, downtime is not an option for us, even if it is planned and controlled. I would prefer an official fix that allows the operator to handle such certificate rotation or a way to bypass this authentication method between components. No certificates would also be an option.

As for the fix once the issue happens, we can just kill all the RS pods and perhaps the Mongos to bring the server up, but at this point we are already down.

Slava Sarzhan
December 1, 2023 at 2:55 PM

Hi, you can fix this issue, but unfortunately only with downtime for now. What you need to do:
1. pause the cluster using the spec.pause: true option
2. delete the old certificates, issuer and secrets
3. unpause the cluster

New issuers, certificates and secrets will be created and then rotation will work.
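
Steps 1 and 3 above can be sketched as the following fragment of the PSMDB custom resource. This is only an illustration: the cluster name my-cluster-name is a hypothetical placeholder, and the rest of the spec is assumed to stay unchanged.

```yaml
apiVersion: psmdb.percona.com/v1
kind: PerconaServerMongoDB
metadata:
  name: my-cluster-name   # hypothetical name, use your own cluster name
spec:
  # Step 1: set to true to pause the cluster.
  # Step 3: set back to false once the old certificates,
  # issuer and secrets have been deleted.
  pause: true
```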

P.S. We are thinking about how to improve this behavior to avoid downtime.

Slava Sarzhan
November 27, 2023 at 9:28 AM

Hi, a quick update for you. When you run v1.15.0 from scratch, rotation works, but if the operator was upgraded from <= 1.14.0 to 1.15.0, it does not work correctly. We are improving this behavior.

Slava Sarzhan
November 22, 2023 at 8:59 AM

Hi. We are checking this issue.

Stiliyan Stefanov
November 22, 2023 at 8:57 AM

It is actually easy to reproduce. Add spec.renewBefore: 2159h0m0s to the Certificate resource to make cert-manager renew the certificate 89 days and 23 hours before expiration (that is, every 1 hour when the validity is 90 days).
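
A partial sketch of what that change could look like on the Certificate resource. The resource name is a hypothetical placeholder and the duration value assumes the 90-day validity mentioned above; the remaining fields of the existing Certificate (such as secretName and issuerRef) are omitted here and stay as they are.

```yaml
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: my-cluster-name-ssl   # hypothetical name, use the existing Certificate
spec:
  duration: 2160h0m0s      # assumed: 90 days of validity
  renewBefore: 2159h0m0s   # 89 days 23 hours, so renewal is due 1 hour after issuance
```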

Details

Assignee

dmitriy.kostiuk(Deactivated)

Reporter

Stiliyan Stefanov

Labels

tls

Needs QA

Yes

Needs Doc

Yes

Fix versions

1.16.0

Affects versions

1.15.0

Priority

Critical

Created November 22, 2023 at 8:12 AM

Updated May 24, 2024 at 4:20 PM

Resolved May 24, 2024 at 10:34 AM