[BUG] Stale deletionTimestamp information leads to undesired statefulset and PVC deletion
General
Escalation
General
Escalation
Description
Description
We find that mongodb-operator could accidentally delete the statefulset and PVC belonging to the PerconaServerMongoDB after experiencing a restart and talking to one stale apiserver in an HA k8s cluster. Although the root cause is in the staleness from apiserver, the controller code can be further enhanced to avoid such undesired deletion.
More concretely, if the controller receives a stale update event saying "PerconaServerMongoDB has non-nil DeletionTimestamp" from the stale apiserver, `deletePvcFinalizer` will be invoked to delete both the PVC and the statefulset. Note that the controller only lists the related PVC and statefulset using clusterLabels (i.e., the name), so it cannot tell whether the listed PVC and statefulset really belong to the "to-be-deleted" PerconaServerMongoDB when it shares the name with the currently running one. One potential way to enhance this part is to double-check that the controller reference UID of each listed statefulset matches the UID of the "to-be-deleted" PerconaServerMongoDB. We have attached a patch for the UID checking before deleting a statefulset. The UID checking is located in `deleteAllStatefulsets`: ``` func (r *ReconcilePerconaServerMongoDB) deleteAllStatefulsets(cr *api.PerconaServerMongoDB) error { stsList, err := r.getAllstatefulsets(cr) if err != nil { return errors.Wrap(err, "failed to get StatefulSet list") }
for _, sts := range stsList.Items { + if !r.matchUID(cr, &sts) { + continue + } log.Info("deleting StatefulSet", "name", sts.Name) err := r.client.Delete(context.TODO(), &sts) if err != nil { return errors.Wrapf(err, "failed to delete StatefulSet %s", sts.Name) } } return nil } ``` Since every two objects, even sharing the same name, should have a different UID, checking UID before deletion can help avoid the above problem.
Reproduction
We list concrete reproduction steps as below (in a HA k8s cluster):
1. Create a PerconaServerMongoDB instance `mdb` with finalizer `delete-psmdb-pvc`, and a `replset` sized 3. The controller is talking to apiserver1 (that is not stale). A statefulset and three PVC will be created in the k8s cluster. 2. Delete `mdb`. Apiserver1 will send the update events with a non-nil DeletionTimestamp to the controller, which triggers `deletePvcFinalizer` to delete the related statefulset and PVC during reconcile. Meanwhile, apiserver2 is partitioned so its watch cache stops at the moment that `mdb` is tagged with a non-nil DeletionTimestamp. 3. Create the PerconaServerMongoDB with the same name `mdb`. Now the statefulset and PVC get back. However, apiserver2 still holds the stale view that `mdb` has a non-nil DeletionTimestamp and is about to be deleted. 4. The controller restarts after a node failure and talks to the stale apiserver2. Reading the stale information (non-nil DeletionTimestamp) about the old `mdb` from apiserver2, `deletePvcFinalizer` gets invoked again and the statefulset and PVC belonging to the existing PerconaServerMongoDB gets listed (as mentioned above) and deleted.
Fix
We have attached a patch to help fix this issue. As mentioned above, the patch performs UID checking before deleting each statefulset. It seems that PVC does not have a controller reference, we can later send another patch to label each PVC with the PerconaServerMongoDB UID and perform the same check before deleting the PVC.
Environment
None
Attachments
2
06 Apr 2021, 06:27 PM
06 Apr 2021, 06:27 PM
Smart Checklist
Activity
Sergey Pronin April 7, 2021 at 11:15 AM
@sieveteam this is great! Thank you for submitting the PR.
Description
We find that mongodb-operator could accidentally delete the statefulset and PVC belonging to the PerconaServerMongoDB after experiencing a restart and talking to one stale apiserver in an HA k8s cluster. Although the root cause is in the staleness from apiserver, the controller code can be further enhanced to avoid such undesired deletion.
More concretely, if the controller receives a stale update event saying "PerconaServerMongoDB has non-nil DeletionTimestamp" from the stale apiserver, `deletePvcFinalizer` will be invoked to delete both the PVC and the statefulset. Note that the controller only lists the related PVC and statefulset using clusterLabels (i.e., the name), so it cannot tell whether the listed PVC and statefulset really belong to the "to-be-deleted" PerconaServerMongoDB when it shares the name with the currently running one. One potential way to enhance this part is to double-check that the controller reference UID of each listed statefulset matches the UID of the "to-be-deleted" PerconaServerMongoDB. We have attached a patch for the UID checking before deleting a statefulset. The UID checking is located in `deleteAllStatefulsets`:
```
func (r *ReconcilePerconaServerMongoDB) deleteAllStatefulsets(cr *api.PerconaServerMongoDB) error {
stsList, err := r.getAllstatefulsets(cr)
if err != nil {
return errors.Wrap(err, "failed to get StatefulSet list")
}
for _, sts := range stsList.Items {
+ if !r.matchUID(cr, &sts) {
+ continue
+ }
log.Info("deleting StatefulSet", "name", sts.Name)
err := r.client.Delete(context.TODO(), &sts)
if err != nil {
return errors.Wrapf(err, "failed to delete StatefulSet %s", sts.Name)
}
}
return nil
}
```
Since every two objects, even sharing the same name, should have a different UID, checking UID before deletion can help avoid the above problem.
Reproduction
We list concrete reproduction steps as below (in a HA k8s cluster):
1. Create a PerconaServerMongoDB instance `mdb` with finalizer `delete-psmdb-pvc`, and a `replset` sized 3. The controller is talking to apiserver1 (that is not stale). A statefulset and three PVC will be created in the k8s cluster.
2. Delete `mdb`. Apiserver1 will send the update events with a non-nil DeletionTimestamp to the controller, which triggers `deletePvcFinalizer` to delete the related statefulset and PVC during reconcile. Meanwhile, apiserver2 is partitioned so its watch cache stops at the moment that `mdb` is tagged with a non-nil DeletionTimestamp.
3. Create the PerconaServerMongoDB with the same name `mdb`. Now the statefulset and PVC get back. However, apiserver2 still holds the stale view that `mdb` has a non-nil DeletionTimestamp and is about to be deleted.
4. The controller restarts after a node failure and talks to the stale apiserver2. Reading the stale information (non-nil DeletionTimestamp) about the old `mdb` from apiserver2, `deletePvcFinalizer` gets invoked again and the statefulset and PVC belonging to the existing PerconaServerMongoDB gets listed (as mentioned above) and deleted.
Fix
We have attached a patch to help fix this issue. As mentioned above, the patch performs UID checking before deleting each statefulset. It seems that PVC does not have a controller reference, we can later send another patch to label each PVC with the PerconaServerMongoDB UID and perform the same check before deleting the PVC.