Operator is vulnerable to misoperations on multiple CR properties, which can drive the cluster into a broken state

Description

Hello xtradb operator developers,

We found that many properties in the CR can easily drive the cluster into a broken state if they are not handled carefully.
For example, specifying a bad value for the properties under `spec.pxc.affinity.advanced` causes the statefulSet to restart, but the restarted pod cannot be scheduled. Other examples include specifying a nonexistent secret for `spec.pxc.envVarsSecret`, a wrong `hookScript`, or a nonexistent `spec.pxc.priorityClassName`.

A concrete example is to submit a CR with the following advanced affinity:
```
spec:
  pxc:
    affinity:
      advanced:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: kubernetes.io/hostname
                operator: In
                values:
                - 'NULL'
...
```
The operator updates the statefulSet, which triggers a rolling update, but the newly created pxc pod cannot be scheduled because no node satisfies the affinity rule.

This can have severe consequences in production. We believe these are misoperation vulnerabilities, where the operator fails to reject a misoperation from users. The operator uses controller-gen to automatically generate validations for many properties, but these static validations fall short for more complicated constraints. For example, rejecting an invalid nodeSelector requires knowing which nodes are available in the Kubernetes cluster, and validating whether an affinity rule is satisfiable requires knowledge that the scheduler has. Validating a hookScript is even more challenging.
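To make the nodeSelector/affinity point concrete, here is a minimal sketch of the kind of check we have in mind: given the `nodeAffinity` block from the CR and the labels of the nodes currently in the cluster, report which nodes could satisfy the required terms. It is written in plain Python over dictionaries and is not taken from the operator code. The function names and the `nodes` input shape are hypothetical, it only looks at `matchExpressions` (not `matchFields`), and it ignores everything else the scheduler considers (taints, resources, pod anti-affinity), so an empty result is a strong warning sign but a non-empty result is not a guarantee.

```python
from typing import Dict, List

Labels = Dict[str, str]


def expression_matches(expr: dict, labels: Labels) -> bool:
    """Evaluate one node selector requirement against a node's labels."""
    key = expr["key"]
    op = expr["operator"]
    values = expr.get("values", [])
    if op == "In":
        return key in labels and labels[key] in values
    if op == "NotIn":
        # A node without the key also satisfies NotIn.
        return key not in labels or labels[key] not in values
    if op == "Exists":
        return key in labels
    if op == "DoesNotExist":
        return key not in labels
    if op in ("Gt", "Lt"):
        try:
            have = int(labels[key])
            want = int(values[0])
        except (KeyError, IndexError, ValueError):
            return False
        return have > want if op == "Gt" else have < want
    raise ValueError(f"unknown operator: {op}")


def term_matches(term: dict, labels: Labels) -> bool:
    """All expressions in a term must match; an empty term matches nothing."""
    exprs = term.get("matchExpressions", [])
    if not exprs:
        return False
    return all(expression_matches(e, labels) for e in exprs)


def matching_nodes(node_affinity: dict, nodes: Dict[str, Labels]) -> List[str]:
    """Return the names of nodes that satisfy the required node affinity.

    `nodes` maps node name -> labels (hypothetical input shape; a real
    check would list nodes through the Kubernetes API).
    """
    required = node_affinity.get("requiredDuringSchedulingIgnoredDuringExecution")
    if not required:
        return sorted(nodes)  # no hard requirement, every node qualifies
    terms = required.get("nodeSelectorTerms", [])
    # Terms are ORed together: one matching term is enough.
    return sorted(
        name for name, labels in nodes.items()
        if any(term_matches(t, labels) for t in terms)
    )


if __name__ == "__main__":
    affinity = {
        "requiredDuringSchedulingIgnoredDuringExecution": {
            "nodeSelectorTerms": [
                {"matchExpressions": [
                    {"key": "kubernetes.io/hostname", "operator": "In", "values": ["NULL"]}
                ]}
            ]
        }
    }
    cluster = {
        "node-a": {"kubernetes.io/hostname": "node-a"},
        "node-b": {"kubernetes.io/hostname": "node-b"},
    }
    print(matching_nodes(affinity, cluster))
```

With the affinity from the example above, no node qualifies, which is exactly the situation where the operator could refuse the change or at least emit a warning before touching the statefulSet.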

We are opening this issue to discuss what you think the best practice is for handling this problem, and what functionality Kubernetes should provide to make this validation easier. Is there a way to prevent the bad operation from happening in the first place, or a way for the operator to automatically recognize that the statefulSet is stuck and perform an automatic recovery? If you know of any practical code fixes for this issue, we are also happy to send a PR.
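For the "recognize that the statefulSet is stuck" option, one possible heuristic is sketched below, again in plain Python over the JSON shape of the objects rather than in the operator's own code. It treats a rollout as stuck when the statefulSet's `status.currentRevision` differs from `status.updateRevision` and a pod of the new revision has had the `PodScheduled` condition at `False` with reason `Unschedulable` for longer than a grace period. The function names and the five-minute default are assumptions for illustration; what recovery should then do (for example reverting to the previous spec) is left open.

```python
from datetime import datetime, timedelta, timezone
from typing import List, Optional


def parse_rfc3339(ts: str) -> datetime:
    """Parse timestamps such as 2023-04-12T19:57:00Z."""
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))


def unschedulable_since(pod: dict) -> Optional[datetime]:
    """Return when the pod became unschedulable, or None if it is not."""
    for cond in pod.get("status", {}).get("conditions", []):
        if (cond.get("type") == "PodScheduled"
                and cond.get("status") == "False"
                and cond.get("reason") == "Unschedulable"):
            return parse_rfc3339(cond["lastTransitionTime"])
    return None


def rollout_is_stuck(statefulset: dict, pods: List[dict], now: datetime,
                     grace: timedelta = timedelta(minutes=5)) -> bool:
    """Heuristic: a rolling update is in progress and a pod of the new
    revision has been unschedulable for at least `grace` (assumed default)."""
    status = statefulset.get("status", {})
    update_rev = status.get("updateRevision")
    if status.get("currentRevision") == update_rev:
        return False  # no rolling update in progress
    for pod in pods:
        labels = pod.get("metadata", {}).get("labels", {})
        # StatefulSet pods carry their revision in this label.
        if labels.get("controller-revision-hash") != update_rev:
            continue
        since = unschedulable_since(pod)
        if since is not None and now - since >= grace:
            return True
    return False


if __name__ == "__main__":
    sts = {"status": {"currentRevision": "pxc-1", "updateRevision": "pxc-2"}}
    pod = {
        "metadata": {"labels": {"controller-revision-hash": "pxc-2"}},
        "status": {"conditions": [{
            "type": "PodScheduled", "status": "False",
            "reason": "Unschedulable",
            "lastTransitionTime": "2023-04-12T19:57:00Z",
        }]},
    }
    now = datetime(2023, 4, 12, 20, 10, tzinfo=timezone.utc)
    print(rollout_is_stuck(sts, [pod], now))
```

A check like this only detects the scheduling failure in the affinity example; a missing secret or a failing hook script would surface through different pod states and would need their own conditions.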

We are also happy to provide the full list of properties that are vulnerable to misoperations if you are interested.

Environment

None

Activity

Aaditya Dubey April 20, 2023 at 4:50 AM

Hi,

Thank you for the report.
Sending the concern to engineering for further review and updates.
Meanwhile, please share the PR if available; our development team will take a look.

Details

Needs QA

Yes

Created April 12, 2023 at 7:57 PM
Updated March 5, 2024 at 5:27 PM