Issues
- K8SPS-436: Allow setting loadBalancerClass
- K8SPS-435: Don't run scheduled backup if DB is unhealthy
- K8SPS-434: Make DB version upgrade non-blocking (George Kechagias)
- K8SPS-432: Add upgrade in progress annotation
- K8SPS-431: Documentation is mentioning PostgreSQL instead of MySQL (dmitriy.kostiuk)
- K8SPS-430: Operator doesn't update TLS certificates if SANs are updated in CR
- K8SPS-429: Consider explicitly failing over the primary in smart update
- K8SPS-428: gitignore the vendor directory (George Kechagias)
- K8SPS-427: Validate CR input using k8s validations and remove the operator validation logic and related logging
- K8SPS-426: Add Label for CRD (Julio Pasinatto)
- K8SPS-425: Fix group replication bootstrap for 8.4 (ege.gunes)
- K8SPS-424: Update Jenkinsfile to use 1.29 cluster version
- K8SPS-423: Implement NOTIFY_SOCKET and state-monitor (Andrii Dema)
- K8SPS-422: Improve database update
- K8SPS-421: Add data at rest encryption support
- K8SPS-420: Data at rest encryption support
- K8SPS-419: Add support for postStart/preStop hooks
- K8SPS-418: Add gracePeriod support
- K8SPS-417: Add podDisruptionBudget support
- K8SPS-416: Add tolerations into our CR
- K8SPS-415: Add runtimeClassName into our CR
- K8SPS-414: Add imagePullSecrets into our CR
- K8SPS-413: Add possibility of adding resources and containerSecurityContext for init containers
- K8SPS-412: Improve basic features
- K8SPS-411: Add asynchronous replication setup for PS cluster running in K8S
- K8SPS-410: Add incremental backups
- K8SPS-409: Add encrypted backups
- K8SPS-408: Create and sync application mysql users during deployment
- K8SPS-407: Add database user management support
- K8SPS-406: Add possibility of adding custom parameters via CR for PMM client
- K8SPS-405: Allow configuring PMM container LivenessProbe
- K8SPS-404: Add PMM support
- K8SPS-403: Add PITR support
- K8SPS-402: Backups throttling, parallel execution control and cancellation
- K8SPS-401: Add example of azure backup type into our CRs
- K8SPS-400: Add support for custom options for xtrabackup, xbstream, xbcloud
- K8SPS-399: Improve backups and restores (Slava Sarzhan)
- K8SPS-398: Release 0.10.0
- K8SPS-397: Cluster-Wide upgrade from v0.8.0 to v0.9.0 causes CrashLoopBackOff and prevents cluster recovery
- K8SPS-396: Investigate why test cases gr-self-healing and gr-self-healing have different DaemonSets for the different operators (Julio Pasinatto)
- K8SPS-395: Output warning about CRDs after helm upgrade
- K8SPS-394: When we change cluster type on a running cluster, it should fail properly
- K8SPS-393: Add support for PMM v3 (Julio Pasinatto)
- K8SPS-392: Add ability to increase timeout for the CLONE operation (Julio Pasinatto)
- K8SPS-391: Shfmt script consistent formatting (resolved; George Kechagias)
- K8SPS-390: Automate images and k8s platform versions provisioning for Jenkins jobs (Pavel Tankov)
- K8SPS-389: Default containers for the mysql and haproxy pods should be the ones providing that service
- K8SPS-388: PS-Operator cannot create ps-db-mysql and ps-db-orc StatefulSet when Resource Quota is enabled (resolved; Pavel Tankov)
- K8SPS-387: Add wait_for_delete() function because we use it in case of (upcoming) OpenShift support (Eleonora Zinchenko)
- K8SPS-386: Extract commands in docs to separate files (dmitriy.kostiuk)
Fix error management during relocation of replicas.
Details
- Assignee: Yves Trudeau (Deactivated)
- Reporter: Pep Pla
- Needs QA: Yes
- Priority: Medium
Activity
Yves Trudeau, June 19, 2023 at 1:58 PM
Hi,
I agree. I modified the patch to "record" the errors, if any, and return them without breaking the logic. I'll open a new pull request once Jenkins has passed on the new patch.
Kamil Holubicki, June 5, 2023 at 4:59 PM
Hi, I briefly analyzed this function and I'm not sure what the actual problem to be fixed is.
From the description of the problem:
If moveReplicasViaGTID does not move any replica (For whatever reason, perhaps something happens and there is an error moving the root of the replica tree) there is no return and the function will continue returning the confusing message.
moveReplicasViaGTID returns err and errs:
- err: non-nil only if the function fails to move all replicas (no replicas moved); otherwise nil
- errs: contains the errors for the replicas that were not moved
- movedReplicas: self-explanatory
- unmovedReplicas: replicas whose movement ended with an error (that error is already in errs)
I see that it is possible for moveReplicasViaGTID() to move no replicas and return err and errs (internally, if moving a replica ends with an error, the replica is added to the unmovedReplicas set and the error is added to the errs set). So it seems to be a "normal" situation when no replica was moved and we have a bunch of errors. In such a case relocateReplicasInternal() should just try methods other than 'via GTID'.
I see no reason why we should return in this case.
In such a case we go to the code below, in the section 'Pseudo GTID', which seems to be fine.
Moreover, relocateReplicasInternal() is designed to try moving replicas in several ways, and if some replica ultimately cannot be relocated, we end up with the error:
'Relocating %+v replicas of %+v below %+v turns to be too complex;'
which seems to be the desired behavior.
I agree that we lose the error information from the moveReplicasViaGTID() call, and if no 'non GTID' fallback is able to move the replicas we just write the 'Relocating %+v replicas of %+v below %+v turns to be too complex;' error.
Is the intention of the fix to cache the intermediate error set returned by moveReplicasViaGTID (errs) and append it to the final error returned by the function if the replicas cannot be moved?
Or is the intention to return from relocateReplicasInternal() (and not go to the 'non GTID' part) when moveReplicasViaGTID() returns err != nil (which happens only if all replicas failed to move)? If yes, why?
Yves Trudeau, June 1, 2023 at 4:22 PM
Fixed by https://github.com/percona/orchestrator/pull/20
With the fix, the error returned includes the sub-errors, uncovering why the operation failed.
Sveta Smirnova, September 27, 2022 at 1:11 PM (edited)
The fix for this needs to be better logging. log_slave_updates=OFF is just one way to cause this error; there could be dozens of others.
Aaditya Dubey, September 27, 2022 at 12:29 PM
Hi @Pep Pla,
Thank you for the report.
We have reproduced the "Relocating N replicas of X below Y turns to be too complex, please do it manually" error by performing the steps below:
1. Set up a classic master-slave topology with 1 master and 5 slaves.
2. Turned off log_slave_updates on one of the slaves.
3. Tried to promote that slave to master and encountered the reported error. Please check the attached screen recording for more details.
We're seeing the message: "Relocating N replicas of X below Y turns to be too complex, please do it manually"
It only appears once in the code, in the function relocateReplicasInternal.
I think this is the problematic code
If moveReplicasViaGTID does not move any replica (for whatever reason; perhaps something happens and there is an error moving the root of the replica tree), there is no return and the function continues, ending with the confusing message. This is the last message that appears in the graphical interface.
This piece of code does not check the possible errors: err and errs.