Issues

Select view

Select search mode

 
50 of 405

Fix error management during relocation of replicas.

Done

Description

We're seeing the message: "Relocating N replicas of X below Y turns to be too complex, please do it manually"

It only appears once in the code in the function relocateReplicasInternal.

I think this is the problematic code

movedReplicas, unmovedReplicas, err, errs := moveReplicasViaGTID(replicas, other, nil)

if len(movedReplicas) == len(replicas) {
  // Moved (or tried moving) everything via GTID
   return movedReplicas, err, errs
} else if len(movedReplicas) > 0 {
   // something was moved via GTID; let's try further on
   return relocateReplicasInternal(unmovedReplicas, instance, other)
{{  }}}
  // Otherwise nothing was moved via GTID. Maybe we don't have any GTIDs, we continue.

If moveReplicasViaGTID does not move any replica (For whatever reason, perhaps something happens and there is an error moving the root of the replica tree) there is no return and the function will continue returning the confusing message. This is the last message that appears in the graphical interface.

This piece of code is not checking the possible errors: err and errs.

Environment

None

AFFECTED CS IDs

CS0027310

Attachments

2
  • 01 Jun 2023, 04:21 PM
  • 27 Sep 2022, 12:29 PM

Confluence content

mentioned on

Smart Checklist

Details

Assignee

Reporter

Needs QA

Yes

Components

Fix versions

Priority

Smart Checklist

Created June 7, 2022 at 7:22 PM
Updated May 22, 2024 at 10:44 AM
Resolved February 29, 2024 at 6:29 AM

Activity

Show:

Yves TrudeauJune 19, 2023 at 1:58 PM

Hi,

   I agree, I modified the patch to "record" the errors is any and return them without breaking the logic.  I'll initiate a new pull request when Jenkins will have passed over the new patch.

Kamil HolubickiJune 5, 2023 at 4:59 PM

Hi, I analyzed briefly this function and I'm not sure what is the actual problem to be fixed.

From the description of the problem:

If moveReplicasViaGTID does not move any replica (For whatever reason, perhaps something happens and there is an error moving the root of the replica tree) there is no return and the function will continue returning the confusing message.

moveReplicasViaGTID returns err and errs

err - is not nil only if the function fails to move all replicas (no replicas moved). Otherwise it is nil

errs - it contains errors related to replicas which were not moved

movedReplicas - self explanatory

unmovedReplicas - replicas for which movement ended up with error. The error is already in errs.

I see that it is possible that moveReplicasViaGTID() does not move any replica, and returns err and errs (internally if replica movement ends up with error, the replica is added to unmovedReplicas set and error is added to errs set). So it seems to be "normal" situation when no replica was moved and we have a bunch of errors. In such a case relocateReplicasInternal() should just try with other methods than 'via GTID'  

I see no reason why we should return in this case.

In such a case we go to the code below, in the section 'Pseudo GTID', which seems to be fine. 

Moreover,  relocateReplicasInternal() is designed in a way, that it tries to move replicas in several ways, and if some replica is not able to be relocated ulimately, we end up with the error:

'Relocating %+v replicas of %+v below %+v turns to be too complex;'

which seems to be the desired behavior.
 
I agree that we lose error information from moveReplicasViaGTID() call and if any 'non GTID' fallback is not able to move the replicas we just write Relocating %+v replicas of %+v below %+v turns to be too complex;' error.
 
Is the intention of the fix to cache the intermediate error set returned by moveReplicasViaGTID (errs) and append it to the final error returned by the function in case of inability to move the replicas?
Or the intention is to return from relocateReplicasInternal() (do not go to 'non GTID' part) in case of moveReplicasViaGTID() returning err!=nil (it is only returned if all replicas failed to move). If yes, why?

Yves TrudeauJune 1, 2023 at 4:22 PM

Fixed by https://github.com/percona/orchestrator/pull/20

With the fix, the error returned includes the sub-errors, uncovering why the operation failed.

Sveta SmirnovaSeptember 27, 2022 at 1:11 PM
Edited

Fix for this needs to be better logging. log_slave_updates=OFF is just one way to cause this error, there are could be dozens of others.

Aaditya DubeySeptember 27, 2022 at 12:29 PM

Hi ,

Thank you for the report.
We have repeated the this "Relocating N replicas of X below Y turns to be too complex, please do it manually" error by performing below steps:

  1. Set-up classic master slave topology where we have 1 master having 5 slaves.

  2. We turned off log_slave_updates in one of the slave.

  3. We tried to promote that slave as master and encounter the reported error, Please check the attached screen recording for more details.

Flag notifications