RestartReplicationQuick called even from Orchestrator cluster where recovery has been globally disabled

Description

A customer runs multiple Orchestrator clusters to manage discovery of many MySQL hosts. Two of the three clusters have recovery disabled globally. However, the clusters with recovery disabled still call RestartReplicationQuick and so still perform a recovery action.

 

The flow of the code is:

executeCheckAndRecoverFunction

    -> runEmergentOperations

      -> emergentlyRestartReplicationOnTopologyInstanceReplicas

         -> emergentlyRestartReplicationOnTopologyInstance

            -> RestartReplicationQuick

    -> IsRecoveryDisabled

 

Nothing guards the call to RestartReplicationQuick: IsRecoveryDisabled is checked only after runEmergentOperations has already run.

Suggested fix:

Perform the IsRecoveryDisabled check before initiating RestartReplicationQuick.
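
The suggested guard could look roughly like the sketch below. It is a self-contained model, not Orchestrator's actual code: the recoveryState type and the lower-case method names are hypothetical stand-ins for IsRecoveryDisabled, runEmergentOperations and RestartReplicationQuick, whose real signatures may differ.

```go
package main

import "fmt"

// recoveryState is a hypothetical stand-in for Orchestrator's state.
type recoveryState struct {
	globallyDisabled bool     // set by disable-global-recoveries
	restarted        []string // replicas on which replication was restarted
}

// isRecoveryDisabled plays the role of IsRecoveryDisabled (assumed signature).
func (s *recoveryState) isRecoveryDisabled() bool {
	return s.globallyDisabled
}

// restartReplicationQuick stands in for RestartReplicationQuick.
// It only records the call instead of touching a real replica.
func (s *recoveryState) restartReplicationQuick(replica string) {
	s.restarted = append(s.restarted, replica)
}

// runEmergentOperations checks the global flag first and restarts
// replication on the replicas only when recovery is enabled.
func (s *recoveryState) runEmergentOperations(replicas []string) {
	if s.isRecoveryDisabled() {
		return // recovery is disabled globally: do nothing
	}
	for _, replica := range replicas {
		s.restartReplicationQuick(replica)
	}
}

func main() {
	s := &recoveryState{globallyDisabled: true}
	s.runEmergentOperations([]string{"replica:5000"})
	fmt.Println("restarts with recovery disabled:", len(s.restarted))
}
```

Placing the check at the top of the emergent-operations step, rather than around each individual restart, keeps any other emergent action in that step behind the same guard. Whether the real fix does this is not stated in this ticket.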

Environment

None

AFFECTED CS IDs

CS0050200

Activity

Kamil Holubicki 
November 7, 2024 at 2:07 PM

Kamil Holubicki 
November 7, 2024 at 1:59 PM

Hints for QA:

To trigger this case, we need:

  1. a working source node that Orchestrator cannot connect to

  2. a lagging replica

Steps to reproduce:

  1. Create a two-node replication chain: source (port 4000) → replica (port 5000)

  2. Allow Orchestrator to discover the topology

  3. Disable recovery globally: orchestrator-client -c disable-global-recoveries

  4. Start a user session on the source and run:

    1. CREATE TABLE t(id INT AUTO_INCREMENT PRIMARY KEY, v DOUBLE, d DATETIME);

    2. SET binlog_format = STATEMENT;

    3. INSERT INTO t (v, d) SELECT 1, NOW() FROM information_schema.INNODB_METRICS LIMIT 10;

    4. Do not close the connection.

  5. On the source node: set global max_connections=2;

  6. Restart Orchestrator. It will report a dead master, and its error log will show the “too many connections” error returned by the source node

  7. Execute UPDATE t SET v = SLEEP(15), d = now() WHERE id = 1; in user session (the one from point 4)

  8. After a while, Orchestrator should detect UnreachableMasterWithLaggingReplicas (indicated in the logs)

  9. Orchestrator will then attempt to restart replication on the replica. This is visible in the Orchestrator logs:
    2024-11-07 13:55:47 INFO stop slave io_thread on kamil-Latitude-5531:5000 as part of RestartReplicationQuick
    2024-11-07 13:57:20 INFO start slave io_thread on kamil-Latitude-5531:5000 as part of RestartReplicationQuick

The problem is point 9. When recovery is disabled globally, Orchestrator should not attempt to restart replication on the replica.

Kamil Holubicki 
November 7, 2024 at 1:38 PM

The fix will go into Orchestrator v3.2.6-15 (PS distribution 8.0.40).

Done

Details

Created October 24, 2024 at 10:33 PM
Updated January 14, 2025 at 10:18 AM
Resolved November 12, 2024 at 9:38 AM