RestartReplicationQuick called even from Orchestrator cluster where recovery has been globally disabled

General

Escalation

Description

A customer runs multiple Orchestrator clusters to manage discovery of many MySQL hosts. Two of the three clusters have recovery disabled globally. However, the clusters with recovery disabled still call RestartReplicationQuick and perform recovery.

The flow of the code is:

executeCheckAndRecoverFunction

-> runEmergentOperations

-> emergentlyRestartReplicationOnTopologyInstanceReplicas

-> emergentlyRestartReplicationOnTopologyInstance

-> RestartReplicationQuick

-> IsRecoveryDisabled

There is no IsRecoveryDisabled check around the call to RestartReplicationQuick.

Suggested fix:

Perform IsRecoveryDisabled check prior to initiating RestartReplicationQuick.

Environment

None

AFFECTED CS IDs

CS0050200

Linked work items

relates to

DISTMYSQL-465

Orchestrator RestartReplicationQuick fails with Error 1065 (query was empty)

Activity

Kamil Holubicki
November 7, 2024 at 2:07 PM

Kamil Holubicki
November 7, 2024 at 1:59 PM

Hints for QA:

To trigger this case, we need:

A working source node that Orchestrator is unable to connect to
A lagging replica

Steps to reproduce:

Create a 2-node replication chain: source (port 4000) → replica (port 5000)
Allow Orchestrator to discover the topology
Disable recovery globally: orchestrator-client -c disable-global-recoveries
Start user session on source
1. CREATE TABLE t(id INT AUTO_INCREMENT PRIMARY KEY, v DOUBLE, d DATETIME);
2. SET binlog_format = STATEMENT;
3. INSERT INTO t (v, d) SELECT 1, NOW() FROM information_schema.INNODB_METRICS LIMIT 10;
4. do not close the connection
On the source node: SET GLOBAL max_connections=2;
Restart Orchestrator. It will indicate that there is a dead master. The Orchestrator error log will show the “too many connections” error returned by the source node
Execute UPDATE t SET v = SLEEP(15), d = now() WHERE id = 1; in user session (the one from point 4)
After a while, Orchestrator should detect UnreachableMasterWithLaggingReplicas (indicated in the logs)
Orchestrator should attempt to restart replication on the replica. This is visible in the Orchestrator logs:
2024-11-07 13:55:47 INFO stop slave io_thread on kamil-Latitude-5531:5000 as part of RestartReplicationQuick
2024-11-07 13:57:20 INFO start slave io_thread on kamil-Latitude-5531:5000 as part of RestartReplicationQuick

The problem is point 8. When recovery is disabled globally, restarting the replica should not be attempted.
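
For verification, the log check in the last step can be scripted. The following is a rough sketch that scans log text for the "as part of RestartReplicationQuick" marker seen in the log lines above; the helper name `findRestartAttempts` and the unrelated sample log line are assumptions made for illustration. With recovery disabled globally, the expectation after the fix is that no such lines appear.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// restartMarker is the suffix Orchestrator logs for these restart attempts,
// taken from the log lines quoted in this ticket.
const restartMarker = "as part of RestartReplicationQuick"

// findRestartAttempts returns every log line that contains restartMarker.
// Hypothetical QA helper, not part of Orchestrator.
func findRestartAttempts(log string) []string {
	var hits []string
	sc := bufio.NewScanner(strings.NewReader(log))
	for sc.Scan() {
		if line := sc.Text(); strings.Contains(line, restartMarker) {
			hits = append(hits, line)
		}
	}
	return hits
}

func main() {
	// The first and last lines are from this ticket; the middle line is made up.
	log := strings.Join([]string{
		"2024-11-07 13:55:47 INFO stop slave io_thread on kamil-Latitude-5531:5000 as part of RestartReplicationQuick",
		"2024-11-07 13:56:00 INFO some unrelated message",
		"2024-11-07 13:57:20 INFO start slave io_thread on kamil-Latitude-5531:5000 as part of RestartReplicationQuick",
	}, "\n")

	hits := findRestartAttempts(log)
	fmt.Println(len(hits)) // 2
}
```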

Kamil Holubicki
November 7, 2024 at 1:38 PM

It will go to Orchestrator v3.2.6-15 (PS distribution 8.0.40)

Done

Details

Assignee

Kamil Holubicki

Reporter

Dov Endress

Labels

cs-tag-004

Planned Version/s

8.0.40(PS)

8.4.3(PS)

Needs QA

In progress time

Time tracking

No time logged, 1d remaining

Components

Sprint