LP #1491682: GTID replication fails to resume after master disconnect

Description

Reported in Launchpad by monty solomon, last update 05-09-2015 06:52:00

The slave was unable to connect to the master due to network issues. Once the network issues were resolved, the slave did not resume replication even though SHOW SLAVE STATUS listed "Waiting for master to send event".

We sometimes see similar issues when the master is restarted.

The issue appears to be specific to GTID replication.

Repeated executions of SHOW SLAVE STATUS\G produced output containing the following excerpt.

Slave_IO_State: Waiting for master to send event
Connect_Retry: 15
Master_Log_File: bin.010862
Read_Master_Log_Pos: 2528103
Relay_Log_File: relay.001675
Relay_Log_Pos: 307
Relay_Master_Log_File: bin.010862
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
Exec_Master_Log_Pos: 2528103
Relay_Log_Space: 657
Seconds_Behind_Master: 0
Master_Server_Id: 1156476169
Master_UUID: 9f9bf362-83e9-11e4-a45f-0acf4ba07bb2
Slave_SQL_Running_State: Slave has read all relay log; waiting for the slave I/O thread to update it
Master_Retry_Count: 86400
Retrieved_Gtid_Set: 9f9bf362-83e9-11e4-a45f-0acf4ba07bb2:1-1233067
Executed_Gtid_Set: 90fe53a8-1a31-11e5-b854-122ff10e879b:1-6,
9f9bf362-83e9-11e4-a45f-0acf4ba07bb2:1-1233067

After executing "STOP SLAVE; START SLAVE;" the slave resumed replication and Seconds_Behind_Master changed from 0 to 3510.

Slave_IO_State: Waiting for master to send event
Connect_Retry: 15
Master_Log_File: bin.010864
Read_Master_Log_Pos: 11834054
Relay_Log_File: relay.001679
Relay_Log_Pos: 16023
Relay_Master_Log_File: bin.010863
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
Exec_Master_Log_Pos: 15825
Relay_Log_Space: 24322598
Seconds_Behind_Master: 3510
Master_Server_Id: 1156476169
Master_UUID: 9f9bf362-83e9-11e4-a45f-0acf4ba07bb2
Slave_SQL_Running_State: System lock
Master_Retry_Count: 86400
Retrieved_Gtid_Set: 9f9bf362-83e9-11e4-a45f-0acf4ba07bb2:1-1239713
Executed_Gtid_Set: 90fe53a8-1a31-11e5-b854-122ff10e879b:1-6,
9f9bf362-83e9-11e4-a45f-0acf4ba07bb2:1-1233654
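
For reference, the workaround above is simply restarting the replication threads on the slave; a minimal sketch (the exact statements below are illustrative, not taken from the original report):

/* On the slave: restart the replication threads (the workaround used above) */
STOP SLAVE;
START SLAVE;

/* One way to spot the stalled state: on the slave, Retrieved_Gtid_Set equals
   Executed_Gtid_Set (for the master's UUID) while the master's gtid_executed
   keeps growing */
SHOW SLAVE STATUS\G
/* ...and on the master: */
SELECT @@GLOBAL.gtid_executed;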

Environment

None

Attachments

4
  • 30 Jul 2018, 12:32 PM
  • 30 Jul 2018, 11:57 AM
  • 30 Jul 2018, 11:57 AM
  • 30 Jul 2018, 11:55 AM


Activity


Julia Vural December 25, 2024 at 9:54 AM

Doesn't reproduce on newer builds.

Lalit Choudhary July 30, 2018 at 12:31 PM
Edited

Percona Server 5.7.22 also has a similar issue. Log output attached.

Steps to reproduce:

  1. Set up master-slave replication.

  2. Enable Performance Schema instruments on the slave:

 

/* Enable all instrumentation, but only for the slave SQL thread */
SET @sql_thread = (SELECT thread_id FROM performance_schema.threads
                   WHERE name = 'thread/sql/slave_sql');
UPDATE performance_schema.threads
   SET instrumented = 'NO'
 WHERE thread_id != @sql_thread;
UPDATE performance_schema.setup_instruments
   SET enabled = 'YES', timed = 'YES';
UPDATE performance_schema.setup_consumers
   SET enabled = 'YES'
 WHERE name LIKE '%history_long' OR name LIKE '%current';
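
Optionally, a quick sanity check that the instrumentation setup took effect (a minimal check, not required by the repro):

/* Check which consumers and how many instruments are now enabled */
SELECT name, enabled FROM performance_schema.setup_consumers;
SELECT COUNT(*) AS enabled_instruments
  FROM performance_schema.setup_instruments
 WHERE enabled = 'YES';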

3. Start a bulk INSERT load on the MASTER; after some time, stop the MASTER, start it again, and resume the insert operation (a sketch of such a load follows the step 6 queries below).

4. Run START SLAVE on the SLAVE (if it is not already started).

5. Check SHOW SLAVE STATUS\G; Seconds_Behind_Master will keep increasing and, after some time, stall at a constant value.

6. Performance Schema event stage and statement information will show a large wait time for System lock:

/* Let the SQL thread run for a minute or so */

/* Get stage and statement raw details */
SELECT event_id, event_name, timer_wait, lock_time
  FROM performance_schema.events_statements_history_long
 WHERE thread_id = @sql_thread;

SELECT event_name, timer_wait, nesting_event_id
  FROM performance_schema.events_stages_history_long
 WHERE thread_id = @sql_thread;

/* Get some thread wait summary data */
SELECT event_name, count_star, sum_timer_wait
  FROM performance_schema.events_waits_summary_by_thread_by_event_name
 WHERE thread_id = @sql_thread
   AND sum_timer_wait > 0
 ORDER BY sum_timer_wait DESC;
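
For step 3, any sustained insert stream on the master will do; a minimal sketch (the repro.t1 table and batch size below are assumptions, not from the original test):

/* On the MASTER: create a throwaway table for the load */
CREATE DATABASE IF NOT EXISTS repro;
CREATE TABLE IF NOT EXISTS repro.t1 (
  id BIGINT AUTO_INCREMENT PRIMARY KEY,
  payload VARCHAR(255)
) ENGINE=InnoDB;

/* Repeat this batch from a client loop; performance_schema.setup_instruments
   is only used here as a convenient row generator */
INSERT INTO repro.t1 (payload)
SELECT REPEAT('x', 200)
  FROM performance_schema.setup_instruments
 LIMIT 1000;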

 

Lalit Choudhary July 30, 2018 at 11:54 AM
Edited

Hi Monty,

Thank you for the report. 

This issue is similar to upstream MySQL Bug #72131, where the slave finds it very hard to catch up and there is a large wait time for System lock.

Performance Schema event logs are attached, along with SHOW PROCESSLIST, SHOW ENGINE INNODB STATUS\G, and SHOW SLAVE STATUS\G output.

Tested with Percona Server 5.6.40.
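
An aggregated per-stage summary for the slave SQL thread shows the same thing more compactly; a small sketch, assuming @sql_thread was captured as in the reproduction steps above:

/* Time spent per stage by the slave SQL thread; if the issue is present,
   the System lock stage should dominate */
SELECT event_name, count_star,
       sum_timer_wait/1000000000000 AS wait_seconds
  FROM performance_schema.events_stages_summary_by_thread_by_event_name
 WHERE thread_id = @sql_thread
   AND sum_timer_wait > 0
 ORDER BY sum_timer_wait DESC;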

 

lpjirasync January 24, 2018 at 8:29 AM

Comment from Launchpad by monty solomon on 05-09-2015 03:56:12

Attaching my.cnf and 001-basic-config.cnf

my.cnf includes 001-basic-config.cnf

lpjirasync January 24, 2018 at 8:29 AM

Comment from Launchpad by Valerii Kravchuk on 04-09-2015 09:38:10

Please send your my.cnf file content from the slave.

Done

Details

Assignee

Reporter

Affects versions

Priority


Created January 24, 2018 at 8:29 AM
Updated December 25, 2024 at 9:54 AM
Resolved December 25, 2024 at 9:54 AM