Deadlock with concurrent START SLAVE and backup using PXB

Description

The server can enter a deadlock when START SLAVE is issued right as a PXB process starts taking a backup of the server.

It looks like the locks are acquired in opposite order during:

  1. START SLAVE and

  2. the query SELECT server_uuid, local, replication, storage_engines FROM performance_schema.log_status executed by PXB (a simplified sketch of this inversion follows the list).
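
A minimal sketch of what "acquired in opposite order" means, using plain std::mutex stand-ins named after channel_map_lock and rli->run_lock. This only illustrates the ABBA pattern; it is not the server's actual code path, and running it is expected to hang by design.

// Illustrative only: two mutexes standing in for channel_map_lock and
// rli->run_lock, taken in opposite order by two threads. With the sleeps,
// both threads end up blocked on each other, which is the inversion the
// description above refers to.
#include <chrono>
#include <mutex>
#include <thread>

std::mutex channel_map_lock_model;  // stand-in for channel_map_lock
std::mutex run_lock_model;          // stand-in for rli->run_lock

void start_slave_path() {
  // Models the START SLAVE side: channel map lock first, then the run lock.
  std::lock_guard<std::mutex> a(channel_map_lock_model);
  std::this_thread::sleep_for(std::chrono::milliseconds(100));
  std::lock_guard<std::mutex> b(run_lock_model);
}

void log_status_path() {
  // Models the log_status query side: it effectively needs the locks in the
  // opposite order, because it queues behind a thread that already holds the
  // run lock.
  std::lock_guard<std::mutex> a(run_lock_model);
  std::this_thread::sleep_for(std::chrono::milliseconds(100));
  std::lock_guard<std::mutex> b(channel_map_lock_model);
}

int main() {
  std::thread t1(start_slave_path);
  std::thread t2(log_status_path);
  t1.join();  // both joins never return once the threads deadlock
  t2.join();
}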

Processlist info:

 

mysql> show processlist;
+----+--------------------+-----------------+------+---------+------+-----------------------------------------+---------------------------------------------------------------------------------------------+---------+-----------+---------------+
| Id | User               | Host            | db   | Command | Time | State                                   | Info                                                                                          | Time_ms | Rows_sent | Rows_examined |
+----+--------------------+-----------------+------+---------+------+-----------------------------------------+---------------------------------------------------------------------------------------------+---------+-----------+---------------+
|  1 | system user        |                 | NULL | Sleep   |   71 | wsrep aborter idle                      | NULL                                                                                          |   71420 |         0 |             0 |
|  2 | system user        |                 | NULL | Sleep   |   71 | innobase_commit_low (-1)                | NULL                                                                                          |   71420 |         0 |             0 |
|  7 | event_scheduler    | localhost       | NULL | Daemon  |   69 | Waiting on empty queue                  | NULL                                                                                          |   68561 |         0 |             0 |
| 11 | root               | localhost:33142 | NULL | Query   |   60 | Waiting for replica thread to start     | START REPLICA                                                                                 |   60010 |         0 |             0 |
| 14 | system user        | connecting host | NULL | Connect |   60 | Connecting to source                    | NULL                                                                                          |   60004 |         0 |             0 |
| 15 | system user        | connecting host | NULL | Query   |   60 | Waiting for the next event in relay log | NULL                                                                                          |   60004 |         0 |             0 |
| 16 | mysql.pxc.sst.user | localhost       | NULL | Query   |   53 | executing                               | SELECT server_uuid, local, replication, storage_engines FROM performance_schema.log_status   |   53019 |         0 |             0 |
| 18 | root               | localhost:45368 | NULL | Query   |    0 | init                                    | show processlist                                                                              |       0 |         0 |             0 |
+----+--------------------+-----------------+------+---------+------+-----------------------------------------+---------------------------------------------------------------------------------------------+---------+-----------+---------------+
8 rows in set (0.00 sec)

This issue was seen on PXC, so the query running as the mysql.pxc.sst.user user can be seen stuck for 53 seconds, while START SLAVE has been stuck for 60 seconds.

Stacktrace file:

Environment

None

Attachments

  • 09 May 2023, 06:28 AM

Activity


Venkatesh Prasad May 9, 2023 at 8:19 AM

Venkatesh Prasad May 9, 2023 at 7:41 AM

Deadlock summary:

  • The SQL thread is holding rli->run_lock and is waiting for s_synced inside wait_until_state(), which returns only after a successful SST. But the SST is stuck waiting for the results of the log_status query.

  • The IO thread is waiting for mi->run_lock in handle_slave_io(), but mi->run_lock is held by START SLAVE.

  • The START SLAVE thread is waiting for rli->run_lock (lock_cond_sql) in start_slave_thread(), which is held by the SQL thread.

  • The LOG_STATUS query is waiting for channel_map_lock->wrlock(), but that lock is held by the START SLAVE thread.

SQL thread:
holds: rli->run_lock, acquired at line 7128 of rpl_replica.cc
waits for: SST/PXB to complete (s_synced)

IO thread:
holds: nothing
waits for: mi->run_lock, held by START SLAVE

START SLAVE:
holds:
1. mi->run_lock, reacquired after starting the IO thread; normally released in unlock_slave_threads() at the end of start_slave()
2. channel_map_lock, acquired in start_slave_cmd() at line 747 of rpl_replica.cc
waits for: rli->run_lock, held by SQL thread

SST/PXB:
holds: nothing
waits for: channel_map_lock, held by START SLAVE (the full wait-for cycle is sketched below)
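
The four entries above form a wait-for cycle (the IO thread is blocked by it but sits outside the cycle). The following small, self-contained C++ sketch simply encodes these edges from the summary and walks them to print the cycle; the thread and lock names come from the analysis above, everything else is illustrative.

// Encodes the waits-for relationships listed above and walks them until a
// participant repeats, printing the deadlock cycle. This is a model of the
// analysis only; it does not touch any real server locks.
#include <iostream>
#include <map>
#include <set>
#include <string>

int main() {
  // "X waits for Y" edges, as described in the summary.
  std::map<std::string, std::string> waits_for = {
      {"SQL thread",  "SST/PXB (s_synced, signalled only after log_status returns)"},
      {"SST/PXB",     "START SLAVE (holds channel_map_lock)"},
      {"START SLAVE", "SQL thread (holds rli->run_lock)"},
      {"IO thread",   "START SLAVE (holds mi->run_lock)"},  // blocked, but not part of the cycle
  };

  // Walk from the SQL thread until a participant repeats: that is the cycle.
  std::set<std::string> seen;
  std::string node = "SQL thread";
  while (seen.insert(node).second) {
    const std::string& next = waits_for.at(node);
    std::cout << node << " -> " << next << '\n';
    node = next.substr(0, next.find(" ("));  // strip the annotation to get the next node
  }
  std::cout << "cycle closes at: " << node << '\n';
}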

Done

Details

Assignee

Reporter

Needs Review

Yes

Needs QA

No

Affects versions

Priority

Smart Checklist

Created May 8, 2023 at 11:05 AM
Updated March 6, 2024 at 8:41 PM
Resolved August 21, 2023 at 12:22 PM