Lock leak in PXC could result in deadlock

Description

A lock leak could occur in PXC 5.7.

Environment

It was also observed that a PXC node acting as a replica could enter a deadlock due to the lock leak, where the applier thread gets stuck on "Waiting for slave mutex on exit":

mysql> show processlist;
+----+-------------+-----------------+------+---------+------+-------------------------------------+-------------------+---------+-----------+---------------+
| Id | User        | Host            | db   | Command | Time | State                               | Info              | Time_ms | Rows_sent | Rows_examined |
+----+-------------+-----------------+------+---------+------+-------------------------------------+-------------------+---------+-----------+---------------+
|  1 | system user |                 | NULL | Sleep   |  709 | wsrep: aborter idle                 | NULL              |  709726 |         0 |             0 |
|  2 | system user |                 | NULL | Sleep   |  150 | wsrep: committed write set (138302) | NULL              |  135824 |         0 |             0 |
| 13 | root        | localhost:35282 | test | Query   |  195 | checking permissions                | start slave       |  195292 |         0 |             0 |
| 14 | system user |                 | NULL | Connect |  381 | Waiting for master to send event    | NULL              |  381007 |         0 |             0 |
| 15 | system user |                 | NULL | Connect |  240 | Waiting for slave mutex on exit     | NULL              |  240706 |         0 |             0 |
| 20 | root        | localhost:49788 | test | Query   |   11 | starting                            | show slave status |   11327 |         0 |             0 |
| 23 | root        | localhost:58280 | test | Query   |    0 | starting                            | show processlist  |       0 |         0 |             0 |
+----+-------------+-----------------+------+---------+------+-------------------------------------+-------------------+---------+-----------+---------------+
7 rows in set (0.00 sec)

When inspected in gdb, we can see that the applier thread is waiting for the lock:

(gdb) t
[Current thread is 32 (Thread 0x7ecc100ee640 (LWP 367680))]
(gdb) bt
#0  futex_wait (private=0, expected=2, futex_word=0x5f910a753278) at ../sysdeps/nptl/futex-internal.h:146
#1  __GI___lll_lock_wait (futex=futex@entry=0x5f910a753278, private=0) at ./nptl/lowlevellock.c:49
#2  0x00007ecc18e98167 in ___pthread_mutex_lock (mutex=0x5f910a753278) at ./nptl/pthread_mutex_lock.c:145
#3  0x00005f91086e8fa1 in native_mutex_lock (mutex=0x5f910a753278) at /home/venki/work/pxc/merge/57/include/thr_mutex.h:91
#4  0x00005f91086e8fd5 in my_mutex_lock (mp=0x5f910a753278) at /home/venki/work/pxc/merge/57/include/thr_mutex.h:189
#5  0x00005f91086e912e in inline_mysql_mutex_lock (that=0x5f910a753278, src_file=0x5f91091f1938 "/home/venki/work/pxc/merge/57/sql/rpl_slave.cc", src_line=8037) at /home/venki/work/pxc/merge/57/include/mysql/psi/mysql_thread.h:718
#6  0x00005f9108701e84 in handle_slave_sql (arg=0x5f910a6fd880) at /home/venki/work/pxc/merge/57/sql/rpl_slave.cc:8037
#7  0x00005f910878d8ea in pfs_spawn_thread (arg=0x7ecb8823e540) at /home/venki/work/pxc/merge/57/storage/perfschema/pfs.cc:2198
#8  0x00007ecc18e94ac3 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
#9  0x00007ecc18f26850 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81

Printing the mutex to identify its owner shows that it is held by thread 367680:

(gdb) p *mutex
$5 = {__data = {__lock = 2, __count = 0, __owner = 367680, __nusers = 1, __kind = 3, __spins = 0, __elision = 0, __list = {__prev = 0x0, __next = 0x0}},
  __size = "\002\000\000\000\000\000\000\000@\234\005\000\001\000\000\000\003", '\000' <repeats 22 times>, __align = 2}

But thread 367680 is none other than the applier thread itself:

(gdb) t
[Current thread is 32 (Thread 0x7ecc100ee640 (LWP 367680))]
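
To make the failure mode concrete, here is a minimal standalone sketch (illustrative only, not PXC code): re-locking a default, non-recursive pthread mutex from the thread that already owns it blocks forever, which is exactly the state the applier thread is in above.

#include <cstdio>
#include <pthread.h>

int main() {
  // Default (non-recursive) mutex, comparable to the mutex behind rli->data_lock.
  pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;

  pthread_mutex_lock(&m);      // first acquisition succeeds; owner = this thread
  std::printf("lock acquired once\n");

  // If an earlier code path returns without unlocking (a "lock leak"), the
  // next acquisition attempt by the same thread never returns: the thread
  // waits on a mutex whose __owner is itself, i.e. a self-deadlock.
  pthread_mutex_lock(&m);      // hangs here on glibc
  std::printf("never reached\n");
  return 0;
}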

 

 

Steps to reproduce: the same steps as for https://perconadev.atlassian.net/browse/PXC-4404 can be used to reproduce this bug.

  1. Deploy an 8-node 5.7.44 cluster.

  2. Set up node1 as an async replica of an external 5.7.44 server. Make sure to set wsrep_preordered=ON.

  3. Start a write-only sysbench workload against the “test” db on node8.

  4. Start a write-only sysbench workload against the “test2” db on the external async source node.

  5. Force node2 of the cluster to “pause” frequently with:

kill -SIGSTOP $PID_OF_NODE2; sleep 5; kill -SIGCONT $PID_OF_NODE2;
kill -SIGSTOP $PID_OF_NODE2; sleep 5; kill -SIGCONT $PID_OF_NODE2;
kill -SIGSTOP $PID_OF_NODE2; sleep 5; kill -SIGCONT $PID_OF_NODE2;


Activity

Venkatesh Prasad July 29, 2024 at 3:17 PM
Edited

Found the root cause: the function apply_event_and_update_pos() is expected to release rli->data_lock. Ref: here

@sql/rpl_slave.cc in exec_relay_log_event() function
===
  exec_res= apply_event_and_update_pos(ptr_ev, thd, rli);
  /*
    Note: the above call to apply_event_and_update_pos executes
    mysql_mutex_unlock(&rli->data_lock);
  */
===

However, the unlock() call is missing on the error paths taken when wsrep_preordered encounters a replication error. This leaks the lock and causes the deadlock described above.
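
For illustration only, a minimal sketch of the locking contract described above (hypothetical names; the real code lives in sql/rpl_slave.cc and uses mysql_mutex_* calls): the caller acquires the lock, the callee must release it on every return path, and an early error return that skips the unlock leaks the lock.

#include <mutex>

std::mutex data_lock;                    // stands in for rli->data_lock

// Contract: the caller locks data_lock, this function unlocks it on ALL paths.
int apply_event_and_update_pos_sketch(bool preordered_error) {
  if (preordered_error) {
    // Without the next line the function would return with the mutex still
    // held -- the lock leak; the applier later re-locks and self-deadlocks.
    data_lock.unlock();
    return 1;                            // error
  }
  data_lock.unlock();                    // normal path releases the lock
  return 0;                              // success
}

int main() {
  data_lock.lock();                      // caller side, as in exec_relay_log_event()
  return apply_event_and_update_pos_sketch(/*preordered_error=*/true);
}

In the server code the same effect is achieved by adding mysql_mutex_unlock(&rli->data_lock) before each early DBUG_RETURN on the preordered error paths, which is what the patch below does.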

 

The patch below fixes the problem:

diff --git a/sql/rpl_slave.cc b/sql/rpl_slave.cc
index f094f66ba0e..2aa27757ed6 100644
--- a/sql/rpl_slave.cc
+++ b/sql/rpl_slave.cc
@@ -4937,6 +4937,7 @@ apply_event_and_update_pos(Log_event** ptr_ev, THD* thd, Relay_log_info* rli)
       {
         WSREP_WARN("failed to cancel preordered write set");
       }
+      mysql_mutex_unlock(&rli->data_lock);
       DBUG_RETURN(SLAVE_APPLY_EVENT_AND_UPDATE_POS_APPLY_ERROR);
     }
   }
@@ -4947,6 +4948,7 @@ apply_event_and_update_pos(Log_event** ptr_ev, THD* thd, Relay_log_info* rli)
                                        &data, 1, 1)) != WSREP_OK)
     {
       WSREP_ERROR("wsrep preordered collect failed: %d", err);
+      mysql_mutex_unlock(&rli->data_lock);
       DBUG_RETURN(SLAVE_APPLY_EVENT_AND_UPDATE_POS_APPLY_ERROR);
     }
@@ -4966,10 +4968,12 @@ apply_event_and_update_pos(Log_event** ptr_ev, THD* thd, Relay_log_info* rli)
       thd->wsrep_po_cnt= 0;
       wsrep_uuid_t source;
       memcpy(source.data, thd->wsrep_po_sid.bytes, sizeof(source.data));
       if ((err= wsrep->preordered_commit(wsrep, &thd->wsrep_po_handle,
                                          &source, flags, 1, true)) != WSREP_OK)
       {
         WSREP_ERROR("Failed to commit preordered event: %d", err);
+        mysql_mutex_unlock(&rli->data_lock);
         DBUG_RETURN(SLAVE_APPLY_EVENT_AND_UPDATE_POS_APPLY_ERROR);
       }
     }

Venkatesh Prasad July 29, 2024 at 9:36 AM

The same steps as for https://perconadev.atlassian.net/browse/PXC-4404 can be used to reproduce this bug.

  1. Deploy an 8-node 5.7.44 cluster.

  2. Set up node1 as an async replica of an external 5.7.44 server. Make sure to set wsrep_preordered=ON.

  3. Start a write-only sysbench workload against the “test” db on node8.

  4. Start a write-only sysbench workload against the “test2” db on the external async source node.

  5. Force node2 of the cluster to “pause” frequently with:

kill -SIGSTOP $PID_OF_NODE2; sleep 5; kill -SIGCONT $PID_OF_NODE2;
kill -SIGSTOP $PID_OF_NODE2; sleep 5; kill -SIGCONT $PID_OF_NODE2;
kill -SIGSTOP $PID_OF_NODE2; sleep 5; kill -SIGCONT $PID_OF_NODE2;
Won't Do

Details

Assignee

Reporter

Needs QA

Yes

Affects versions

Priority


Created July 29, 2024 at 8:55 AM
Updated October 2, 2024 at 7:39 AM
Resolved October 2, 2024 at 7:39 AM
