Mutex deadlock around processlist, global status and global variables plus binlogs
Description
Environment
AFFECTED CS IDs
Attachments
- 04 Aug 2021, 02:42 AM
- 04 Aug 2021, 01:21 AM
Activity

Marcelo Altmann August 4, 2021 at 3:02 PM
I've analyzed the threads executing a single code path (multiple threads executing the same stack can be waiting on the same mutex, but not holding the same one)
I've got this:
1 __lll_lock_wait(libpthread.so.0),_L_lock_883(libpthread.so.0),pthread_mutex_lock(libpthread.so.0),native_mutex_lock(thr_mutex.h:91),my_mutex_lock(thr_mutex.h:91),inline_mysql_mutex_lock(thr_mutex.h:91),MYSQL_BIN_LOG::change_stage(thr_mutex.h:91),MYSQL_BIN_LOG::ordered_commit(binlog.cc:10278),MYSQL_BIN_LOG::commit(binlog.cc:9519),ha_commit_trans(handler.cc:1839),trans_commit(transaction.cc:246),mysql_execute_command(sql_parse.cc:4563),mysql_parse(sql_parse.cc:5927),dispatch_command(sql_parse.cc:1539),do_command(sql_parse.cc:1060),handle_connection(connection_handler_per_thread.cc:325),pfs_spawn_thread(pfs.cc:2198),start_thread(libpthread.so.0),clone(libc.so.6)
waits on LOCK_log at binlog.cc:9846
1 __lll_lock_wait(libpthread.so.0),_L_lock_1093(libpthread.so.0),pthread_mutex_lock(libpthread.so.0),native_mutex_lock(thr_mutex.h:91),my_mutex_lock(thr_mutex.h:91),inline_mysql_mutex_lock(thr_mutex.h:91),table_global_status::get_row_count(thr_mutex.h:91),ha_perfschema::info(ha_perfschema.cc:401),JOIN::init_planner_arrays(sql_optimizer.cc:5290),JOIN::make_join_plan(sql_optimizer.cc:5112),JOIN::optimize(sql_optimizer.cc:394),st_select_lex::optimize(sql_select.cc:1018),st_select_lex_unit::optimize(sql_union.cc:711),TABLE_LIST::optimize_derived(sql_derived.cc:215),JOIN::optimize(sql_optimizer.cc:209),st_select_lex::optimize(sql_select.cc:1018),handle_query(sql_select.cc:172),execute_sqlcom_select(sql_parse.cc:5490),mysql_execute_command(sql_parse.cc:2964),mysql_parse(sql_parse.cc:5927),dispatch_command(sql_parse.cc:1539),do_command(sql_parse.cc:1060),handle_connection(connection_handler_per_thread.cc:325),pfs_spawn_thread(pfs.cc:2198),start_thread(libpthread.so.0),clone(libc.so.6)
waits on LOCK_status at table_global_status.cc:92
1 __lll_lock_wait(libpthread.so.0),_L_lock_1093(libpthread.so.0),pthread_mutex_lock(libpthread.so.0),native_mutex_lock(thr_mutex.h:91),my_mutex_lock(thr_mutex.h:91),inline_mysql_mutex_lock(thr_mutex.h:91),System_variable::init(thr_mutex.h:91),PFS_system_variable_cache::do_materialize_session(pfs_variable.cc:411),materialize_session(pfs_variable.h:592),table_variables_by_thread::rnd_pos(pfs_variable.h:592),ha_perfschema::rnd_pos(ha_perfschema.cc:389),handler::ha_rnd_pos(handler.cc:3199),rr_from_tempfile(records.cc:551),sub_select(sql_executor.cc:1287),do_select(sql_executor.cc:957),JOIN::exec(sql_executor.cc:957),handle_query(sql_select.cc:192),execute_sqlcom_select(sql_parse.cc:5490),mysql_execute_command(sql_parse.cc:3016),mysql_parse(sql_parse.cc:5927),dispatch_command(sql_parse.cc:1539),do_command(sql_parse.cc:1060),handle_connection(connection_handler_per_thread.cc:325),pfs_spawn_thread(pfs.cc:2198),start_thread(libpthread.so.0),clone(libc.so.6)
waits on LOCK_thd_sysvar at pfs_variable.cc:542
1 __lll_lock_wait(libpthread.so.0),_L_lock_1093(libpthread.so.0),pthread_mutex_lock(libpthread.so.0),native_mutex_lock(thr_mutex.h:91),my_mutex_lock(thr_mutex.h:91),inline_mysql_mutex_lock(thr_mutex.h:91),MYSQL_BIN_LOG::publish_coordinates_for_global_status(thr_mutex.h:91),MYSQL_BIN_LOG::ordered_commit(binlog.cc:10341),MYSQL_BIN_LOG::commit(binlog.cc:9519),ha_commit_trans(handler.cc:1839),trans_commit(transaction.cc:246),mysql_execute_command(sql_parse.cc:4563),mysql_parse(sql_parse.cc:5927),dispatch_command(sql_parse.cc:1539),do_command(sql_parse.cc:1060),handle_connection(connection_handler_per_thread.cc:325),pfs_spawn_thread(pfs.cc:2198),start_thread(libpthread.so.0),clone(libc.so.6)
waits for LOCK_status at binlog.cc:10676
1 __lll_lock_wait(libpthread.so.0),_L_lock_1093(libpthread.so.0),pthread_mutex_lock(libpthread.so.0),native_mutex_lock(thr_mutex.h:91),my_mutex_lock(thr_mutex.h:91),inline_mysql_mutex_lock(thr_mutex.h:91),MYSQL_BIN_LOG::commit(binlog.cc:9248),ha_commit_trans(handler.cc:1839),trans_commit(transaction.cc:246),fix_autocommit(sys_vars.cc:4811),sys_var::update(set_var.cc:204),set_var::update(set_var.cc:843),sql_set_variables(set_var.cc:679),mysql_execute_command(sql_parse.cc:3944),mysql_parse(sql_parse.cc:5927),dispatch_command(sql_parse.cc:1539),do_command(sql_parse.cc:1060),handle_connection(connection_handler_per_thread.cc:325),pfs_spawn_thread(pfs.cc:2198),start_thread(libpthread.so.0),clone(libc.so.6)
waits on LOCK_thd_data at binlog.cc:9248 (this is changing a session/global variable via SET)
1 __lll_lock_wait(libpthread.so.0),_L_lock_1093(libpthread.so.0),pthread_mutex_lock(libpthread.so.0),native_mutex_lock(thr_mutex.h:91),my_mutex_lock(thr_mutex.h:91),inline_mysql_mutex_lock(thr_mutex.h:91),Log_in_use::operator()(thr_mutex.h:91),operator(mysqld_thd_manager.cc:53),for_each<THD**,(mysqld_thd_manager.cc:53),Global_THD_manager::do_for_all_thd(mysqld_thd_manager.cc:53),log_in_use(binlog.cc:3008),MYSQL_BIN_LOG::purge_logs_before_date(binlog.cc:3008),MYSQL_BIN_LOG::purge(binlog.cc:8191),MYSQL_BIN_LOG::ordered_commit(binlog.cc:10508),MYSQL_BIN_LOG::commit(binlog.cc:9519),ha_commit_trans(handler.cc:1839),trans_commit(transaction.cc:246),mysql_execute_command(sql_parse.cc:4563),mysql_parse(sql_parse.cc:5927),dispatch_command(sql_parse.cc:1539),do_command(sql_parse.cc:1060),handle_connection(connection_handler_per_thread.cc:325),pfs_spawn_thread(pfs.cc:2198),start_thread(libpthread.so.0),clone(libc.so.6)
waits on LOCK_thd_data at binlog.cc:2982
1 __lll_lock_wait(libpthread.so.0),_L_lock_1093(libpthread.so.0),pthread_mutex_lock(libpthread.so.0),native_mutex_lock(thr_mutex.h:91),my_mutex_lock(thr_mutex.h:91),inline_mysql_mutex_lock(thr_mutex.h:91),List_process_list::operator()(thr_mutex.h:91),operator(mysqld_thd_manager.cc:53),for_each<THD**,(mysqld_thd_manager.cc:53),Global_THD_manager::do_for_all_thd_copy(mysqld_thd_manager.cc:53),mysqld_list_processes(sql_show.cc:2445),mysql_execute_command(sql_parse.cc:3875),mysql_parse(sql_parse.cc:5927),dispatch_command(sql_parse.cc:1539),do_command(sql_parse.cc:1060),handle_connection(connection_handler_per_thread.cc:325),pfs_spawn_thread(pfs.cc:2198),start_thread(libpthread.so.0),clone(libc.so.6)
waits on LOCK_thd_data at sql_show.cc:2336
1 __lll_lock_wait(libpthread.so.0),_L_lock_1093(libpthread.so.0),pthread_mutex_lock(libpthread.so.0),native_mutex_lock(thr_mutex.h:91),my_mutex_lock(thr_mutex.h:91),inline_mysql_mutex_lock(thr_mutex.h:91),Global_THD_manager::do_for_all_thd(thr_mutex.h:91),PFS_connection_iterator::visit_global(pfs_visitor.cc:123),PFS_status_variable_cache::do_materialize_global(pfs_variable.cc:970),materialize_global(pfs_variable.h:525),table_global_status::rnd_init(pfs_variable.h:525),ha_perfschema::rnd_init(ha_perfschema.cc:331),handler::ha_rnd_init(handler.cc:3109),init_read_record(records.cc:322),join_init_read_record(sql_executor.cc:2503),sub_select(sql_executor.cc:1284),do_select(sql_executor.cc:957),JOIN::exec(sql_executor.cc:957),TABLE_LIST::materialize_derived(sql_derived.cc:333),join_materialize_derived(sql_executor.cc:2525),QEP_TAB::prepare_scan(sql_executor.cc:1338),sub_select(sql_executor.cc:1238),do_select(sql_executor.cc:957),JOIN::exec(sql_executor.cc:957),handle_query(sql_select.cc:192),execute_sqlcom_select(sql_parse.cc:5490),mysql_execute_command(sql_parse.cc:2964),mysql_parse(sql_parse.cc:5927),dispatch_command(sql_parse.cc:1539),do_command(sql_parse.cc:1060),handle_connection(connection_handler_per_thread.cc:325),pfs_spawn_thread(pfs.cc:2198),start_thread(libpthread.so.0),clone(libc.so.6)
waits on LOCK_thd_list
This shows which mutex are involved here and what each thread waits.
Those threads also hold a mutex that another thread is waiting for. However, we have loops such as Global_THD_manager::do_for_all_thd
which will loop through all threads. This will stuck in the middle of the loop, making it impossible to know via pt-pmp which threads have already had their LOCK_thd_data taken and which ones we are waiting to.
It has similar locks involved as https://bugs.mysql.com/bug.php?id=92108
Workaround:
We see a thread doing binlog rotation at commit stage, increasing the binlog expire days can delay the problem (like have it to max value during pick hours, and reduce it to normal value during off-hours so roration can happen).
In summary, this meets the criteria for an internal deadlock.
We require a coredump in order to conduct a full analyzes of this case.
The deadlock somewhat resembles https://bugs.mysql.com/bug.php?id=92108, but is certainly not the same. The binlog was being rotated and other threads are processing SHOW PROCESSLIST / SHOW GLOB VARIABLES / SHOW GLOBAL STATUS and also querying performance schema per-user variables. See attached
and
for full details.
The issue happens during period of very heavy load.