Server crashes during shutdown
Activity

Marcelo Altmann November 8, 2023 at 1:18 PM
This is a duplicate of https://jira.percona.com/browse/PS-8174 .
The bug has been fixed and the fix sent upstream.

Marcelo Altmann August 13, 2022 at 12:16 AM (edited)
Server is crashing at shutdown, specifically in buf_all_freed_instance. This function ensures that all pages can be replaced (i.e., no thread is still waiting on any of them). The workflow is as follows:
buf_all_freed -> buf_all_freed_instance -> buf_flush_ready_for_replace
We traverse all buffer pool instances and check each page in the LRU list to ensure that its io_fix == BUF_IO_NONE, oldest_modification == 0 and buf_fix_count == 0.
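For reference, those three conditions map onto a simple predicate over a buf_page_t pointer. A minimal sketch using gdb's Python Value API (the helper name is ours, not something from the server source):

import gdb

def page_ready_for_replace(page):
    """Check the buf_flush_ready_for_replace() conditions on a buf_page_t*.

    page is a gdb.Value holding a buf_page_t*; gdb dereferences the
    pointer automatically when a field is subscripted.
    """
    io_fix_none = str(page['io_fix']) == 'BUF_IO_NONE'
    not_dirty = int(page['oldest_modification']) == 0
    not_fixed = int(page['buf_fix_count']) == 0
    return io_fix_none and not_dirty and not_fixed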
In the coredump provided we can see that we have 4 buffer pool (BP) instances. Using the Python script below, we can traverse the pages and check which BP instance has any page that is not ready to be freed:
import gdb

class printLRUList(gdb.Command):
    """Traverse the LRU list of a given buffer pool instance and check
    that all pages are ready to be freed. Print the pages that are not
    in a free state."""

    def __init__(self):
        super(printLRUList, self).__init__("printlrulist", gdb.COMMAND_USER)

    def invoke(self, arg, from_tty):
        args = arg.split()
        if len(args) != 1:
            print("Please pass a buffer pool instance as parameter")
            return None
        buf_pool = gdb.execute("p &" + arg, to_string=True).split("=")[1].strip()
        print("Buffer Pool Instance: " + buf_pool)
        if "(buf_pool_t *)" not in buf_pool:
            print(arg + " is not of type buf_pool_t")
            return None
        raw_number_of_pages = gdb.execute("p (" + buf_pool + ")->LRU->count", to_string=True)
        number_of_pages = raw_number_of_pages.split("=")[1].strip()
        print("Total Pages in LRU: ")
        print(number_of_pages)
        # Walk the LRU list from its head, checking the three
        # buf_flush_ready_for_replace() conditions on every page.
        gdb.execute("set $lru_page=(" + buf_pool + ")->LRU->start", to_string=True)
        while True:
            io_fix_none = gdb.execute("p $lru_page.io_fix", to_string=True).endswith("BUF_IO_NONE\n")
            # Compare the printed value exactly, so that values such as
            # 10 or 100 are not mistaken for 0.
            oldest_modification = gdb.execute("p $lru_page.oldest_modification", to_string=True).split("=")[1].strip() == "0"
            buf_fix_count = gdb.execute("p $lru_page.buf_fix_count", to_string=True).split("=")[1].strip() == "0"
            if not io_fix_none or not oldest_modification or not buf_fix_count:
                data = gdb.execute("p $lru_page", to_string=True)
                print("Found page with IO fix issue:")
                print(data)
            # Stop when LRU.next is a null pointer (end of the list).
            if gdb.execute("p $lru_page.LRU.next", to_string=True).endswith("0x0\n"):
                break
            gdb.execute("set $lru_page=$lru_page.LRU.next", to_string=True)

printLRUList()
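A note on the approach: the command drives everything through gdb.execute() output and string matching, which mirrors the interactive commands it automates and works fine against a coredump; the same walk could also be written with gdb.parse_and_eval() and gdb.Value field accesses, as in the predicate sketched above.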
On GDB:
(gdb) p srv_buf_pool_instances
+p srv_buf_pool_instances
$1294173 = 4
(gdb) printlrulist buf_pool_ptr[0]
+printlrulist buf_pool_ptr[0]
Buffer Pool Instance: (buf_pool_t *) 0x34300d8
Total Pages in LRU:
80882
(gdb) printlrulist buf_pool_ptr[1]
+printlrulist buf_pool_ptr[1]
Buffer Pool Instance: (buf_pool_t *) 0x3430980
Total Pages in LRU:
80883
Found page with IO fix issue:
++p $lru_page
$1941237 = (buf_page_t *) 0x7f26d091db00
(gdb) printlrulist buf_pool_ptr[2]
+printlrulist buf_pool_ptr[2]
Buffer Pool Instance: (buf_pool_t *) 0x3431228
Total Pages in LRU:
80882
(gdb) printlrulist buf_pool_ptr[3]
+printlrulist buf_pool_ptr[3]
Buffer Pool Instance: (buf_pool_t *) 0x3431ad0
Total Pages in LRU:
80883
We can see that instance 1 has one page that is not ready to be freed, while all the other instances have all of their 80k+ pages in a free state.
Checking the page in question, we can see that it still has buf_fix_count == 1, while the other two conditions are fine:
(gdb) p ((buf_page_t *) 0x7f26d091db00)->io_fix
$4 = BUF_IO_NONE
(gdb) p ((buf_page_t *) 0x7f26d091db00)->oldest_modification
$5 = 0
(gdb) p ((buf_page_t *) 0x7f26d091db00)->buf_fix_count
$6 = 1
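Reusing the page_ready_for_replace() sketch from above, the same verdict can be obtained programmatically (the address comes from the printlrulist output for instance 1):

import gdb

# Suspect page flagged by printlrulist in buffer pool instance 1.
page = gdb.parse_and_eval("(buf_page_t *) 0x7f26d091db00")
print(page_ready_for_replace(page))  # False: buf_fix_count is still 1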
buf_fix_count is decremented when a mini-transaction (mtr) commits, just as in the fix from the bug report mentioned above. However, we are not facing the same issue here, because the page in question is not a temporary page: it belongs to space 64, while the temporary tablespace has ID 62:
(gdb) p ((buf_page_t *) 0x7f26d091db00)->id
+p ((buf_page_t *) 0x7f26d091db00)->id
$2588318 = {
m_space = 64,
m_page_no = 333376,
m_fold = 67442304
}
(gdb) p srv_tmp_space.m_space_id
+p srv_tmp_space.m_space_id
$2588319 = 62
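That comparison can likewise be scripted from the session, using the fields shown above (m_space inside the page id versus srv_tmp_space.m_space_id):

import gdb

page = gdb.parse_and_eval("(buf_page_t *) 0x7f26d091db00")
space_id = int(page['id']['m_space'])
tmp_space_id = int(gdb.parse_and_eval("srv_tmp_space.m_space_id"))
# False: space 64 is a regular tablespace, not the temporary one (62).
print(space_id == tmp_space_id)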
We have only 3 groups of threads running:
(gdb) info thread
Id Target Id Frame
* 1 LWP 9773 0x000000000105ec27 in logs_empty_and_mark_files_at_shutdown () at /usr/src/debug/Percona-XtraDB-Cluster-5.7.34-31.51/storage/innobase/log/log0log.cc:2790
2 LWP 10897 0x00007f278f92f644 in __io_getevents_0_4 () from /bigdisk/CS0029155/tmp/libaio.so.1
3 LWP 9992 0x00007f278dc3e58a in sigwaitinfo () from /bigdisk/CS0029155/tmp/libc.so.6
4 LWP 10900 0x00007f278f92f644 in __io_getevents_0_4 () from /bigdisk/CS0029155/tmp/libaio.so.1
5 LWP 10902 0x00007f278f92f644 in __io_getevents_0_4 () from /bigdisk/CS0029155/tmp/libaio.so.1
6 LWP 10896 0x00007f278f92f644 in __io_getevents_0_4 () from /bigdisk/CS0029155/tmp/libaio.so.1
7 LWP 10901 0x00007f278f92f644 in __io_getevents_0_4 () from /bigdisk/CS0029155/tmp/libaio.so.1
8 LWP 10905 0x00007f278f92f644 in __io_getevents_0_4 () from /bigdisk/CS0029155/tmp/libaio.so.1
9 LWP 10898 0x00007f278f92f644 in __io_getevents_0_4 () from /bigdisk/CS0029155/tmp/libaio.so.1
10 LWP 10903 0x00007f278f92f644 in __io_getevents_0_4 () from /bigdisk/CS0029155/tmp/libaio.so.1
11 LWP 10904 0x00007f278f92f644 in __io_getevents_0_4 () from /bigdisk/CS0029155/tmp/libaio.so.1
12 LWP 11285 0x00007f278fb3ca35 in pthread_cond_wait@@GLIBC_2.3.2 () from /bigdisk/CS0029155/tmp/libpthread.so.0
13 LWP 10899 0x00007f278f92f644 in __io_getevents_0_4 () from /bigdisk/CS0029155/tmp/libaio.so.1
IO threads (all idle), the InnoDB main thread and the mysqld main thread; no user threads. This shows that there is a code path through which we can leave a page with buf_fix_count incremented without decrementing it; however, it is not possible to reach a conclusion based on this coredump alone, as it shows the instance state after the fact.
Another important point is that this particular page came from the LRU list: it has BUF_IO_NONE, it is not dirty (has no modifications to it), and it is not part of the flush_list (the list that contains dirty pages).
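For completeness, the per-instance traversal can be driven across every buffer pool instance in one pass; a minimal sketch reusing the printlrulist command defined above (srv_buf_pool_instances and buf_pool_ptr are the globals shown in the session):

import gdb

# Run the LRU check once for each buffer pool instance.
n_instances = int(gdb.parse_and_eval("srv_buf_pool_instances"))
for i in range(n_instances):
    gdb.execute("printlrulist buf_pool_ptr[%d]" % i)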
Description
Server crashes while shutting down InnoDB; very similar to https://bugs.mysql.com/bug.php?id=95285, but that was fixed in 5.7.31 and the customer is running PXC 5.7.34.
Fix commit: https://github.com/mysql/mysql-server/commit/b161d236ede9e087642c35a28818ebe7636c27fc
And we can see the code is present in PXC 5.7.34: https://github.com/percona/percona-xtradb-cluster/blob/Percona-XtraDB-Cluster-5.7.34-31.51/storage/innobase/row/row0mysql.cc#L2720
Backtrace of the crashing thread:
(gdb) bt
#0  0x00007f278fb3daa1 in pthread_kill () from /bigdisk/CS0029155/tmp/libpthread.so.0
#1  0x0000000000d713a5 in handle_fatal_signal (sig=6) at /usr/src/debug/Percona-XtraDB-Cluster-5.7.34-31.51/sql/signal_handler.cc:267
#2  <signal handler called>
#3  0x00007f278dc3d387 in raise () from /bigdisk/CS0029155/tmp/libc.so.6
#4  0x00007f278dc3ea78 in abort () from /bigdisk/CS0029155/tmp/libc.so.6
#5  0x000000000078aa32 in ut_dbg_assertion_failed (expr=expr@entry=0x0, file=file@entry=0x162b430 "/mnt/workspace/percona-xtradb-cluster-5.7-redhat-binary/label/min-centos-7-x64/rpmbuild/BUILD/Percona-XtraDB-Cluster-5.7.34-31.51/storage/innobase/ut/ut0ut.cc", line=line@entry=924) at /usr/src/debug/Percona-XtraDB-Cluster-5.7.34-31.51/storage/innobase/ut/ut0dbg.cc:75
#6  0x00000000011ae5ed in ib::fatal::~fatal (this=0x0, __in_chrg=<optimized out>) at /usr/src/debug/Percona-XtraDB-Cluster-5.7.34-31.51/storage/innobase/ut/ut0ut.cc:924
#7  0x00000000011f25ea in buf_all_freed_instance (buf_pool=<optimized out>) at /usr/src/debug/Percona-XtraDB-Cluster-5.7.34-31.51/storage/innobase/buf/buf0buf.cc:6380
#8  0x00000000011f265c in buf_all_freed () at /usr/src/debug/Percona-XtraDB-Cluster-5.7.34-31.51/storage/innobase/buf/buf0buf.cc:7360
#9  0x000000000105ec27 in logs_empty_and_mark_files_at_shutdown () at /usr/src/debug/Percona-XtraDB-Cluster-5.7.34-31.51/storage/innobase/log/log0log.cc:2790
#10 0x00000000010800a8 in OSMutex::destroy (this=<optimized out>) at /usr/src/debug/Percona-XtraDB-Cluster-5.7.34-31.51/storage/innobase/include/sync0types.h:467