Server crashes during shutdown
Activity

Marcelo Altmann November 8, 2023 at 1:18 PM
This is a duplicate of https://jira.percona.com/browse/PS-8174 .
The bug has been fixed and the fix sent upstream.

Marcelo Altmann August 13, 2022 at 12:16 AM (edited)
Server is crashing at shutdown, specifically in buf_all_freed_instance. This function ensures that all pages can be replaced (i.e., no thread is still waiting on any of them). The workflow is as follows:
buf_all_freed -> buf_all_freed_instance -> buf_flush_ready_for_replace
We traverse all buffer pool instances and check each page in the LRU list to ensure that its io_fix == BUF_IO_NONE, oldest_modification == 0 and buf_fix_count == 0.
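For reference, those three conditions map onto a simple predicate over a buf_page_t pointer. A minimal sketch using gdb's Python Value API (the helper name is ours, not something from the server source):

import gdb

def page_ready_for_replace(page):
    """Check the buf_flush_ready_for_replace() conditions on a buf_page_t*.

    page is a gdb.Value holding a buf_page_t*; gdb dereferences the
    pointer automatically when a field is subscripted.
    """
    io_fix_none = str(page['io_fix']) == 'BUF_IO_NONE'
    not_dirty = int(page['oldest_modification']) == 0
    not_fixed = int(page['buf_fix_count']) == 0
    return io_fix_none and not_dirty and not_fixed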
In the coredump provided we can see that we have 4 buffer pool (BP) instances. Using the Python script below, we can traverse the pages and check which BP instance has any page that is not ready to be freed:
import gdb

class printLRUList(gdb.Command):
    """Traverse the LRU list of a given buffer pool instance and check
    that all pages are ready to be freed. Print the pages that are not
    in a free state."""

    def __init__(self):
        super(printLRUList, self).__init__("printlrulist", gdb.COMMAND_USER)

    def invoke(self, arg, from_tty):
        args = arg.split()
        if len(args) != 1:
            print("Please pass a buffer pool instance as parameter")
            return None
        buf_pool = gdb.execute("p &" + arg, to_string=True).split("=")[1].strip()
        print("Buffer Pool Instance: " + buf_pool)
        if "(buf_pool_t *)" not in buf_pool:
            print(arg + " is not of type buf_pool_t")
            return None
        raw_number_of_pages = gdb.execute("p (" + buf_pool + ")->LRU->count", to_string=True)
        number_of_pages = raw_number_of_pages.split("=")[1].strip()
        print("Total Pages in LRU: ")
        print(number_of_pages)
        # Walk the LRU list from its head, checking the three
        # buf_flush_ready_for_replace() conditions on every page.
        gdb.execute("set $lru_page=(" + buf_pool + ")->LRU->start", to_string=True)
        while True:
            io_fix_none = gdb.execute("p $lru_page.io_fix", to_string=True).endswith("BUF_IO_NONE\n")
            # Compare the printed value exactly, so that values such as
            # 10 or 100 are not mistaken for 0.
            oldest_modification = gdb.execute("p $lru_page.oldest_modification", to_string=True).split("=")[1].strip() == "0"
            buf_fix_count = gdb.execute("p $lru_page.buf_fix_count", to_string=True).split("=")[1].strip() == "0"
            if not io_fix_none or not oldest_modification or not buf_fix_count:
                data = gdb.execute("p $lru_page", to_string=True)
                print("Found page with IO fix issue:")
                print(data)
            # Stop when LRU.next is a null pointer (end of the list).
            if gdb.execute("p $lru_page.LRU.next", to_string=True).endswith("0x0\n"):
                break
            gdb.execute("set $lru_page=$lru_page.LRU.next", to_string=True)

printLRUList()
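A note on the approach: the command drives everything through gdb.execute() output and string matching, which mirrors the interactive commands it automates and works fine against a coredump; the same walk could also be written with gdb.parse_and_eval() and gdb.Value field accesses, as in the predicate sketched above.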
On GDB:
(gdb) p srv_buf_pool_instances
+p srv_buf_pool_instances
$1294173 = 4
(gdb) printlrulist buf_pool_ptr[0]
+printlrulist buf_pool_ptr[0]
Buffer Pool Instance: (buf_pool_t *) 0x34300d8
Total Pages in LRU:
80882
(gdb) printlrulist buf_pool_ptr[1]
+printlrulist buf_pool_ptr[1]
Buffer Pool Instance: (buf_pool_t *) 0x3430980
Total Pages in LRU:
80883
Found page with IO fix issue:
++p $lru_page
$1941237 = (buf_page_t *) 0x7f26d091db00
(gdb) printlrulist buf_pool_ptr[2]
+printlrulist buf_pool_ptr[2]
Buffer Pool Instance: (buf_pool_t *) 0x3431228
Total Pages in LRU:
80882
(gdb) printlrulist buf_pool_ptr[3]
+printlrulist buf_pool_ptr[3]
Buffer Pool Instance: (buf_pool_t *) 0x3431ad0
Total Pages in LRU:
80883
We can see that instance 1 has one page that is not ready to be freed, while all the other instances have all of their 80k+ pages in a free state.
Checking the page in question, we can see that it still has buf_fix_count == 1, while the other two conditions are fine:
(gdb) p ((buf_page_t *) 0x7f26d091db00)->io_fix
$4 = BUF_IO_NONE
(gdb) p ((buf_page_t *) 0x7f26d091db00)->oldest_modification
$5 = 0
(gdb) p ((buf_page_t *) 0x7f26d091db00)->buf_fix_count
$6 = 1
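Reusing the page_ready_for_replace() sketch from above, the same verdict can be obtained programmatically (the address comes from the printlrulist output for instance 1):

import gdb

# Suspect page flagged by printlrulist in buffer pool instance 1.
page = gdb.parse_and_eval("(buf_page_t *) 0x7f26d091db00")
print(page_ready_for_replace(page))  # False: buf_fix_count is still 1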
buf_fix_count is decremented when a mini-transaction (mtr) commits, just as in the fix from the bug report mentioned above. However, we are not facing the same issue here, because the page in question is not a temporary page: it belongs to space 64, while the temporary tablespace has ID 62:
(gdb) p ((buf_page_t *) 0x7f26d091db00)->id
+p ((buf_page_t *) 0x7f26d091db00)->id
$2588318 = {
m_space = 64,
m_page_no = 333376,
m_fold = 67442304
}
(gdb) p srv_tmp_space.m_space_id
+p srv_tmp_space.m_space_id
$2588319 = 62
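That comparison can likewise be scripted from the session, using the fields shown above (m_space inside the page id versus srv_tmp_space.m_space_id):

import gdb

page = gdb.parse_and_eval("(buf_page_t *) 0x7f26d091db00")
space_id = int(page['id']['m_space'])
tmp_space_id = int(gdb.parse_and_eval("srv_tmp_space.m_space_id"))
# False: space 64 is a regular tablespace, not the temporary one (62).
print(space_id == tmp_space_id)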
We have only 3 groups of threads running:
(gdb) info thread
Id Target Id Frame
* 1 LWP 9773 0x000000000105ec27 in logs_empty_and_mark_files_at_shutdown () at /usr/src/debug/Percona-XtraDB-Cluster-5.7.34-31.51/storage/innobase/log/log0log.cc:2790
2 LWP 10897 0x00007f278f92f644 in __io_getevents_0_4 () from /bigdisk/CS0029155/tmp/libaio.so.1
3 LWP 9992 0x00007f278dc3e58a in sigwaitinfo () from /bigdisk/CS0029155/tmp/libc.so.6
4 LWP 10900 0x00007f278f92f644 in __io_getevents_0_4 () from /bigdisk/CS0029155/tmp/libaio.so.1
5 LWP 10902 0x00007f278f92f644 in __io_getevents_0_4 () from /bigdisk/CS0029155/tmp/libaio.so.1
6 LWP 10896 0x00007f278f92f644 in __io_getevents_0_4 () from /bigdisk/CS0029155/tmp/libaio.so.1
7 LWP 10901 0x00007f278f92f644 in __io_getevents_0_4 () from /bigdisk/CS0029155/tmp/libaio.so.1
8 LWP 10905 0x00007f278f92f644 in __io_getevents_0_4 () from /bigdisk/CS0029155/tmp/libaio.so.1
9 LWP 10898 0x00007f278f92f644 in __io_getevents_0_4 () from /bigdisk/CS0029155/tmp/libaio.so.1
10 LWP 10903 0x00007f278f92f644 in __io_getevents_0_4 () from /bigdisk/CS0029155/tmp/libaio.so.1
11 LWP 10904 0x00007f278f92f644 in __io_getevents_0_4 () from /bigdisk/CS0029155/tmp/libaio.so.1
12 LWP 11285 0x00007f278fb3ca35 in pthread_cond_wait@@GLIBC_2.3.2 () from /bigdisk/CS0029155/tmp/libpthread.so.0
13 LWP 10899 0x00007f278f92f644 in __io_getevents_0_4 () from /bigdisk/CS0029155/tmp/libaio.so.1
IO threads (all idle), the InnoDB main thread and the mysqld main thread; no user threads. This shows that there is a code path through which we can leave a page with buf_fix_count incremented without decrementing it; however, it is not possible to reach a conclusion based on this coredump alone, as it shows the instance state after the fact.
Another important point is that this particular page came from the LRU list: it has BUF_IO_NONE, it is not dirty (has no modifications to it), and it is not part of the flush_list (the list that contains dirty pages).
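For completeness, the per-instance traversal can be driven across every buffer pool instance in one pass; a minimal sketch reusing the printlrulist command defined above (srv_buf_pool_instances and buf_pool_ptr are the globals shown in the session):

import gdb

# Run the LRU check once for each buffer pool instance.
n_instances = int(gdb.parse_and_eval("srv_buf_pool_instances"))
for i in range(n_instances):
    gdb.execute("printlrulist buf_pool_ptr[%d]" % i)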
Description
Server crashes while shutting down InnoDB; very similar to https://bugs.mysql.com/bug.php?id=95285, but that was fixed in 5.7.31 and the customer is running PXC 5.7.34.
Fix commit: https://github.com/mysql/mysql-server/commit/b161d236ede9e087642c35a28818ebe7636c27fc
And we can see the code is present in PXC 5.7.34: https://github.com/percona/percona-xtradb-cluster/blob/Percona-XtraDB-Cluster-5.7.34-31.51/storage/innobase/row/row0mysql.cc#L2720
Backtrace of the crashing thread:
(gdb) bt
#0  0x00007f278fb3daa1 in pthread_kill () from /bigdisk/CS0029155/tmp/libpthread.so.0
#1  0x0000000000d713a5 in handle_fatal_signal (sig=6) at /usr/src/debug/Percona-XtraDB-Cluster-5.7.34-31.51/sql/signal_handler.cc:267
#2  <signal handler called>
#3  0x00007f278dc3d387 in raise () from /bigdisk/CS0029155/tmp/libc.so.6
#4  0x00007f278dc3ea78 in abort () from /bigdisk/CS0029155/tmp/libc.so.6
#5  0x000000000078aa32 in ut_dbg_assertion_failed (expr=expr@entry=0x0, file=file@entry=0x162b430 "/mnt/workspace/percona-xtradb-cluster-5.7-redhat-binary/label/min-centos-7-x64/rpmbuild/BUILD/Percona-XtraDB-Cluster-5.7.34-31.51/storage/innobase/ut/ut0ut.cc", line=line@entry=924) at /usr/src/debug/Percona-XtraDB-Cluster-5.7.34-31.51/storage/innobase/ut/ut0dbg.cc:75
#6  0x00000000011ae5ed in ib::fatal::~fatal (this=0x0, __in_chrg=<optimized out>) at /usr/src/debug/Percona-XtraDB-Cluster-5.7.34-31.51/storage/innobase/ut/ut0ut.cc:924
#7  0x00000000011f25ea in buf_all_freed_instance (buf_pool=<optimized out>) at /usr/src/debug/Percona-XtraDB-Cluster-5.7.34-31.51/storage/innobase/buf/buf0buf.cc:6380
#8  0x00000000011f265c in buf_all_freed () at /usr/src/debug/Percona-XtraDB-Cluster-5.7.34-31.51/storage/innobase/buf/buf0buf.cc:7360
#9  0x000000000105ec27 in logs_empty_and_mark_files_at_shutdown () at /usr/src/debug/Percona-XtraDB-Cluster-5.7.34-31.51/storage/innobase/log/log0log.cc:2790
#10 0x00000000010800a8 in OSMutex::destroy (this=<optimized out>) at /usr/src/debug/Percona-XtraDB-Cluster-5.7.34-31.51/storage/innobase/include/sync0types.h:467