LP #1500176: innodb main thread locks itself in &dict_operation_lock

Description

Reported in Launchpad by Rick Pizzi; last updated 25-07-2017 09:15:27

Yesterday our master hit a severe stall that ended with InnoDB deliberately crashing itself on an assertion:

InnoDB: Error: semaphore wait has lasted > 600 seconds
InnoDB: We intentionally crash the server, because it appears to be hung.
2015-09-26 20:54:46 7efda0ca3700 InnoDB: Assertion failure in thread 139627789432576 in file srv0srv.cc line 2128

The InnoDB engine status dump (attached to this bug) shows the entire InnoDB subsystem locked up, with the main thread as the blocker, which seems stuck in the state "enforcing dict cache limit".

Looking at the error log, what puzzles me is the following (139627810412288 is the main thread):

--Thread 139627810412288 has waited at srv0srv.cc line 2596 for 241.00 seconds the semaphore:
X-lock (wait_ex) on RW-latch at 0x134c460 '&dict_operation_lock'
a writer (thread id 139627810412288) has reserved it in mode wait exclusive
number of readers 7, waiters flag 1, lock_word: fffffffffffffff9
Last time read locked in file row0ins.cc line 1803
Last time write locked in file /mnt/workspace/percona-server-5.6-redhat-binary/label_exp/centos6-64/rpmbuild/BUILD/percona-server-5.6.25-73.1/storage/innobase/dict/dict0stats.cc line 2385
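
For what it's worth, the lock_word value is consistent with the reader count: if I read the 5.6 rw-lock encoding correctly (lock_word starts at X_LOCK_DECR, every S-lock subtracts 1, and a writer entering wait_ex subtracts X_LOCK_DECR), then a negative lock_word means a writer has reserved the latch and -lock_word readers still hold it. A tiny sketch of the arithmetic, not InnoDB code:

#include <cstdint>
#include <cstdio>

int main() {
    // Value copied from the status excerpt above; the decoding assumes the
    // 5.6 encoding described in the text (negative lock_word => a wait_ex
    // writer has reserved the latch, -lock_word readers still hold S-locks).
    const int64_t lock_word = static_cast<int64_t>(0xfffffffffffffff9ULL);

    if (lock_word < 0) {
        std::printf("wait_ex: writer reserved, %lld reader(s) still inside\n",
                    static_cast<long long>(-lock_word));   // prints 7
    }
    return 0;
}

0xfffffffffffffff9 is -7 as a signed 64-bit value, which matches the "number of readers 7" line above.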

From the above it seems that the blocker was blocked by itself, which is quite odd: the thread already holds &dict_operation_lock, yet it is trying to acquire it a second time. That is a self-deadlock.
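
To illustrate the pattern in isolation, here is a minimal sketch in plain C++ (a toy latch of my own, not InnoDB code): a non-recursive rw-latch grants the X-lock only once its reader count drops to zero, so a thread that still holds an S-lock on the same latch and then requests the X-lock ends up waiting for itself. The thread-local bookkeeping exists only so the demo reports the problem instead of hanging:

#include <condition_variable>
#include <iostream>
#include <mutex>
#include <stdexcept>

// Toy non-recursive rw-latch, standing in for InnoDB's rw_lock_t.
class RwLatch {
public:
    void s_lock() {
        std::unique_lock<std::mutex> g(m_);
        cv_.wait(g, [this] { return !writer_; });
        ++readers_;
        ++my_s_locks_;   // remember that *this* thread holds an S-lock
    }

    void s_unlock() {
        {
            std::lock_guard<std::mutex> g(m_);
            --readers_;
        }
        --my_s_locks_;
        cv_.notify_all();
    }

    void x_lock() {
        // Demo-only check for the pattern described in the report: the caller
        // still holds an S-lock on this latch, so the wait for readers_ == 0
        // below could never be satisfied.
        if (my_s_locks_ > 0) {
            throw std::runtime_error(
                "self-deadlock: this thread already holds the latch in S mode");
        }
        std::unique_lock<std::mutex> g(m_);
        cv_.wait(g, [this] { return !writer_ && readers_ == 0; });
        writer_ = true;
    }

    void x_unlock() {
        {
            std::lock_guard<std::mutex> g(m_);
            writer_ = false;
        }
        cv_.notify_all();
    }

private:
    std::mutex m_;
    std::condition_variable cv_;
    int readers_ = 0;
    bool writer_ = false;
    static thread_local int my_s_locks_;   // S-locks held by the current thread
};

thread_local int RwLatch::my_s_locks_ = 0;

int main() {
    RwLatch dict_operation_lock;        // stand-in for &dict_operation_lock

    dict_operation_lock.s_lock();       // e.g. the S-lock taken in row0ins.cc
    try {
        dict_operation_lock.x_lock();   // e.g. the X-lock requested in dict0stats.cc
    } catch (const std::exception& e) {
        std::cout << e.what() << std::endl;
    }
    dict_operation_lock.s_unlock();
    return 0;
}

In the stalled server nothing breaks such a wait, which is why it only ends when the 600-second semaphore watchdog fires the assertion quoted above.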

Running Percona Server 5.6.25-73.1-log. my.cnf and the error log are attached.

Additional info that may help with troubleshooting:
At the time the problem occurred, an online schema change was running and was about to complete.
The table being altered was a very large (1 billion rows) partitioned table. It is likely that the OSC tool was about to execute, or was executing, the atomic rename of tables, though this is only speculation, since the operation never completed and is therefore not recorded in the binary logs or anywhere else.

Thanks
Rick

Environment

None


Activity


Jira Bot March 29, 2020 at 4:15 PM

Hello,
It's been 60 days since this issue went into Incomplete and we haven't heard
from you on this.

At this point, our policy is to Close this issue, to keep things from getting
too cluttered. If you have more information about this issue and wish to
reopen it, please reply with a comment containing "jira-bot=reopen".

Jira Bot March 13, 2020 at 8:56 AM

Hello,
It's jira-bot again. Your bug report is important to us, but we haven't heard
from you since the previous notification. If we don't hear from you on
this in 7 days, the issue will be automatically closed.

Jira Bot February 27, 2020 at 8:56 AM

Hello,
I'm jira-bot, Percona's automated helper script. Your bug report is important
to us but we've been unable to reproduce it, and asked you for more
information. If we haven't heard from you on this in 3 more weeks, the issue
will be automatically closed.

Lalit Choudhary January 29, 2020 at 8:03 AM

Hello Rick,

Let us know if this issue still exists for you. If yes, please provide the following details:

  1. The full output from when it crashed.

  2. SHOW ENGINE INNODB STATUS output (captured at regular intervals while it is hung, or before that).

  3. A core dump file / gdb stack trace output file.

Satya Bodapati April 3, 2019 at 9:05 AM

Please move this one to New/Confirmed status.

Incomplete

Details


Created January 24, 2018 at 8:31 AM
Updated March 6, 2024 at 2:00 PM
Resolved March 29, 2020 at 4:15 PM