MySQL crash due to getting a negative index from get_mutex_cond in group replication
Activity
Venkatesh Prasad April 22, 2024 at 8:49 AM(edited)
Code analysis shows that m_sidno is set to -1 in two places:
In handle_transaction_id()
In Blocked_transaction_handler::unblock_waiting_transactions()
1. handle_transaction_id()
When a client thread executes a transaction commit, it first calls the group_replication_before_commit hook for certification, where a group GTID is assigned. The client thread waits in transaction_latch->waitTicket() while the applier thread performs the certification, fills in the waiting thread's transaction_context structure, and signals it to proceed.
If the applier thread does not see a seq_number assigned, it sets m_rollback_transaction = true and m_sidno = -1, and signals the waiting thread to perform a rollback.
Case 1: transaction is positively certified, seq_number > 0
Behavior: seq_number is set in THD::transaction_context
Case 2: transaction is negatively certified, seq_number = 0
Behavior: rollback_transaction is set and m_sidno is set to -1.
Conclusion: There is nothing wrong in handle_transaction_id().
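The two cases above can be sketched as follows. This is a simplified model for discussion only, not the server code: TransactionContextSketch and apply_certification_result are hypothetical names, and the context is reduced to the three fields mentioned in this comment.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical, simplified stand-in for the per-transaction context
// discussed above. It is NOT the real server structure.
struct TransactionContextSketch {
  bool m_rollback_transaction = false;
  int m_sidno = 0;
  std::int64_t seq_number = 0;
};

// Models the two certification outcomes described in the comment:
//   seq_number > 0  -> positively certified, seq_number is stored
//   seq_number == 0 -> negatively certified, rollback flag set, m_sidno = -1
inline void apply_certification_result(TransactionContextSketch &ctx,
                                       std::int64_t seq_number) {
  if (seq_number > 0) {
    ctx.seq_number = seq_number;
  } else {
    ctx.m_rollback_transaction = true;
    ctx.m_sidno = -1;
  }
}
```

In this model m_sidno = -1 never appears without m_rollback_transaction = true, which matches the conclusion that this path on its own is consistent.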
2. unblock_waiting_transactions()
This function also makes sure to set the waiting transaction's m_rollback_transaction before unblocking it.
Conclusion: There is nothing wrong in unblock_waiting_transactions().
However, in the core dump analysis, the transaction_context of the crashing thread has m_rollback_transaction set to true, meaning the unblock was successful.
So the only possibility I see is that the client thread passed the certification test and proceeded with the commit, but before it reached generate_automatic_gtid(), another thread may have called unblock_waiting_transactions() (maybe as part of Plugin_gcs_events_handler::was_member_expelled_from_group() -> leave_group_on_failure::leave()), thereby resulting in the crash.
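The suspected interleaving can be illustrated with a deterministic, single-threaded sketch. It only models the hypothesis stated above and has not been confirmed against a reproduction. CtxSketch, unblock_sketch and commit_with_late_unblock are hypothetical names, and the sidno value 2 is an arbitrary example.

```cpp
#include <cassert>

// Hypothetical, reduced model of the transaction context.
struct CtxSketch {
  bool m_rollback_transaction = false;
  int m_sidno = 0;
};

// Models what the unblock path does to a transaction it believes is waiting.
inline void unblock_sketch(CtxSketch &ctx) {
  ctx.m_rollback_transaction = true;
  ctx.m_sidno = -1;
}

// Replays the suspected ordering and returns the sidno that the commit path
// would end up using (0 means the commit did not proceed).
inline int commit_with_late_unblock(bool unblock_races_in) {
  CtxSketch ctx;
  ctx.m_sidno = 2;  // assumed: transaction was positively certified

  // Step 1: the client thread checks the rollback flag and decides to commit.
  const bool proceed = !ctx.m_rollback_transaction;

  // Step 2: another thread unblocks "waiting" transactions after that check.
  if (unblock_races_in) unblock_sketch(ctx);

  // Step 3: the commit path reads m_sidno to generate the GTID.
  return proceed ? ctx.m_sidno : 0;
}
```

With unblock_races_in set to true, the commit path proceeds but reads -1, and the context ends up with m_rollback_transaction set to true, which is the state seen in the core dump.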
A few observations:
Member got expelled due to network failure and server was set to ERROR state
New transactions failed with
Existing transactions failed with
In the end there was also a problem while unblocking waiting threads in unblock_waiting_transactions().
This is most likely due to the return value of releaseTicket(), meaning that the key was not found in the map. (Need to check why.)
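For reference, a minimal sketch of how a ticket map can report "key not found" on release. TicketRegistrySketch is a hypothetical model, not the plugin's ticket class, and the convention that true means error is an assumption made for this example. It shows only one conceivable way for the key to be missing (the ticket was already released). The actual cause here still needs to be checked.

```cpp
#include <cassert>
#include <map>
#include <mutex>

// Hypothetical model of a ticket registry keyed by thread id.
// Assumed convention for this sketch: false = success, true = error.
class TicketRegistrySketch {
 public:
  // Registers a ticket; fails if the key is already present.
  bool registerTicket(unsigned long key) {
    std::lock_guard<std::mutex> guard(m_lock);
    return !m_tickets.emplace(key, true).second;
  }

  // Releases a ticket; fails if the key is not in the map, for example
  // because it was never registered or was already released.
  bool releaseTicket(unsigned long key) {
    std::lock_guard<std::mutex> guard(m_lock);
    auto it = m_tickets.find(key);
    if (it == m_tickets.end()) return true;
    m_tickets.erase(it);
    return false;
  }

 private:
  std::mutex m_lock;
  std::map<unsigned long, bool> m_tickets;
};
```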
jinyou.ma April 18, 2024 at 2:21 AM
I'm still unable to reproduce this issue. I will send a message to get the cluster environment.
jinyou.ma April 16, 2024 at 5:51 AM
I have not found a way to reproduce this issue. I will try tomorrow and provide more information.
On the customer side, the issue occurred after the unblock_waiting_transactions.
Description
MySQL crashed due to an invalid memory address in inline_mysql_mutex_lock.
The backtrace is
The invalid address of that in frame 3 (inline_mysql_mutex_lock) is below:
The main reason is getting a negative index (-1) from an array. Because n is -1, ret is invalid. n should always be a positive number.
In frame 6, automatic_gtid.sidno is -1.
In MYSQL_BIN_LOG::process_flush_stage_queue, m_sidno is -1.
This is because group replication sets m_sidno to -1.
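To illustrate the failure mode, below is a sketch of an array lookup that rejects a negative index instead of reading outside the array. MutexCondArraySketch and get_mutex_cond_checked are hypothetical names used for illustration only. This is neither the server's get_mutex_cond nor a proposed patch.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical element type standing in for a mutex/condition pair.
struct MutexCondSketch {
  int id;
};

// Hypothetical array wrapper with a bounds-checked lookup.
class MutexCondArraySketch {
 public:
  explicit MutexCondArraySketch(int size) {
    for (int i = 0; i < size; ++i) m_array.push_back(MutexCondSketch{i});
  }

  // Returns nullptr for a negative or out-of-range index. An unchecked
  // lookup with n == -1 would read memory before the array and hand back
  // an invalid pointer, which is the situation described above.
  const MutexCondSketch *get_mutex_cond_checked(int n) const {
    if (n < 0 || static_cast<std::size_t>(n) >= m_array.size()) return nullptr;
    return &m_array[static_cast<std::size_t>(n)];
  }

 private:
  std::vector<MutexCondSketch> m_array;
};
```

A caller that receives nullptr can fail the commit cleanly rather than attempting to lock an invalid mutex.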