No BF-abort but 'MDL conflict ... solved by abort' printed

Description

https://forums.percona.com/t/another-mdl-conflict-db-table-ticket-10-solved-by-abort-issue/25517

We have the same problem after updating the cluster from version 8.0.32-24.2 to version 8.0.33-25.

Please note that we do not use the garbd package as described in

https://docs.percona.com/percona-xtradb-cluster/8.0/release-notes/8.0.33-25.upd.html

Environment

None

Activity

Kamil Holubicki October 12, 2023 at 12:58 PM
Edited

Hi,

Thank you for providing me with the log files.

What I see inside is this pattern:

It happens when we do something like this:

node_1:

node_2:

If the query from node_2 happens to be applied while the SELECT on node_1 is still running, you will see this SELECT in the log.

If the SELECT has already finished (note that the transaction still holds its MDL locks), node_1's transaction will be aborted, but in the logs we will see something like this:

What happened is that your local transaction was aborted by a replicated transaction, which is expected.

Case 2:

node_1, session 1:

node_1, session 2:

This time, session 2 will wait for session 1 to finish its transaction. In the logs you will see:

In this case the log message is wrong. Nothing was aborted; session 2 simply waits on the MDL lock until session 1 finishes.

Arkadiusz Petruczynik October 11, 2023 at 11:05 AM

Ad. 1 During normal workload.

Ad. 2 Every time the application creates views. I am attaching the error log before the node disconnected.

Ad. 3 Similar operations can be performed based on the attached log.

Ad. 4,5 We will try to do it ASAP in a test environment. For now, in production, we have redirected the application creating views (we only have one) to the Percona 8.0.34 standalone server.

Kamil Holubicki October 11, 2023 at 8:40 AM
Edited

Hi,

I'm not sure if this is the return of the earlier issue, as I can't reproduce it using that issue's steps to reproduce. But it does look similar.

A few questions:

  1. Does it happen during the upgrade of the node, or during normal workload?

  2. How often does it happen?

  3. Are you able to provide some deterministic steps to reproduce the issue?

  4. Could you start the server with wsrep_debug=1 and collect the logs around this issue occurrence?

  5. Do you see any abnormal behavior after this log? Something doesn't work, crashes, asserts, etc?

I analyzed the code a bit and here are my findings:

It seems that the message itself is wrong and scares people when it shouldn't, by saying that the ticket was solved by abort. The message is printed when wsrep_handle_mdl_conflict() returns false. If we look inside wsrep_handle_mdl_conflict(), we see that returning false doesn't necessarily mean that the victim transaction was aborted. In fact, this function always returns false, but it does not always BF-abort. A false return only means that the requestor thread, after calling wsrep_handle_mdl_conflict(), needs to wait for the MDL lock.
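
To make the control flow above easier to follow, here is a minimal toy model in Python. It is not the server code: `Txn`, `toy_handle_mdl_conflict`, `request_mdl_lock` and `run_case` are hypothetical names, and the single "replicated requestor vs. local holder" condition is a simplifying assumption standing in for the real checks inside wsrep_handle_mdl_conflict(). It only illustrates why the same message appears whether or not anything was BF-aborted.

```python
from dataclasses import dataclass


@dataclass
class Txn:
    name: str
    is_replicated: bool  # True for a replicated (applier) transaction
    aborted: bool = False


def toy_handle_mdl_conflict(requestor: Txn, holder: Txn) -> bool:
    """Toy stand-in for the conflict handler (assumption, not the real logic).

    It BF-aborts the lock holder only when a replicated transaction
    conflicts with a local one, yet it returns False in every case.
    """
    if requestor.is_replicated and not holder.is_replicated:
        holder.aborted = True  # the replicated transaction wins
    return False


def request_mdl_lock(requestor: Txn, holder: Txn, log: list) -> None:
    """Toy caller: logs the message whenever the handler returns False."""
    if not toy_handle_mdl_conflict(requestor, holder):
        log.append("MDL conflict ... solved by abort")
    # In both cases the requestor now has to wait for the MDL lock.


def run_case(requestor_is_replicated: bool):
    """Run one conflict and report (holder_was_aborted, log_lines)."""
    holder = Txn("holder", is_replicated=False)
    requestor = Txn("requestor", is_replicated=requestor_is_replicated)
    log = []
    request_mdl_lock(requestor, holder, log)
    return holder.aborted, log


if __name__ == "__main__":
    print(run_case(True))   # replicated requestor: holder is aborted
    print(run_case(False))  # local requestor: nobody is aborted
```

In this sketch both calls produce an identical log line, while only the first one actually aborts the holder, which is the mismatch between the message and the behavior described in this comment.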
 

Done

Details

Needs Review

Yes

Needs QA

Yes

Created October 10, 2023 at 11:21 AM
Updated June 6, 2024 at 7:58 AM
Resolved January 16, 2024 at 6:17 PM