2 cluster nodes go into NON-PRIMARY state: MDL BF-BF conflict

Description

We have a 3-node + 1 arbitrator cluster.

During an online schema change we hit the following errors on 2 of the nodes; the logs from all 3 nodes are included below.

001:

002:

003:

This happened during a load test. We also ran some no-op DDL changes with online schema change (tens of them), and not every one of them broke.

We have HAProxy in front of the nodes and do read/write splitting at the DNS level. Node 001 receives all write traffic, while nodes 002 and 003 receive read-only traffic.
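
As a triage aid for this kind of incident, the sketch below flags nodes that are outside the primary component, based on the `wsrep_cluster_status` status variable (Galera reports `Primary` for nodes in the primary component). The node names and the input dictionary are illustrative assumptions, not values from this report; in practice they would be collected by running `SHOW GLOBAL STATUS LIKE 'wsrep_cluster_status'` on each node.

```python
def non_primary_nodes(status_by_node):
    """Return the sorted names of nodes whose wsrep_cluster_status is not 'Primary'."""
    return sorted(
        node
        for node, status in status_by_node.items()
        if status.strip().lower() != "primary"
    )


# Hypothetical values, as if gathered from each node with
# SHOW GLOBAL STATUS LIKE 'wsrep_cluster_status'
observed = {"node-a": "Primary", "node-b": "non-Primary", "node-c": "Disconnected"}
print(non_primary_nodes(observed))
```

Running a check like this against every node behind HAProxy would show at a glance which members dropped out, independent of which node was taking writes.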

Environment

None

Details

Needs QA

Yes

Time tracking

20m logged

Created September 26, 2024 at 3:37 PM
Updated March 2, 2025 at 8:00 PM

Activity

Kamil Holubicki December 13, 2024 at 4:07 PM
Edited

Hi, the issue didn’t progress.

Please follow these instructions to upload.

Scott Hooper December 13, 2024 at 2:45 PM

Does anyone know whether this issue goes away with 8.0.39, which was just released?

Scott Hooper December 2, 2024 at 1:18 PM

We have a Jenkins job that triggers this issue. We added sleeps to slow it down, but it still happens from time to time, so we mostly run it off-hours to limit the impact. If you can give me a secure SFTP link, like the ones you provide for memory dumps, I can upload the Jenkins job's script.

Kamil Holubicki November 28, 2024 at 3:11 PM

In my opinion we have two similar situations:

Case 1:

This should be fixed in 8.0.36, 8.4.0 by

Case 2:

Here we again see a local thread (granted) being aborted by the replication thread. It seems the granted thread didn’t release all of its MDL locks before releasing the CommitOrder monitor in Galera (that is, before letting the other thread, the replication thread, proceed). So most probably there is yet another execution path similar to the one fixed in .
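
To make the suspected ordering problem concrete, here is a toy model in Python. It is not Galera or server code: the event names (`acquire_mdl`, `release_mdl`, `leave_commit_order`) and lock names are hypothetical labels for the steps described above. The checker reports any MDL lock a thread still holds at the moment it leaves the commit-order monitor, which is the window in which the replication thread could collide with it.

```python
def locks_held_on_leave(trace):
    """Return the MDL locks still held when the thread leaves the commit-order monitor.

    trace is a list of (event, lock_name) tuples; lock_name is None for
    'leave_commit_order'. Event names are hypothetical, for illustration only.
    """
    held = set()
    for event, lock in trace:
        if event == "acquire_mdl":
            held.add(lock)
        elif event == "release_mdl":
            held.discard(lock)
        elif event == "leave_commit_order":
            return sorted(held)
    return []  # the monitor was never left in this trace


# Expected ordering: every MDL lock is released before leaving the monitor.
good = [("acquire_mdl", "t1"), ("release_mdl", "t1"), ("leave_commit_order", None)]
# Suspected ordering: the monitor is left while an MDL lock is still held.
bad = [("acquire_mdl", "t1"), ("leave_commit_order", None), ("release_mdl", "t1")]

print(locks_held_on_leave(good))
print(locks_held_on_leave(bad))
```

An empty result corresponds to the ordering the fix is meant to guarantee; a non-empty result corresponds to the window suspected in Case 2.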

Reproduction steps would be very helpful.

Aaditya Dubey November 28, 2024 at 2:54 PM
Edited

Hello

I tried to reproduce the issue on my end, but unfortunately could not. My test case is below:

  1. Create the following tables:

  2. Add the data to tables:

  3. On session 1, run sysbench:

  4. On session 2, run optimize table:

  5. On session 3, run delete-insert:

  6. On session 4, run rename table:

  7. After some time, restart either node2 or node3.

Please check and let me know if it is the same scenario that is being executed in your environment. If not, please let me know what other processes need to run or if there is any specific order that can be used to repeat the issue.
