WSREP: BF lock wait long
Description
We are running a Percona XtraDB Cluster consisting of three equal nodes. We are on version 5.7.25-28-57-log Percona XtraDB Cluster (GPL), Release rel28, Revision a2ef85f, WSREP version 31.35, wsrep_31.35.
Since we started operating the cluster we have encountered the well-known problem of WSREP: BF lock wait long. The symptoms are commonly known and described on DB forums. The node causing the problem tries to commit a transaction and the commit never finishes, and the entire cluster stops writing changes. The only way out of this situation is to hard-kill the problematic node. This happens several times a day. The hanging transactions differ from case to case. We do not have any massive TRUNCATE TABLE in the transaction that causes the stall. The stuck transactions are not any bigger than the other transactions in the system (they are rather smaller ones), and those other transactions do not cause any problems.
We looked through various DB forums for a solution. We found several cases with exactly the same symptoms, but with no evidence of the cause, nor any clue pointing to a fix.
Does anyone have an idea of how to fix this problem?
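For reference, when a hang starts we check the nodes roughly as follows (an illustrative sketch, not a transcript from our servers):

-- On each node, look for a COMMIT that never finishes while
-- "WSREP: BF lock wait long" keeps repeating in the error log:
SHOW PROCESSLIST;

-- Galera/wsrep status on the node: a healthy node reports "Synced", and a
-- wsrep_flow_control_paused value close to 1.0 means replication is effectively stalled:
SHOW GLOBAL STATUS LIKE 'wsrep_local_state_comment';
SHOW GLOBAL STATUS LIKE 'wsrep_flow_control_paused';

-- Lock waits held by the hanging transaction:
SHOW ENGINE INNODB STATUS\G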
Environment
5.7.25-28-57-log Percona XtraDB Cluster (GPL), Release rel28, Revision a2ef85f, WSREP version 31.35, wsrep_31.35
Linux stable-db7 3.16.0-6-amd64 #1 SMP Debian 3.16.57-2 (2018-07-14) x86_64 GNU/Linux
Attachments
- 21 Nov 2019, 07:42 AM
- 14 May 2019, 12:23 PM
- 07 May 2019, 07:34 PM
- 07 May 2019, 07:34 PM
- 07 May 2019, 07:34 PM
- 07 May 2019, 07:34 PM
Activity

Robert Roth May 20, 2020 at 9:00 AM
Hi everyone, even though this was closed as incomplete, I am commenting here on our current status with some good news. After 5.6.47 came out, I checked the changelog and identified that #2684 (fixed in 5.6.47) closely matches our use case (heavy use of stored procedures, with occasional updates), so we updated quickly. We then allowed full load on the updated node (until now, on the cluster node that hung most often we allowed only one connection) so that it would be loaded just like the other members (all of them running 5.6.36 and not showing BF lock wait long issues). We have not had BF lock wait issues since, with 8 days of uptime so far; before the update the cluster hung about once a week, and in some cases even twice a day.

Jira Bot April 4, 2020 at 10:56 AM
Hello @Michal Hornych,
It's been 52 days since this issue went into Incomplete and we haven't heard
from you on this.
At this point, our policy is to Close this issue, to keep things from getting
too cluttered. If you have more information about this issue and wish to
reopen it, please reply with a comment containing "jira-bot=reopen".

Jan Švercl March 27, 2020 at 12:54 PM
Hi,
let me recap our situation regarding the "BF lock wait long" error.
We installed and started using Percona Cluster in our production environment in March 2019. We are using a cluster consisting of 3 nodes with the same hardware configuration. At that time, we used all 3 nodes for writes (write-write-write configuration).
Since the "BF lock wait long" error occurred often, we reconfigured the proxy so that SQL writes were performed primarily by only one node, while the others were used for reads (write-read-read configuration).
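For illustration, the split looks roughly like the following. We did not include our real proxy configuration in this report; the sketch below assumes ProxySQL and uses made-up hostnames and a made-up user name:

-- Writer in hostgroup 10, readers in hostgroup 20 (hostnames are placeholders):
INSERT INTO mysql_servers (hostgroup_id, hostname, port) VALUES
  (10, 'pxc-node1', 3306),
  (20, 'pxc-node2', 3306),
  (20, 'pxc-node3', 3306);

-- Keep locking reads on the writer, send plain SELECTs to the readers:
INSERT INTO mysql_query_rules (rule_id, active, match_digest, destination_hostgroup, apply) VALUES
  (1, 1, '^SELECT.*FOR UPDATE', 10, 1),
  (2, 1, '^SELECT',             20, 1);

-- The application user's writes default to the writer hostgroup ('app_user' is hypothetical):
UPDATE mysql_users SET default_hostgroup = 10 WHERE username = 'app_user';

LOAD MYSQL SERVERS TO RUNTIME;     SAVE MYSQL SERVERS TO DISK;
LOAD MYSQL QUERY RULES TO RUNTIME; SAVE MYSQL QUERY RULES TO DISK;
LOAD MYSQL USERS TO RUNTIME;       SAVE MYSQL USERS TO DISK;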
In this configuration, stability improved significantly and the "BF lock wait long" error disappeared. But there was another error that froze the entire server (see my comment from 21 Nov 2019). It was also fixed (big thanks to Krunal Bauskar). The error was caused by setting the timezone after each connection was established; we have disabled that.
Now the server is running smoothly, but still in the write-read-read configuration. It is possible that the frequent timezone setting also caused the "BF lock wait long" problem, but this cannot be easily confirmed.
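For illustration, this is roughly the kind of per-connection statement we disabled (an assumption; the exact statement and time zone name are not part of this report):

-- Assumed example, not copied from our application: a session time zone was set
-- on every newly established connection, e.g. via a connection init statement:
SET SESSION time_zone = 'Europe/Prague';

-- After disabling the per-connection statement, the time zone is left to the
-- server-wide default, configured once in my.cnf:
--   [mysqld]
--   default-time-zone = 'Europe/Prague'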
We are continuously installing updates, so it is possible that some of them have fixed the error, but unfortunately we cannot confirm that either.
We are considering trying to use the cluster in write-write-write mode, where all nodes run write queries.
If the problem occurs again in the future, I'll write here again.
Thanks to everyone for the comments.
Jan

Jira Bot March 27, 2020 at 10:56 AM
Hello @Michal Hornych,
It's jira-bot again. Your bug report is important to us, but we haven't heard
from you since the previous notification. If we don't hear from you on
this in 7 days, the issue will be automatically closed.

Jira Bot March 12, 2020 at 9:56 AM
Hello @Michal Hornych,
I'm jira-bot, Percona's automated helper script. Your bug report is important
to us but we've been unable to reproduce it, and asked you for more
information. If we haven't heard from you on this in 3 more weeks, the issue
will be automatically closed.