Data inconsistencies when high rate of pages split/merge

General

Escalation

General

Escalation

Description

Over many months, we have been chasing a strange bug affecting a versioned key/value store. For legacy reasons, the store encodes the version in reverse order as "MaxUnsignedBigint - version. Because the values are often retrieved by ranges of keys, the following form of where clause is used: "where key >= 1 and key <= 1". Here, the upper and lower bound are the same, this should be logically equivalent to "where key = 1", but we found it is not the case.

For a given key, the version is only increasing. We noticed that some times, we get either no rows or a value smaller than the previous highest version observed (watermark).

So, here's what we know about the bug:

it happens when there is page split/merge activity involving blob in row prefix
it happens only when the reverse encoding of the version is used
it happens only when the double range where clause is used.

Included in this report is the repro_bug.go program that can be used to reproduce the bug. The program creates a table, insert 3 rows for key 0, 1 and 2 with version 0 and then starts 4 threads:

One updating the value of key = 0
One update the value of key = 2
One inserting news rows with higher versions for key = 1
One reading the highest version for key = 1 and recording the highwatermark. It also deletes older versions at every 20 iterations.

In order to reproduce the query:

start a MySQL 8 instances, here with docker:

Run the program with the following arguments:

The output should like like:

Different runs will fail at different version values but it is usually within one minute. As we see above, versiong 367 was observed in the previous round but not in the last. Looking at the database directly:

The row for version 367 has not been removed but yet not returned.

Once in while, maybe 10% of the times, the version returned will be 0, here’s an example:

Which is kind of strange as the only row inserted with version=0 for id = 1 was the initial one, a row deleted many iterations before.

Environment

None

AFFECTED CS IDs

CS0043586

Attachments

Linked issues

blocks

PS-9174

Analyze and backport bug fixes from MySQL 8.0.37 (PS 5.7.44-50 post-EOL 2)

Activity

Show:

Dmitry Lenev June 13, 2024 at 12:39 PM

Hello!

Fixes for the problem were pushed to 8.0.37 and 8.4.0 based release branches:
https://github.com/percona/percona-server/pull/5315
https://github.com/percona/percona-server/pull/5316

Dmitry Lenev March 6, 2024 at 4:25 PM

Reported problem Upstream as

Dmitry Lenev February 23, 2024 at 1:43 PM

Hello!

Our investigation shows that the problem observed is caused by the following scenario.

SELECT reading row versions for the middle record iterates over the PK in descending order, i.e. backward order for the index.
At some point it needs to move cursor used for this iteration from the current page to the previous one in B-tree.
To do this it has to latch previous page first. However, it can’t be done in straightforward way since normal latching
order for the pages in B-tree corresponds to forward iteration. So it releases latch on the current page it has,
and then tries to latch the previous page first and then current page, doing some checks in the process which
should detect that pages and relations between them are still valid. If they are valid this is an optimistic case and
cursor can continue its iteration by switching to previous page if it exists.
If they are not valid cursor resorts to pessimistic scenario in which it tries to restore its position by finding in
B-tree record it has last seen.

If at the point when latches are released merge of previous page into the current one happen, then records
from the previous page are moved across the cursor position. Once cursor iteration resumes, it won’t detect
this fact and will optimistically continue its iteration in backward direction, effectively skipping the records
which were moved. So these records will be omitted from SELECT output.
While pessimistic restore of cursor position by B-tree search would detect these new records and continue iteration
through them.

The same problem doesn’t affect forward iteration over the index as in this case we do not release latch
on the current page before moving to the next one, but rather acquire latch on the next page first, then
move the cursor to it and then only unlatch the former current page. So potential merge of the next page
into current one will be always blocked by these latches.

Dmitry Lenev February 22, 2024 at 8:14 AM

Hello!

Forgot to mention that MySQL 5.7.44 is affected as well.

Dmitry Lenev February 19, 2024 at 7:33 AM

Hello!

I have the following observation while debugging this issue. The problem doesn’t seem repeatable for me (or at least repeatable that easily) if I use descending indexes for primary key. I.e.:

So perhaps descending indexes can be used as a workaround (assuming that there is no need to scan through PK in ascending order)?