Duplicate auto_increment values set for concurrent sessions on cluster reconfiguration

General

Escalation

Description

For a given example table:

During a test with multiple concurrent inserts, where the primary auto-increment column is left unset and the other unique columns always have unique values, conflicting auto_increment values are occasionally assigned when the cluster is resized (for example, when nodes leave or join).

With INSERT INTO ... ON DUPLICATE KEY UPDATE, this leads to unique rows being overwritten, and hence to data loss!
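
To illustrate why a duplicate auto_increment value turns into silent data loss rather than an error, here is a minimal toy model. It is not MySQL code: the dict stands in for the table keyed by the auto-increment primary key, and `insert_on_duplicate_key_update` is a hypothetical helper that assumes the UPDATE clause overwrites the row's payload.

```python
def insert_on_duplicate_key_update(table, row_id, payload):
    """Toy model: insert the row, or overwrite the payload if row_id exists."""
    overwritten = row_id in table
    table[row_id] = payload
    return overwritten


table = {}

# Two sessions are handed the same auto-increment value (36) for different rows.
first = insert_on_duplicate_key_update(table, 36, "row from session A")
second = insert_on_duplicate_key_update(table, 36, "row from session B")

print(first, second)  # False True
print(table)          # {36: 'row from session B'}
```

The second statement succeeds without any duplicate-key error, and the row written by session A is gone.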

Example binary log event:

Example scripts for test case, error log and binlog attached.

Reproduced on both PXC 5.7.19 and 5.7.21, in 3-node cluster and 5-node cluster.

Environment

None

AFFECTED CS IDs

224754

Attachments

Activity

Krunal Bauskar June 13, 2018 at 10:06 AM

commit aa3fd356ac2fd9e32cf980be2bab830ed7724e09
Merge: d0961888d47 675c863c2ca
Author: Krunal Bauskar <krunal.bauskar@percona.com>
Date: Wed Jun 13 15:32:52 2018 +0530

Merge pull request #650 from percona/pxc-5.6-2128

PXC#2128: Duplicate auto_increment values set for concurrent

commit 675c863c2ca7a44b69399ae3793e92c60ba9a62e

Author: Krunal Bauskar <krunal.bauskar@percona.com>
Date: Tue Jun 12 21:04:15 2018 +0530

PXC#2128: Duplicate auto_increment values set for concurrent
sessions on cluster reconfiguration

Issue:

a node leaving the cluster can cause dynamic adjustment of the
auto-increment increment and offset based on the nodes in the
primary component.

this can affect the auto-inc series, and innodb/mysql will need to
re-calculate the series based on the updated values.

for example, with off/inc = 3/3 the series is, say, 27, 30, 33, 36, 39, ...
but if off/inc changes to 2/2 after 33, the series needs
to be re-evaluated to adjust to the new values.
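
The arithmetic in this example can be sketched as follows. This is a simplified model, not the server's code: `next_in_series` is a hypothetical helper that returns the smallest value strictly greater than a given value of the form offset + k * increment.

```python
def next_in_series(after, offset, increment):
    """Smallest value > after of the form offset + k * increment (k >= 0)."""
    if after < offset:
        return offset
    k = (after - offset) // increment + 1
    return offset + k * increment


# Series on one node with off/inc = 3/3, continuing after 24.
value, old_series = 24, []
for _ in range(5):
    value = next_in_series(value, offset=3, increment=3)
    old_series.append(value)
print(old_series)  # [27, 30, 33, 36, 39]

# Configuration changes to off/inc = 2/2 after 33 has been inserted.
print(next_in_series(33, offset=2, increment=2))  # 34
```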

when the mysql flow detects a decrease in increment it needs to step
back and re-evaluate the series. for example: 33 is inserted and
the next value to insert is 36, but if the configuration changes
then a standalone server will re-construct 33 (36 - 3, the old incr)
and, using 33 as the base, will project the next number in the series
to insert.

in pxc, with wsrep_auto_increment_control=on, if 33 is inserted
and off=3/incr=3 (3-node cluster) then the next number for the said
node is 36. pxc can't backtrack to the old number because that could
generate a number in the range (34, 35), which could conflict with an
insert running on node-2 or node-3, which have 34 and 35 reserved.
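
A small sketch of the two behaviours described here, using the numbers from the example. It is an illustration under assumptions, not the actual server logic: `next_in_series` is a hypothetical helper, and "PXC keeps the reserved value" simply mirrors the description in this commit message.

```python
def next_in_series(after, offset, increment):
    """Smallest value > after of the form offset + k * increment (k >= 0)."""
    if after < offset:
        return offset
    k = (after - offset) // increment + 1
    return offset + k * increment


old_offset, old_increment = 3, 3  # this node in a 3-node cluster
new_offset, new_increment = 2, 2  # configuration after a node leaves
reserved_next = 36                # 33 was inserted, 36 is next in line

# Standalone-style backtracking: rebuild the last inserted value,
# then project the next value using the new configuration.
base = reserved_next - old_increment
standalone_next = next_in_series(base, new_offset, new_increment)

# Values between base and reserved_next belonged to the other nodes
# under the old configuration.
other_nodes_range = list(range(base + 1, reserved_next))

# PXC skips backtracking and keeps the value it had already reserved.
pxc_next = reserved_next

print(standalone_next, other_nodes_range, pxc_next)  # 34 [34, 35] 36
print(standalone_next in other_nodes_range)          # True
```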

pxc skips back-tracking and instead uses the next number in the series.
unfortunately, although this part of the code skipped back-tracking
(readjustment), it continued to re-calibrate the number.
standalone uses 33 as the base and so needs re-calibration, whereas
pxc continues to use 36 as the base; the pxc flow doesn't need
re-calibration because pxc retained the original number.

this re-calibration caused the same value to be generated for the
current insert and for the next-value insert.
(the mysql flow identifies this use-case and moves next-value to the
next legitimate value, but only after the insert is done, on the
success path.)

in the meantime, if another thread uses this next-value and proceeds
with its insert, it will end up inserting next-value (assuming it is
the next value to consume), and the current thread that generated
curr-value = next-value will also insert the same value, resulting
in a duplicate-key error.
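
The interleaving described above can be replayed deterministically with a toy model. Nothing here is taken from the server source: `AutoIncState`, `reserve` and `run` are hypothetical names, and the `buggy_recalibration` flag only models the effect described in this message (next-value left equal to the current value until the insert completes).

```python
class AutoIncState:
    """Toy stand-in for the shared per-table auto-inc counter."""

    def __init__(self, next_value):
        self.next_value = next_value


def reserve(state, increment, buggy_recalibration):
    """Return the value for the current insert and update the shared counter."""
    current = state.next_value
    if buggy_recalibration:
        # Model of the bug: next-value still equals the current value
        # until the insert finishes on the success path.
        state.next_value = current
    else:
        state.next_value = current + increment
    return current


def run(buggy):
    state = AutoIncState(next_value=36)
    a = reserve(state, 2, buggy)  # thread A picks its value, not yet inserted
    b = reserve(state, 2, buggy)  # thread B runs in the gap
    return a, b


print(run(buggy=True))   # (36, 36)
print(run(buggy=False))  # (36, 38)
```

In the buggy run both threads go on to insert 36, which is the duplicate described above.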

Fix:

Since pxc doesn't re-adjust the autoinc on decrease in incr, pxc
should avoid re-calibration.

Krunal Bauskar June 12, 2018 at 1:13 PM

We have successfully reproduced the issue (using a shorter, 10-line test case)
and also found the root cause of the issue.

In short, on an auto-inc configuration change, PXC skips the readjustment of the auto-inc value
because it could conflict with another node's range.

(Say the next auto-inc value is x with incr=y; a standalone server will then use x-y as the base value
and re-calibrate it with the new auto-inc configuration, which could
generate a value in the range (x-y, x-1).)

PXC skipped this part of the code but accidentally retained the calibration
that is done after readjustment. Since readjustment is skipped, there is no
need for re-calibration.

This unneeded re-calibration caused the issue, generating a duplicate
value for the current value and next-value.

---------------

Currently I am working on the fix that will be followed by local-testing.
Once ready I will post another update.

Krunal Bauskar June 11, 2018 at 11:46 AM

Just to post an update ...

We have the setup ready and we are investigating the issue.

We suspect that the code that re-aligns the auto-increment series for Galera on a cluster membership change is causing the issue. In fact, the code was consciously edited by upstream to skip some logic that is normally used in standalone mode (of InnoDB). We continue to evaluate whether this decision to skip the logic makes sense.

Unfortunately, it doesn't look like anything is missing; it is more likely a bug in the existing flow, so we need to understand why the said logic was added by upstream.