GTID holes on cluster peers of async slave node

Description

This is apparently a regression of an old bug: https://jira.percona.com/browse/PXC-1868

In a PXC1<async>PXC2 kind of topology, I reproduced GTID holes on peers of an async slave node, exactly as it was described in that old bug report.
When there was data inconsistency between the two clusters and an event was skipped (an empty transaction injected on the slave node), the other peers ended up with holes in gtid_executed.

Tested on version:

but also observed on PXC 5.6.36.

Environment

None

AFFECTED CS IDs

267275


Activity


Julia Vural March 4, 2025 at 9:28 PM

It appears that this issue is no longer being worked on, so we are closing it for housekeeping purposes. If you believe the issue still exists, please open a new ticket after confirming it's present in the latest release.

Krunal Bauskar October 11, 2019 at 2:44 AM

Thanks for the clarification. We will surely keep it active and look into it, but since it is a user-induced error (given the workaround), I would say the easiest way out is to rely on the default/auto behavior.

  1. PXC doesn't recommend running out-of-order transactions.

  2. If there is such a case, then ignoring those errors on the slave will ensure no GTID holes (see the sketch below).
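
A minimal sketch of the two auto-skip options (5.7-era variable names assumed):

<async slave node>
SET GLOBAL slave_exec_mode = 'IDEMPOTENT';  -- dynamically ignore row-not-found/duplicate-key apply errors
-- slave_skip_errors is not dynamic; it has to be set in my.cnf and needs a restart, e.g.:
--   [mysqld]
--   slave_skip_errors = 1032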

Przemyslaw Malkowski October 10, 2019 at 2:54 PM

Thank you Krunal. Indeed, this is not exactly a regression after all.
The thing is, when a replication error is skipped automatically, either via the slave_skip_errors variable or via slave_exec_mode=IDEMPOTENT, GTID holes are not created, so the fix for the referenced older bug addressed those cases.
The other scenario is when an error is not skipped automatically by those settings, but is instead worked around manually, either with the help of the pt-slave-restart tool or by simply injecting an empty GTID event on the async slave node. This is what I did to reproduce the issue:
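
(Minimal sketch of the injection, with a hypothetical GTID standing in for the transaction that failed to apply:)

<async slave>
STOP SLAVE;
SET GTID_NEXT = '9a511b7b-7059-11e9-9ea3-0242ac110002:15';  -- hypothetical GTID of the skipped event
BEGIN; COMMIT;                                              -- empty transaction committed under that GTID
SET GTID_NEXT = 'AUTOMATIC';
START SLAVE;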

And that action leaves the async slave node OK in terms of GTID sets, but the PXC peer nodes end up with GTID holes:
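
(Illustration only, with a hypothetical UUID and sequence numbers:)

<pxc peer node>
SELECT @@global.gtid_executed;
-- 9a511b7b-7059-11e9-9ea3-0242ac110002:1-14:16-20   (sequence 15 is missing, i.e. a GTID hole)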

Krunal Bauskar October 9, 2019 at 5:47 AM

There are multiple scenarios, so let me list all of them:

Scenario:
-------------

1. independent-master
       <---- async repl channel ----> (pxc-cluster-node:1, pxc-cluster-node:2)

node-1 is an async slave of the independent master, so an out-of-order action on node-1
followed by the same action on the master can cause the async replication channel to break.

2. independent-master
         <---- async repl channel ----> (pxc-cluster-node:1 (with slave-skip-errors=1032), pxc-cluster-node:2)

Same as use-case-1, but since slave-skip-errors=1032 is set, it will not cause the replication channel to break.

(The GTID hole bug was reported against use-case-2; it was solved then and we have an MTR test case for the same,
so use-cases 1/2 are working as expected even now.)


3. (pxc-cluster-1-node-1, pxc-cluster-1-node-2)
         <----- async repl channel -----> (pxc-cluster-2-node-1, pxc-cluster-2-node-2)

Two PXC clusters replicating via an async link; this is a variation of use-case-1, but with PXC clusters on both ends.
As expected, if an out-of-order action is executed on pxc-cluster-2 and then on pxc-cluster-1, the replication channel will break.

4. (pxc-cluster-1-node-1, pxc-cluster-1-node-2)
        <----- async repl channel -----> (pxc-cluster-2-node-1 (with slave-skip-errors=1032), pxc-cluster-2-node-2)

Again a variant of use-case-2: with slave-skip-errors=1032 the async replication channel link does not break, and no GTID gap is observed.

(use-case-3/4 are working as expected)

-------------------------------------

From what I understood, the reporter is referring to use-case-3 or 4 (PXC-cluster-to-cluster async replication: "When there was data inconsistency between two clusters").

1. These bugs are surely different from the originally referenced one, so this is not a regression.
2. At least for now, I couldn't reproduce the said issue.

Can you help clarify with your exact test case and setup?

<master>
create database test;
use test;
create table test.g1 (id int auto_increment primary key);
insert into test.g1 values (null),(null),(null);
select * from test.g1;

<slave>
delete from test.g1 where id=3;
flush logs;

<master>
delete from test.g1 where id=3;
flush logs;
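
With row-based replication, this sequence makes the master's second DELETE fail to apply on the async slave (error 1032, row not found) unless it is skipped. A quick check, as a sketch, is to compare the GTID sets afterwards:

<async slave and pxc peers>
SHOW SLAVE STATUS\G               -- on the async slave: Last_SQL_Errno: 1032 if the error was not skipped
SELECT @@global.gtid_executed;    -- compare across the slave and its PXC peer nodes for holes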

Won't Do

Details

Assignee

Reporter

Time tracking

2h 20m logged

Affects versions

Priority

Smart Checklist

Created October 7, 2019 at 1:54 PM
Updated March 4, 2025 at 9:28 PM
Resolved March 4, 2025 at 9:28 PM