Removed Prior Commit to Allow High Priority High Transaction Processing
Description
Environment
blocks
relates to
Smart Checklist
Activity

Zsolt Parragi April 21, 2020 at 4:09 PM
That is correct, the fix was the removal of the disabling code / commit, and currently it is only there in the 8.0 branch, not in 5.7.

Zsolt Parragi April 3, 2020 at 10:01 AMEdited
Looks like there are two different approaches here:
From the Codership side, they fixed high priority transactions (in 5.7), that's commit f2dd26e10785.
From our side, we decided to completely disable high priority transactions also in 5.7, in 5be95d8ff766. We committed this 20 days after codership already fixed the problem, based on the commit dates. But we didn't have that commit merged yet.
Later we merged the codership fix, at which point we should have reverted our disabling workaround. But we did not, we even ported it to 8.0.
Both commits mention the same issues, but the first actually fixes the issue, while the second completely disables the entire functionality.
I also verified that high priority transactions aren't specific to group replication, they are actively used by normal 5.7 replication. This means that using normal replication and PXC together doesn't work correctly.
I think Krunal added the assert(0) because he wasn't able to reach that code. And he wasn't able to reach it, because he disabled HPTs completely.
Based on some initial testing, if I remove both our disabling commit and assertion, everything seems to be working correctly. Same goes for 5.7 too.

Zsolt Parragi April 2, 2020 at 5:40 AMEdited
To clarify: WSREP doesn't depend on high priority transactions itself, this is only an interaction with upstream replication.
After looking at the code, we have 4 possible main scenarios here:
high priority transactions with a PXC server, without running wsrep (using it as a simple mysql server)
a high priority transaction killing a local transaction (example: node1 is part of a GR cluster and a PXC cluster. Node1 receives a transaction from another GR node, which kills a local transaction)
a high priority transaction killing a wsrep slave transaction (example: node1 is part of a GR cluster and a PXC cluster. Node1 is replicating a long running transaction from another pxc node, and this transaction gets killed by a GR high priority transaction)
a wsrep slave transaction (trying to) kill a GR high priority transaction (a long running GR high priority transaction conflicts with an incoming transaction from another wsrep node)
The first works, even without any patching, if I remove the assertion from the code.
The patch missing from 8.0 tries to fix a few corner cases in the second (at least based on the bug description I found), and reports a fatal error for the third. I tried to construct a testcase for 8.0 based on the bug description, and it doesn't seem to do anything: PXC can't handle that scenario.
The error message for the 3rd scenario also seems to be removed in 8.0, no idea why.
As far as I see, the 4th isn't mentioned at all anywhere: when wsrep kills a transaction it expects it to succeed: and either it won't be able to do it, or if it can, GR expects high priority transactions to succeed.
For both the 3rd and 4th, there is no way to handle them correctly, that's why the 5.7 code simply reported an error if it ever happened.
The second probably can be fixed, but will take time, and if we want to support that as a use case, we also have to make sure that 3 and 4 report proper and clear error messages, as those will require manual fixes from the user after happening.
Current status:
ported the 5.7 patch to 8.0, similarly to how internal wsrep transaction killing was ported (doesn't seem to result in any difference)
ported all the internal (single node) high priority transaction MTR testcases to the galera suite: after executing the testcase, these also try to validate the results on a second pxc node, which should have the same data as node1 at the end. This tests the second scenario with some tricky high priority scenarios. (Result: tests hang)
The common root cause in all tests seems to be that if wsrep is running, high priority threads fail to kill the victim thread.
With WSREP running, the innodb locks fail to detect a conflict, and because of this, the transaction isn't killed. Comments in lock0lock.cc even explain that locks are more relaxed for various reasons if wsrep is running. Looks like one downside of these relaxations is a broken conflict detection.
Details
Assignee
UnassignedUnassignedReporter
Zsolt ParragiZsolt ParragiNeeds Review
YesTime tracking
2d 6h 15m loggedFix versions
Priority
Medium
Details
Details
Assignee
Reporter

Needs Review
Time tracking
Fix versions
Priority
Smart Checklist
Open Smart Checklist
Smart Checklist
Open Smart Checklist
Smart Checklist

All the high priority transaction mtr test cases crash in an assertion + todo message when the high priority transaction kills a normal priority transaction, as the related wsrep signal isn't implemented.
https://github.com/percona/percona-xtradb-cluster/blob/8.0/storage/innobase/trx/trx0trx.cc#L3557
The todo comment / assertion was left there by Krunal during the initial galera 4 port.