Async replication threads on a PXC node hang when resumed after an applier error has been fixed

Description

  1. Set up an async replication channel on the PXC node, with parallel workers and SPCO (replica_preserve_commit_order) ON, using a Percona Server (PS) instance as the source. (Refer to the steps in PXC-4173. Branch: https://github.com/venkatesh-prasad-v/percona-xtradb-cluster/tree/8.0-PXC-4173-monitor )

  2. Connect to the PXC node where the replication threads are running and create a database, say db1.

  3. Create the same database db1 on the source PS server. Verify that replication fails on the PXC node with an applier error.

  4. Drop the database db1 on the PXC node to fix the applier error, then start the replica.

  5. The replication threads start successfully but then hang: they do not execute the CREATE DATABASE db1 from the source server, nor any subsequent transaction. The likely cause is that the new code has already advanced the monitor seqno past the failed transaction, so the monitor waits for the transaction with the next sequence number, while the server's SPCO wait is waiting for the CREATE DATABASE db1 transaction to commit.


    mysql> show processlist;
    +----+-----------------+-----------------+------+---------+------+---------------------------------------------+------------------+---------+-----------+---------------+
    | Id | User            | Host            | db   | Command | Time | State                                       | Info             | Time_ms | Rows_sent | Rows_examined |
    +----+-----------------+-----------------+------+---------+------+---------------------------------------------+------------------+---------+-----------+---------------+
    |  1 | system user     |                 | NULL | Sleep   | 2523 | wsrep: aborter idle                         | NULL             | 2522711 |         0 |             0 |
    |  2 | system user     |                 | NULL | Sleep   | 2523 | innobase_commit_low (-1)                    | NULL             | 2522711 |         0 |             0 |
    |  7 | event_scheduler | localhost       | NULL | Daemon  | 2520 | Waiting on empty queue                      | NULL             | 2519595 |         0 |             0 |
    | 10 | system user     |                 | NULL | Sleep   | 2520 | wsrep: applier idle                         | NULL             | 2519584 |         0 |             0 |
    | 23 | system user     | connecting host | NULL | Connect | 2459 | Waiting for source to send event            | NULL             | 2458514 |         0 |             0 |
    | 31 | system user     |                 | NULL | Query   | 2300 | Waiting for dependent transaction to commit | NULL             | 2299344 |         0 |             0 |
    | 32 | system user     |                 | db1  | Connect | 2392 | checking permissions                        | NULL             | 2339362 |         0 |             0 |
    | 33 | system user     |                 | NULL | Connect | 2340 | Waiting for an event from Coordinator       | NULL             | 2339397 |         0 |             0 |
    | 34 | system user     |                 | NULL | Connect | 2340 | Waiting for an event from Coordinator       | NULL             | 2339391 |         0 |             0 |
    | 35 | system user     |                 | NULL | Connect | 2340 | Waiting for an event from Coordinator       | NULL             | 2339386 |         0 |             0 |
    | 37 | root            | localhost       | NULL | Query   |    0 | init                                        | show processlist |       1 |         0 |             0 |
    +----+-----------------+-----------------+------+---------+------+---------------------------------------------+------------------+---------+-----------+---------------+
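
The suspected circular wait from step 5 can be illustrated with a small toy model. This is only a sketch of the hypothesis stated in this report, not PXC or Galera code: the class `ApplierModel`, its fields, and the seqno values are all hypothetical names chosen for illustration. It assumes two gates, a monitor that admits only the seqno it expects next, and SPCO, which lets only the oldest pending transaction commit.

```python
from dataclasses import dataclass, field


@dataclass
class ApplierModel:
    """Toy model of two commit-ordering gates (hypothetical, not PXC code)."""

    # Seqno that the (hypothetical) monitor will admit next.
    monitor_next: int
    # Pending transaction seqnos, in source commit order.
    pending: list = field(default_factory=list)

    def run(self) -> list:
        """Commit as many transactions as both gates allow; return their seqnos."""
        committed = []
        while self.pending:
            # SPCO gate: only the oldest pending transaction may commit.
            head = self.pending[0]
            # Monitor gate: only the expected seqno may enter.
            if head != self.monitor_next:
                # The head cannot pass the monitor, and later transactions
                # cannot pass SPCO, so nothing can make progress.
                break
            committed.append(self.pending.pop(0))
            self.monitor_next += 1
        return committed


# Healthy restart: the monitor still expects the failed transaction (seqno 10).
healthy = ApplierModel(monitor_next=10, pending=[10, 11, 12])
healthy_result = healthy.run()

# Suspected bug: the monitor was already advanced past the failed transaction.
stalled = ApplierModel(monitor_next=11, pending=[10, 11, 12])
stalled_result = stalled.run()
```

In the second case no transaction commits and all three remain pending, which would be consistent with the coordinator sitting in "Waiting for dependent transaction to commit" while the workers stay idle. Whether this matches the actual code path in the PXC-4173 branch is an assumption.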

Environment

None

Activity

Venkatesh Prasad November 11, 2024 at 4:13 AM

Fixed now

Done

Created September 24, 2024 at 11:27 AM
Updated April 16, 2025 at 5:34 PM
Resolved January 22, 2025 at 7:24 AM