semi-sync master waits for ack when semi-sync replica is down & when we replace it with async
Activity
Venkatesh Prasad August 16, 2023 at 11:52 AM
Pull Request: https://github.com/percona/percona-server/pull/5104
Kelong Wang August 9, 2023 at 7:12 AM
I saw a very similar issue: semi-sync master waits for ack when a lone semi-sync replica restarts semi-sync.
Version: 8.0.32 or 8.0.33. I didn't test with 8.0.31. Works fine with 8.0.30 or older.
Setup: single master replicating to a single replica with semi-sync running (GTID based, source_auto_position=1, a high rpl_semi_sync_master_timeout).
Steps:
Disable semi-sync on replica, restart replication.
Write a txn on master; it blocks in the state "Waiting for semi-sync ACK from slave" as expected. The txn is executed on the replica per Executed_Gtid_Set.
Re-enable semi-sync on replica, restart replication. The txn on master is still pending.
A new txn on master unblocks everything and the replication starts moving ahead. But this new txn is not necessary in 8.0.30 or older.
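A minimal sketch of the steps above, written as shell helpers that only print the SQL for each step. It assumes the older plugin variable name rpl_semi_sync_slave_enabled (with the semisync_replica plugin the variable is rpl_semi_sync_replica_enabled). The function names and the default database name are hypothetical, and the sketch does not connect to any server.

```shell
#!/bin/sh
# Sketch only: each helper prints SQL; nothing here talks to a server.

# Steps 1 and 3: switch semi-sync on the replica and restart replication
# so that the replica reconnects to the master with the new setting.
replica_semi_sync_sql() {
  # $1 is ON or OFF
  printf 'SET GLOBAL rpl_semi_sync_slave_enabled = %s;\n' "$1"
  printf 'STOP REPLICA;\n'
  printf 'START REPLICA;\n'
}

# Step 2: a write on the master that is expected to block waiting for an ACK.
# The default database name is a made-up placeholder.
master_write_sql() {
  printf 'CREATE DATABASE %s;\n' "${1:-semi_sync_probe}"
}
```

The output of replica_semi_sync_sql OFF, and later replica_semi_sync_sql ON, would be piped into a client connected to the replica, and the output of master_write_sql into a client connected to the master.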
Venkatesh Prasad August 8, 2023 at 5:34 PM (edited)
I was able to reproduce the issue. It looks like I needed to start the slave on the node that was working in async mode (replica 1). After making replica 1 connect as a semi-sync replica, the create database query kept hanging despite the receiver thread having been started on node1.
However this issue is specific only to GTID (source_auto_position=1) based replication and traditional file and pos based replication is unaffected.
On doing a quick analysis, I found that this is a regression from upstream commit https://github.com/mysql/mysql-server/commit/650d2f7bf275e40fd605df39ab09949aa54f5e3f, which is present in 8.0.31+.
Venkatesh Prasad August 8, 2023 at 11:41 AM
I also tested with the GTID protocol (source_auto_position=1) and did not see any issue there either.
1. Create database query waiting for about 60 seconds in "Waiting for semi-sync ACK from replica" state
source> show processlist;
+----+-----------------+-----------------+------+------------------+------+-----------------------------------------------------------------+-------------------------+---------+-----------+---------------+
| Id | User            | Host            | db   | Command          | Time | State                                                           | Info                    | Time_ms | Rows_sent | Rows_examined |
+----+-----------------+-----------------+------+------------------+------+-----------------------------------------------------------------+-------------------------+---------+-----------+---------------+
|  5 | event_scheduler | localhost       | NULL | Daemon           | 3867 | Waiting on empty queue                                          | NULL                    | 3867297 |         0 |             0 |
| 19 | root            | localhost:52506 | NULL | Query            |    0 | init                                                            | show processlist        |       0 |         0 |             0 |
| 32 | root            | localhost:57710 | NULL | Query            |   59 | Waiting for semi-sync ACK from replica                          | create database testdb1 |   59703 |         0 |             0 |
| 55 | root            | localhost:60918 | NULL | Binlog Dump GTID |   60 | Source has sent all binlog to replica; waiting for more updates | NULL                    |   60577 |         0 |             0 |
+----+-----------------+-----------------+------+------------------+------+-----------------------------------------------------------------+-------------------------+---------+-----------+---------------+
4 rows in set (0.00 sec)
2. Starting replication on node2
replica2> set global rpl_semi_sync_replica_enabled=ON;
Query OK, 0 rows affected (0.00 sec)
replica2> start slave;
Query OK, 0 rows affected, 1 warning (0.05 sec)
3. Create database query succeeded
source> create database testdb1;
Query OK, 1 row affected (1 min 6.22 sec)
Venkatesh Prasad August 8, 2023 at 11:31 AM (edited)
I also tested with the semisync_source and semisync_replica plugins and did not see any issue there either.
1. Create database query waiting for more than 30 seconds in "Waiting for semi-sync ACK from replica" state
source> show processlist;
+----+-----------------+-----------------+------+-------------+------+-----------------------------------------------------------------+---------------------+---------+-----------+---------------+
| Id | User            | Host            | db   | Command     | Time | State                                                           | Info                | Time_ms | Rows_sent | Rows_examined |
+----+-----------------+-----------------+------+-------------+------+-----------------------------------------------------------------+---------------------+---------+-----------+---------------+
|  5 | event_scheduler | localhost       | NULL | Daemon      | 3265 | Waiting on empty queue                                          | NULL                | 3265229 |         0 |             0 |
| 19 | root            | localhost:52506 | NULL | Query       |    0 | init                                                            | show processlist    |       0 |         0 |             0 |
| 32 | root            | localhost:57710 | NULL | Query       |   31 | Waiting for semi-sync ACK from replica                          | create database db1 |   31412 |         0 |             0 |
| 44 | root            | localhost:51654 | NULL | Binlog Dump |   37 | Source has sent all binlog to replica; waiting for more updates | NULL                |   37554 |         0 |             0 |
+----+-----------------+-----------------+------+-------------+------+-----------------------------------------------------------------+---------------------+---------+-----------+---------------+
2. Starting replication on node2
replica2> set global rpl_semi_sync_replica_enabled=ON;
Query OK, 0 rows affected (0.00 sec)
replica2> start slave;
Query OK, 0 rows affected, 1 warning (0.05 sec)
3. Create database query succeeded
source> create database db1;
Query OK, 1 row affected (33.99 sec)

The scenario below works fine in 8.0.30 but not in 8.0.31 and above. The same bug is also reported for upstream MySQL: https://bugs.mysql.com/bug.php?id=110127
1) Semi-sync replication is enabled on 1 master and 2 replicas.
2) rpl_semi_sync_source_wait_for_replica_count is left at its default value of 1.
3) Make one of the nodes an async replica by disabling the semi-sync parameter on it.
4) Stop the other semi-sync replica.
5) Set a high value for rpl_semi_sync_master_timeout so that the master keeps waiting instead of falling back from semi-sync to async.
6) Try to run any SQL statement on the master. It will wait for an ack from any one replica.
7) Now re-enable semi-sync on the node where we disabled it.
set global rpl_semi_sync_slave_enabled=ON;
Once one of the replicas is enabled as a semi-sync replica and can acknowledge the transaction, the master should proceed.
This works fine in 8.0.30 but does not work in 8.0.31 and above.
Steps to reproduce -
1) Use dbdeployer to create semi-sync test instances (master, node1 and node2).
dbdeployer deploy replication --nodes=3 --topology=master-slave --gtid --semi-sync --read-only-slaves 8.0.33 --force --port 5800
2) Stop one of the nodes, treating it as a crashed node or a node under maintenance.
./node2/stop
3) Now convert node1 into an async replica so that there is no semi-sync replica to acknowledge the transaction.
set global rpl_semi_sync_slave_enabled = OFF; stop replica; start replica;
4) Connect to the master, increase the timeout to a large value so that it does not fall back from semi-sync to async, and then try to run some statement.
set global rpl_semi_sync_master_timeout = 36000000; create database test1;
The thread will be stuck in the waiting-for-ack state:
| 22 | msandbox | localhost | NULL | Query | 8 | Waiting for semi-sync ACK from slave | create database test4 | 8098 | 0 | 0 |
5) Connect to node2 and enable semi-sync on it.
set global rpl_semi_sync_slave_enabled=ON; stop replica;start replica;
As soon as we enable semi-sync on any of the replicas, the master should move ahead with its transactions, since rpl_semi_sync_source_wait_for_replica_count = 1.
This works fine on 8.0.30, but in later versions it does not behave as it should.
The master stays stuck in the state "Waiting for semi-sync ACK from slave" even after the semi-sync replica is started.
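While the statement is stuck, the semi-sync status counters on the master can help show whether the reconnected replica has been registered as a semi-sync client. The sketch below only prints the SQL to run. The function name is hypothetical, and the LIKE patterns are chosen so that they match both the older Rpl_semi_sync_master_* names and the newer Rpl_semi_sync_source_* names.

```shell
#!/bin/sh
# Sketch only: prints diagnostic SQL for the master; it does not connect
# to any server.
master_semi_sync_status_sql() {
  # Status counters, e.g. Rpl_semi_sync_master_clients (connected semi-sync
  # replicas) and Rpl_semi_sync_master_wait_sessions (sessions waiting for
  # an ACK).
  printf "SHOW GLOBAL STATUS LIKE 'Rpl_semi_sync%%';\n"
  # Current settings, including the timeout and the wait-for-replica count.
  printf "SHOW GLOBAL VARIABLES LIKE 'rpl_semi_sync%%';\n"
}
```

The printed statements would be piped into a client connected to the master from a second session, since the first session is blocked on the pending write.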
Expected -
When an async replica is converted into a semi-sync replica, the master should resume processing transactions when rpl_semi_sync_source_wait_for_replica_count = 1, because the semi-sync replica will acknowledge the transaction.