Issue when re-joining node after clean shutdown + long hangs

Description

I have the following setup for cluster-to-cluster replication:

(Source Legacy) (MySQL 5.7.36-1debian10) (master for replica) gtid_mode=ON_PERMISSIVE

(Source Cluster) 4 Nodes Galera Cluster (xtradb 8.0.2-16) (node 3 is slave, node 4 is master replica) gtid_mode=ON

(Target Cluster) 3 Nodes Galera Cluster (xtradb 8.0.2-16) (node 3 is slave replica) gtid_mode=ON

Whenever I cleanly shut down a node on the target cluster, the node never re-joins the cluster on the first try. Every time, it hangs, stuck on JOINED, until I restart the node again; then it takes a bit of time and shows SYNCED.

 

I have reproduced this on two completely independent target clusters. The problem is reproduced every time.

If I stop the replica before attempting a restart, the node will sometimes re-join on the first attempt, but replication then stays stuck for a while before it starts replicating again.

I have not been able to reproduce this on 5.7.36-39.

Delay between CLOSED and SYNCED on 8.0.x (ignoring the 1st attempt, which failed):
2022-04-19T20:04:13.737230Z 0 [Note] [MY-000000] [Galera] Shifting OPEN -> CLOSED (TO: 136473017)
. . .
2022-04-19T20:11:35.043387Z 0 [Note] [MY-000000] [Galera] Shifting JOINED -> SYNCED (TO: 136568537)

Delay between CLOSED and SYNCED on 5.7.x (on the 1st attempt):
2022-04-19T20:23:46.663530Z 0 [Note] WSREP: Shifting OPEN -> CLOSED (TO: 3960502)
. . .
2022-04-19T20:24:17.521708Z 0 [Note] WSREP: Shifting JOINED -> SYNCED (TO: 3960515)

 

Delay between CLOSED and SYNCED on 8.0.x (with replication stopped, 1st attempt):
2022-04-19T20:36:15.172049Z 0 [Note] [MY-000000] [Galera] Shifting OPEN -> CLOSED (TO: 136852454)
. . .
2022-04-19T20:36:44.884421Z 0 [Note] [MY-000000] [Galera] Shifting JOINED -> SYNCED (TO: 136852456)
. . .
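To put numbers on the three excerpts above, here is a minimal Python sketch that parses the "Shifting" log lines and computes the time between the CLOSED and SYNCED transitions. The helper names (`parse_shift`, `rejoin_delay`) and the scenario labels are made up for illustration, and reading the TO value as a sequence number is my assumption.

```python
import re
from datetime import datetime

# Log lines copied from the excerpts above; the dictionary keys are just labels.
SHIFT_LINES = {
    "8.0.x, replication running, 2nd attempt": (
        "2022-04-19T20:04:13.737230Z 0 [Note] [MY-000000] [Galera] Shifting OPEN -> CLOSED (TO: 136473017)",
        "2022-04-19T20:11:35.043387Z 0 [Note] [MY-000000] [Galera] Shifting JOINED -> SYNCED (TO: 136568537)",
    ),
    "5.7.x, 1st attempt": (
        "2022-04-19T20:23:46.663530Z 0 [Note] WSREP: Shifting OPEN -> CLOSED (TO: 3960502)",
        "2022-04-19T20:24:17.521708Z 0 [Note] WSREP: Shifting JOINED -> SYNCED (TO: 3960515)",
    ),
    "8.0.x, replication stopped, 1st attempt": (
        "2022-04-19T20:36:15.172049Z 0 [Note] [MY-000000] [Galera] Shifting OPEN -> CLOSED (TO: 136852454)",
        "2022-04-19T20:36:44.884421Z 0 [Note] [MY-000000] [Galera] Shifting JOINED -> SYNCED (TO: 136852456)",
    ),
}

# Matches both the 5.7 ("WSREP:") and 8.0 ("[Galera]") line formats.
SHIFT_RE = re.compile(
    r"^(\S+)Z \d+ \[Note\].*Shifting (\w+) -> (\w+) \(TO: (\d+)\)"
)


def parse_shift(line):
    """Return (timestamp, from_state, to_state, to_value) for one log line."""
    match = SHIFT_RE.match(line)
    if match is None:
        raise ValueError(f"not a state-shift line: {line!r}")
    timestamp = datetime.strptime(match.group(1), "%Y-%m-%dT%H:%M:%S.%f")
    return timestamp, match.group(2), match.group(3), int(match.group(4))


def rejoin_delay(closed_line, synced_line):
    """Return (seconds elapsed, TO difference) between the two transitions."""
    closed_ts, _, _, closed_to = parse_shift(closed_line)
    synced_ts, _, _, synced_to = parse_shift(synced_line)
    # Assumption: TO is a monotonically increasing sequence number.
    return (synced_ts - closed_ts).total_seconds(), synced_to - closed_to


if __name__ == "__main__":
    for label, (closed_line, synced_line) in SHIFT_LINES.items():
        seconds, to_diff = rejoin_delay(closed_line, synced_line)
        print(f"{label}: {seconds:.1f} s, TO advanced by {to_diff}")
```

With these lines the delays come out to roughly 441.3 s (about 7 min 21 s) for 8.0.x with replication running, 30.9 s for 5.7.x, and 29.7 s for 8.0.x with replication stopped.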
When replication is started again, I get the following warning (multiple times) and replication does not propagate (it hangs) until I restart that node again. The same warning also appears on the 1st attempt when replication is left running.

2022-04-19T20:38:29.305201Z 10 [Warning] [MY-000000] [Galera] trx protocol version: 5 does not match certification protocol version: -1
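Because the warning repeats, counting it per restart attempt may make comparisons easier. The following is only a sketch: `count_protocol_mismatches` is a hypothetical helper, the regular expression is derived from the single line quoted above, and the sample input contains just that line rather than a real log.

```python
import re
from collections import Counter

# Pattern derived from the one warning line quoted above.
MISMATCH_RE = re.compile(
    r"trx protocol version: (-?\d+) does not match "
    r"certification protocol version: (-?\d+)"
)


def count_protocol_mismatches(lines):
    """Count warnings, keyed by (trx version, certification version)."""
    counts = Counter()
    for line in lines:
        match = MISMATCH_RE.search(line)
        if match:
            counts[(int(match.group(1)), int(match.group(2)))] += 1
    return counts


if __name__ == "__main__":
    sample = [
        "2022-04-19T20:38:29.305201Z 10 [Warning] [MY-000000] [Galera] "
        "trx protocol version: 5 does not match certification protocol version: -1",
    ]
    for (trx, cert), n in count_protocol_mismatches(sample).items():
        print(f"trx={trx} cert={cert}: {n} occurrence(s)")
```

To run it against a real error log, pass an open file object instead of `sample`; the log path depends on the installation.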

I will post configs and logs in a separate update.

Environment

Debian Buster, latest updates, no firewall.

Attachments

10

Activity

Aaditya Dubey March 26, 2024 at 8:35 AM

Hi

Due to other priority tasks, we have not been able to look into this further. We will take a look and update you shortly.

Marc Bernard March 22, 2024 at 3:53 PM

It is quite disappointing that nobody even took time to look at the error code to point to why it is crashing.

I understand EOL, but there are still millions of people using 5.7, and a lot of systems will continue using 5.7 for a while.

Aaditya Dubey January 15, 2024 at 2:59 PM

Hi

I am extremely sorry that you did not get the required feedback. However, I am working on reproducing this, so that once it is reproduced and fixed you can move to 8.0.
Thank you for connecting with Percona!

Marc Bernard August 9, 2023 at 5:34 PM

Unfortunately, I have abandoned MySQL 8 and have since reverted to 5.7 due to the lack of feedback here.

Aaditya Dubey January 19, 2023 at 10:06 AM

Hi,

Please let me know if the issue persists.

Details

Created April 19, 2022 at 8:56 PM
Updated September 7, 2024 at 3:28 PM