Upgrading Cluster Fails When Dataset Has Large Number Of Tables

Description

We hit this issue while upgrading our staging cluster from 8.0.25-15.1 to 8.0.29-21.1. It affects operator versions 1.11.0 and 1.12.0.

When the operator replaces the first pod with one running the new version, the pod fails to start and gets stuck in a loop where it is restarted every 120 seconds. The problem is this code from pxc-entrypoint.sh:
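
In rough shape, it is a bounded wait: the entrypoint gives mysqld a fixed number of one-second attempts to come up and exits when they run out, so the kubelet restarts the container and the cycle repeats. A minimal sketch of that pattern (the probe command, socket path, and exact structure are assumptions, not the shipped script):

```bash
# Sketch of the bounded startup wait described above; not the actual
# pxc-entrypoint.sh source. Probe command and socket path are assumptions.
for i in {120..0}; do
    if mysqladmin --socket=/tmp/mysql.sock ping &>/dev/null; then
        break          # mysqld is up; the entrypoint continues normally
    fi
    sleep 1            # one-second ticks: 120 of them is the whole budget
done
if [ "$i" = 0 ]; then
    echo >&2 'mysqld did not become ready within 120 seconds'
    exit 1             # kubelet restarts the container; the cycle repeats
fi
```

A fixed budget like this fails structurally here: the upgrade work scales with the number of tables, so no constant is safe for every dataset.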

In the case of our staging cluster, 120 seconds is not enough time for the database to come up during an upgrade. We had to increase the value to 420 seconds to resolve the issue. However, that is not really a good solution, because the amount of time needed grows with the number of tables in the dataset. Our staging cluster has 188,846 tables; our production cluster has over 300,000 tables, so 420 seconds surely won't be enough time there.

Also, just to be clear, this issue does not occur with operator 1.10.0, as the code referenced above does not exist in that version.

Environment

None

AFFECTED CS IDs

CS0046363

is blocked by

Activity

Eleonora Zinchenko December 12, 2024 at 3:45 PM

Hi,

Verified.

ege.gunes December 12, 2024 at 3:26 PM

The problem described in the description should have been fixed since v1.14.0.

In v1.16.0, we implemented a state monitor using MySQL’s NOTIFY_SOCKET and improved our liveness probes.
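
For context on the mechanism: when the NOTIFY_SOCKET environment variable points at a Unix datagram socket, mysqld sends systemd-style notifications to it, including READY=1 once startup is complete, so a supervisor can wait for an explicit readiness signal instead of polling against a timeout. A minimal illustration of the idea, assuming socat is available (this is not the operator's actual state monitor):

```bash
# Illustration only, not the operator's state monitor. The socket path
# and the use of socat are assumptions.
export NOTIFY_SOCKET=/var/lib/mysql/notify.sock

# Create the datagram socket and print every notification mysqld sends.
socat -u UNIX-RECVFROM:"${NOTIFY_SOCKET}",fork STDOUT &

# mysqld inherits NOTIFY_SOCKET and reports its state through it; once
# startup (including any upgrade work) finishes, it emits "READY=1".
mysqld --user=mysql &

# A supervisor blocking on "READY=1" waits exactly as long as the upgrade
# takes, with no fixed 120-second budget.
```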

Nickolay Ihalainen December 4, 2024 at 10:48 AM

I’ve tried the main branch (1.16.0):

During startup, the liveness check failed:

After 10 minutes, the server restarted:

The liveness check contains:

If I run another cluster with exactly the same setup but without tables, there is no container restart.

Kamil Holubicki September 19, 2024 at 12:51 PM

Please use the “socket approach” described in and confirm if it solves the problem in production.

Nickolay Ihalainen May 14, 2024 at 8:36 AM

The workaround (a session sketch follows the list) is:

  1. exec into the pxc pod

  2. execute pkill -STOP sleep

  3. wait until mysql is ready to accept connections

  4. execute pkill -CONT sleep
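
A rough transcript of those steps (pod name, container name, and socket path are assumptions for illustration):

```bash
# Step 1: exec into the pxc pod (names are assumptions).
kubectl exec -it cluster1-pxc-0 -c pxc -- bash

# Step 2: freeze the sleep that drives the entrypoint's 120-second
# countdown so it can never reach its restart deadline.
pkill -STOP sleep

# Step 3: poll until mysqld accepts connections. Each `sleep 5` below is
# a new process, so it is unaffected by the earlier SIGSTOP.
until mysqladmin --socket=/var/lib/mysql/mysql.sock ping &>/dev/null; do
    sleep 5
done

# Step 4: resume the entrypoint's sleep; its loop now sees a live server.
pkill -CONT sleep
```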

It’s important to have the workaround described in the documentation, because we may still need to upgrade to a version that does not support this feature. The obvious workaround could be:

Add a check for a sleep-forever file and make the loop endless, instead of 120 iterations, if the file exists.
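
A hedged sketch of that change (flag path, probe, and structure are assumptions, not a patch to the shipped script):

```bash
# Sketch: wait forever while the sleep-forever flag file exists,
# otherwise keep the original 120-attempt cap.
tries=120
until mysqladmin --socket=/tmp/mysql.sock ping &>/dev/null; do
    if [ ! -f /var/lib/mysql/sleep-forever ]; then
        tries=$((tries - 1))
        if [ "$tries" -le 0 ]; then
            echo >&2 'mysqld did not become ready in time'
            exit 1
        fi
    fi
    sleep 1
done
```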

Done

Details

Assignee

Reporter

Needs QA

Yes

Fix versions

Affects versions

Priority


Created March 15, 2023 at 7:21 PM
Updated January 1, 2025 at 3:51 PM
Resolved December 12, 2024 at 3:45 PM