Upgrading Cluster Fails When Dataset Has Large Number Of Tables
Description
We hit this issue while upgrading our staging cluster from 8.0.25-15.1 to 8.0.29-21.1. It affects operator versions 1.11.0 and 1.12.0.
When the operator replaces the first pod with one running the new version, the pod fails to start and gets stuck in a loop, restarting every 120 seconds. The problem is this code from pxc-entrypoint.sh:
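(The snippet itself is not reproduced in this report. As a rough, hypothetical sketch of the pattern it describes — a bounded wait loop that gives mysqld a fixed 120-second budget to become ready before the entrypoint gives up and the container is restarted — it might look like the following; paths and commands are illustrative, not the operator's verbatim code.)

```bash
# Hypothetical sketch only, not the actual pxc-entrypoint.sh code.
# Give mysqld at most 120 one-second attempts to accept connections.
for i in {1..120}; do
    if mysqladmin --socket=/tmp/mysql.sock ping >/dev/null 2>&1; then
        break
    fi
    sleep 1
done

# If the budget is exhausted, the script fails and Kubernetes restarts the
# container, which produces the restart-every-120-seconds loop seen during
# a long upgrade.
if ! mysqladmin --socket=/tmp/mysql.sock ping >/dev/null 2>&1; then
    echo "mysqld did not become ready within 120 seconds" >&2
    exit 1
fi
```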
In the case of our staging cluster, 120 seconds is not enough time for the database to come up during an upgrade. We had to increase the value to 420 seconds to resolve the issue. However, that is not a good solution, because the time needed grows with the number of tables in the dataset. Our staging cluster has 188,846 tables; our production cluster has over 300,000, so 420 seconds certainly won't be enough there.
Also, just to be clear, this issue does not occur with operator 1.10.0, as the code referenced above does not exist in that version.
The problem described in the description should be fixed as of v1.14.0.
In v1.16.0, we implemented a state monitor using MySQL’s NOTIFY_SOCKET and improved our liveness probes.
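As a rough illustration of that mechanism (not the operator's actual implementation): when NOTIFY_SOCKET points at a Unix datagram socket, mysqld 8.0 reports its lifecycle state via sd_notify-style messages, so a monitor can wait for READY=1 instead of guessing a timeout. The paths and the use of socat below are assumptions.

```bash
# Illustrative sketch of waiting on MySQL's NOTIFY_SOCKET (sd_notify) messages.
SOCK=/tmp/mysqld-notify.sock        # illustrative path
LOG=/tmp/mysqld-notify.log
rm -f "$SOCK" "$LOG"

# Listen for notification datagrams from mysqld and append them to a log.
socat -u "UNIX-RECVFROM:$SOCK,fork" STDOUT >> "$LOG" &

# Start mysqld pointed at the notification socket.
NOTIFY_SOCKET="$SOCK" mysqld --user=mysql &

# Block until READY=1 shows up, however long startup or upgrade takes.
until grep -q 'READY=1' "$LOG" 2>/dev/null; do
    sleep 1
done
echo "mysqld reported READY=1"
```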
Nickolay Ihalainen December 4, 2024 at 10:48 AM
I’ve tried the main branch (1.16.0):
During startup, the liveness check failed:
After 10 minutes, the server restarted:
The liveness check contains:
If I run another cluster with exactly the same setup but without tables, there is no container restart:
Kamil Holubicki September 19, 2024 at 12:51 PM
Please use the “socket approach” described in and confirm if it solves the problem in production.
Nickolay Ihalainen May 14, 2024 at 8:36 AM
The workaround is (an example command sequence follows the list):
1. exec into the pxc pod
2. execute pkill -STOP sleep
3. wait until mysql is ready to accept connections
4. execute pkill -CONT sleep
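For example (the namespace, pod, and container names below are placeholders for your own cluster):

```bash
# Step 1: exec into the pxc pod (names are placeholders)
kubectl -n pxc exec -it cluster1-pxc-0 -c pxc -- bash

# Step 2: pause the entrypoint's sleep so the startup countdown stops advancing
pkill -STOP sleep

# Step 3: wait until mysqld accepts connections (this may take a long time
# with hundreds of thousands of tables)
until mysqladmin ping >/dev/null 2>&1; do sleep 5; done

# Step 4: resume the sleep so the entrypoint continues normally
pkill -CONT sleep
```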
It’s important to have the workaround described in the documentation, because we may still need to upgrade to a version that does not support this feature. The obvious fix could be:
Add a check for the sleep-forever file and make the loop endless (instead of 120 iterations) if the file exists, as sketched below.
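A minimal sketch of that idea, assuming the sleep-forever file lives under the data directory (the path and surrounding structure are illustrative, not a patch against the real pxc-entrypoint.sh):

```bash
# Illustrative sketch of the suggested change.
if [ -f /var/lib/mysql/sleep-forever ]; then
    # Marker file present: wait for mysqld indefinitely.
    until mysqladmin ping >/dev/null 2>&1; do
        sleep 1
    done
else
    # Default behaviour: bounded 120-second wait.
    for i in {1..120}; do
        mysqladmin ping >/dev/null 2>&1 && break
        sleep 1
    done
fi
```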