Upgrading Cluster Fails When Dataset Has Large Number Of Tables
Description
We hit this issue while upgrading our staging cluster from 8.0.25-15.1 to 8.0.29-21.1. It affects operator versions 1.11.0 and 1.12.0.
When the operator replaces the first pod with one running the new version, the pod fails to start and gets stuck in a loop, restarting every 120 seconds. The problem is this code from pxc-entrypoint.sh:
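(The snippet itself is not reproduced in this report. As a rough, hypothetical sketch of the pattern it describes — a bounded wait loop that gives mysqld a fixed 120-second budget to become ready before the entrypoint gives up and the container is restarted — it might look like the following; paths and commands are illustrative, not the operator's verbatim code.)

```bash
# Hypothetical sketch only, not the actual pxc-entrypoint.sh code.
# Give mysqld at most 120 one-second attempts to accept connections.
for i in {1..120}; do
    if mysqladmin --socket=/tmp/mysql.sock ping >/dev/null 2>&1; then
        break
    fi
    sleep 1
done

# If the budget is exhausted, the script fails and Kubernetes restarts the
# container, which produces the restart-every-120-seconds loop seen during
# a long upgrade.
if ! mysqladmin --socket=/tmp/mysql.sock ping >/dev/null 2>&1; then
    echo "mysqld did not become ready within 120 seconds" >&2
    exit 1
fi
```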
In the case of our staging cluster, 120 seconds is not enough time for the database to come up during an upgrade. We had to increase the value to 420 seconds to resolve the issue. However, that is not a good solution, because the time needed grows with the number of tables in the dataset. Our staging cluster has 188,846 tables; our production cluster has over 300,000, so 420 seconds certainly won't be enough there.
Also, just to be clear, this issue does not occur with operator 1.10.0, as the code referenced above does not exist in that version.
The problem described in the description should be fixed as of v1.14.0.
In v1.16.0, we implemented a state monitor using MySQL’s NOTIFY_SOCKET and improved our liveness probes.
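As a rough illustration of that mechanism (not the operator's actual implementation): when NOTIFY_SOCKET points at a Unix datagram socket, mysqld 8.0 reports its lifecycle state via sd_notify-style messages, so a monitor can wait for READY=1 instead of guessing a timeout. The paths and the use of socat below are assumptions.

```bash
# Illustrative sketch of waiting on MySQL's NOTIFY_SOCKET (sd_notify) messages.
SOCK=/tmp/mysqld-notify.sock        # illustrative path
LOG=/tmp/mysqld-notify.log
rm -f "$SOCK" "$LOG"

# Listen for notification datagrams from mysqld and append them to a log.
socat -u "UNIX-RECVFROM:$SOCK,fork" STDOUT >> "$LOG" &

# Start mysqld pointed at the notification socket.
NOTIFY_SOCKET="$SOCK" mysqld --user=mysql &

# Block until READY=1 shows up, however long startup or upgrade takes.
until grep -q 'READY=1' "$LOG" 2>/dev/null; do
    sleep 1
done
echo "mysqld reported READY=1"
```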
Nickolay Ihalainen December 4, 2024 at 10:48 AM
I’ve tried the main branch (1.16.0):
During startup, the liveness check failed:
After 10 minutes, the server restarted:
The liveness check contains:
If I run another cluster with exactly the same setup but without tables, there is no container restart:
Kamil Holubicki September 19, 2024 at 12:51 PM
Please use the “socket approach” described in and confirm if it solves the problem in production.
Nickolay Ihalainen May 14, 2024 at 8:36 AM
The workaround is (an example command sequence follows the list):
1. exec into the pxc pod
2. execute pkill -STOP sleep
3. wait until mysql is ready to accept connections
4. execute pkill -CONT sleep
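For example (the namespace, pod, and container names below are placeholders for your own cluster):

```bash
# Step 1: exec into the pxc pod (names are placeholders)
kubectl -n pxc exec -it cluster1-pxc-0 -c pxc -- bash

# Step 2: pause the entrypoint's sleep so the startup countdown stops advancing
pkill -STOP sleep

# Step 3: wait until mysqld accepts connections (this may take a long time
# with hundreds of thousands of tables)
until mysqladmin ping >/dev/null 2>&1; do sleep 5; done

# Step 4: resume the sleep so the entrypoint continues normally
pkill -CONT sleep
```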
It’s important to have the workaround described in the documentation, because we may still need to upgrade to a version that does not support this feature. The obvious fix could be:
Add a check for the sleep-forever file and make the loop endless (instead of 120 iterations) if the file exists, as sketched below.
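A minimal sketch of that idea, assuming the sleep-forever file lives under the data directory (the path and surrounding structure are illustrative, not a patch against the real pxc-entrypoint.sh):

```bash
# Illustrative sketch of the suggested change.
if [ -f /var/lib/mysql/sleep-forever ]; then
    # Marker file present: wait for mysqld indefinitely.
    until mysqladmin ping >/dev/null 2>&1; do
        sleep 1
    done
else
    # Default behaviour: bounded 120-second wait.
    for i in {1..120}; do
        mysqladmin ping >/dev/null 2>&1 && break
        sleep 1
    done
fi
```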