Restart DB Cluster restarts only one PXC server and one proxySQL, not all of them

Description

Impact on the user:

  • Restarting the DB Cluster does not restart every pod, so the user does not get the result they expected from the restart

Steps to reproduce:

  1. Set up a PXC cluster in PMM

  2. Wait until it is ready

  3. Restart it

  4. Wait until it has restarted

  5. Check the pod status in k8s

Actual result:
spronin-test-proxysql-0   3/3   Running             6   13h
spronin-test-proxysql-1   3/3   Running             0   13h
spronin-test-proxysql-2   0/3   ContainerCreating   0   1s
spronin-test-pxc-0        1/1   Running             0   13h
spronin-test-pxc-1        1/1   Running             8   13h
spronin-test-pxc-2        1/1   Terminating         0   2m29s

As you can see, only one PXC pod and one ProxySQL pod were started recently; the others have been up for 13h.

Expected Result:
All pods have recent uptime

Workaround:
Restart the pods manually with kubectl

Suggested Implementation:

  • When dbaas-controller receives a Restart request:

  • Pause the DB cluster

  • Wait until the DB cluster is paused

    • Start a new goroutine to resume the DB cluster

  • Return the response

Possible issues:

  • If dbaas-controller itself restarts while a DB cluster restart is in progress, the DB cluster may get stuck in the paused state

  • After the DB cluster is paused, it may still report an active status for a few seconds (PMM-7397)

Details:

The database cluster restart was implemented with kubectl rollout restart. That is good enough for alpha, but not for beta: data can be lost this way, and the operators team confirmed it is not safe.

For beta1, we should implement restart via full cluster pause/resume: https://www.percona.com/doc/kubernetes-operator-for-pxc/pause.html
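
As a rough illustration of what pause/resume means at the API level: the linked operator documentation describes pausing through the `pause` key in the cluster custom resource `spec`. Assuming that field, a JSON merge patch body toggling it could be built as below. `pausePatch` is a hypothetical helper, and how the patch is sent to the Kubernetes API is left out.

```go
package main

import "encoding/json"

// pausePatch builds a JSON merge patch body that sets spec.pause on the
// cluster custom resource. The field name is taken from the operator's
// pause documentation; treat it as an assumption to verify.
func pausePatch(paused bool) ([]byte, error) {
	patch := map[string]any{
		"spec": map[string]any{"pause": paused},
	}
	return json.Marshal(patch)
}
```

Calling `pausePatch(true)` gives the body for the pause step, and `pausePatch(false)` the body for the resume step.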

The same functionality will be available for PSMDB.

How to test

Follow the steps to reproduce. With the fix in place, the issue should no longer occur: all pods should show a recent uptime.

How to document

None

Activity

Andrei Minkin January 9, 2023 at 3:06 PM

It was done during the and

Andrei Minkin November 24, 2022 at 3:59 PM

The fix: https://github.com/percona/dbaas-operator/commit/e4bcd7db7c51280988a821a67681a27a220db397 I'll apply the same pattern to psmdb clusters as well. Waiting for     to be fixed

Diogo Recharte November 9, 2022 at 2:26 PM

This bug will be fixed with the architectural changes done in .

Alexey Palazhchenko January 20, 2021 at 4:35 PM

We will use "pause" functionality as discussed earlier at and other places.

Sergey Pronin January 20, 2021 at 6:09 AM

Why don't we use 'pause' functionality of the operator for this? The same you use for suspend/resume.

Done

Details

Needs QA

Yes

Created January 19, 2021 at 9:43 PM
Updated March 6, 2024 at 3:20 AM
Resolved January 10, 2023 at 11:13 AM