Upgrade from Operator 1.3 to 1.4 is ending up with cluster without any replicas

Description

Independently tested and reproduced the customer case.

Existing cluster with PGO 1.3

$kubectl get pods
NAME READY STATUS RESTARTS AGE
postgres-operator-64b576cf76-4g54w 4/4 Running 0 16m
pgo-deploy--1-9nh9f 0/1 Completed 0 17m
cluster1-backrest-shared-repo-5c78d4fb49-9dnxs 1/1 Running 0 14m
cluster1-pgbouncer-6b9fbb7954-btq2z 1/1 Running 0 12m
cluster1-pgbouncer-6b9fbb7954-5tfdc 1/1 Running 0 12m
cluster1-pgbouncer-6b9fbb7954-xzqq8 1/1 Running 0 12m
backrest-backup-cluster1--1-2mps2 0/1 Completed 0 10m
cluster1-694699d8f4-vmq7b 1/1 Running 0 14m
cluster1-repl1-6b9677bcf7-2pwp9 1/1 Running 0 10m
cluster1-repl2-55c6dd8f5-rjswq 1/1 Running 0 10m

Connect to primary and verify everything

$kubectl exec -it cluster1-694699d8f4-vmq7b -- bash
bash-4.4$ 
bash-4.4$ 
bash-4.4$ psql
psql (14.4)
Type "help" for help.
postgres=# create table t1 (id int,nm varchar);
CREATE TABLE
postgres=# insert into t1 values (1,'Jobin');
INSERT 0 1

- Please note that the PostgreSQL version is 14.4

Pause the cluster

$kubectl edit perconapgclusters.pg.percona.com cluster1 
perconapgcluster.pg.percona.com/cluster1 edited
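
The interactive edit above does not show which field was changed. A minimal non-interactive sketch, assuming the pause is controlled by the `spec.pause` field of the custom resource (an assumption, the ticket does not name the field) and using a hypothetical helper name `set_cluster_pause`:

```shell
# Hypothetical helper: pause or resume a cluster by patching the custom resource.
# Assumption: the 1.x custom resource exposes a boolean spec.pause field.
set_cluster_pause() {
  cluster="$1"   # e.g. cluster1
  paused="$2"    # true or false
  kubectl patch perconapgclusters.pg.percona.com "$cluster" \
    --type=merge -p "{\"spec\":{\"pause\":${paused}}}"
}
```

For example, `set_cluster_pause cluster1 true` would correspond to the pause step, if the field name assumption holds.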

Verify that everything is paused

$kubectl get pods
NAME READY STATUS RESTARTS AGE
postgres-operator-64b576cf76-4g54w 4/4 Running 0 22m
pgo-deploy--1-9nh9f 0/1 Completed 0 23m
backrest-backup-cluster1--1-2mps2 0/1 Completed 0 16m

Delete the following from the existing cluster

kubectl delete \
serviceaccounts/pgo-deployer-sa \
clusterroles/pgo-deployer-cr \
configmaps/pgo-deployer-cm \
configmaps/pgo-config \
clusterrolebindings/pgo-deployer-crb \
jobs.batch/pgo-deploy \
deployment/postgres-operator

Get the files for the Operator 1.4

cd ~/Downloads/
rm -rf percona-postgresql-operator
git clone -b v1.4.0 https://github.com/percona/percona-postgresql-operator
cd percona-postgresql-operator

Edit the `operator.yaml` file of 1.4 to modify `DEPLOY_ACTION` to `update`

vi deploy/operator.yaml

...
      env:
        - name: DEPLOY_ACTION
          value: update
...
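
The manual edit can also be scripted. The sketch below assumes the `value:` line directly follows the `- name: DEPLOY_ACTION` line, as in the fragment above; `set_deploy_action` is a hypothetical helper name, not part of the operator repository.

```shell
# Hypothetical helper: rewrite the value on the line after "name: DEPLOY_ACTION".
# Assumption: "value:" sits on the line immediately following the name entry.
set_deploy_action() {
  file="$1"     # e.g. deploy/operator.yaml
  action="$2"   # e.g. update
  # On the matching line, read the next line (n) and replace its value.
  sed '/name: DEPLOY_ACTION/{n;s/value: .*/value: '"$action"'/;}' "$file" > "$file.tmp" \
    && mv "$file.tmp" "$file"
}
```

Running `set_deploy_action deploy/operator.yaml update` and then `grep -A1 DEPLOY_ACTION deploy/operator.yaml` shows whether the change landed where expected.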

Create the operator

$kubectl apply -f operator.yaml
serviceaccount/pgo-deployer-sa created
clusterrole.rbac.authorization.k8s.io/pgo-deployer-cr created
configmap/pgo-deployer-cm created
clusterrolebinding.rbac.authorization.k8s.io/pgo-deployer-crb created
job.batch/pgo-deploy created

Wait for two minutes and verify that the Operator is ready

sleep 120
kubectl get pods
NAME READY STATUS RESTARTS AGE
backrest-backup-cluster1--1-2mps2 0/1 Completed 0 27m
postgres-operator-6658579bdd-jlm76 4/4 Running 0 2m22s
pgo-deploy--1-qrbzn 0/1 Completed 0 2m59s
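
Instead of a fixed `sleep 120`, the wait can be tied to the deployment rollout. This is a sketch only: it assumes the operator deployment is named `postgres-operator` (the name deleted earlier in these steps) and lives in the current namespace; `wait_for_operator` is a hypothetical helper name.

```shell
# Hypothetical helper: block until the operator deployment has rolled out,
# or fail once the timeout expires.
wait_for_operator() {
  timeout="${1:-180s}"   # default of 180s is an arbitrary choice
  kubectl rollout status deployment/postgres-operator --timeout="$timeout"
}
```

A non-zero exit status from `wait_for_operator` means the rollout did not complete within the timeout.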

Verify that `cr.yaml` references compatible images (PostgreSQL version), then apply the `cr.yaml`

$kubectl apply -f cr.yaml 
perconapgcluster.pg.percona.com/cluster1 configured

Wait for 3 minutes+ and check the status

$kubectl get pods
NAME READY STATUS RESTARTS AGE
backrest-backup-cluster1--1-2mps2 0/1 Completed 0 36m
postgres-operator-6658579bdd-jlm76 4/4 Running 0 11m
pgo-deploy--1-qrbzn 0/1 Completed 0 11m
cluster1-backrest-shared-repo-75f4dd76d8-hs5jr 1/1 Running 0 7m4s
cluster1-pgbouncer-6875db664f-sxn9c 1/1 Running 0 4m19s
cluster1-pgbouncer-6875db664f-nr229 1/1 Running 0 4m19s
cluster1-pgbouncer-6875db664f-6l9dw 1/1 Running 0 4m19s
cluster1-backrest-shared-repo-75f4dd76d8-gdz82 1/1 Running 0 4m19s
cluster1-backrest-shared-repo-75f4dd76d8-7jn4g 1/1 Running 0 4m19s
cluster1-649b45fd6d-5nms5 1/1 Running 0 5m5s

We can see that there are no replica pods.
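
The same check can be scripted so that it is easy to repeat before and after the upgrade. The sketch below assumes replica pods are named `<cluster>-repl<N>-...`, as in the pod listing from the 1.3 cluster; `count_replica_pods` is a hypothetical helper name.

```shell
# Hypothetical helper: count Running replica pods in `kubectl get pods --no-headers`
# output read from stdin.
# Assumption: replica pod names start with "<cluster>-repl".
count_replica_pods() {
  cluster="$1"
  awk -v prefix="${cluster}-repl" '
    index($1, prefix) == 1 && $3 == "Running" { n++ }
    END { print n + 0 }
  '
}
```

Usage: `kubectl get pods --no-headers | count_replica_pods cluster1`. Fed the two listings in this ticket, it reports 2 replicas before the upgrade and 0 after.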

Environment

None

AFFECTED CS IDs

CS0037817

Activity

Slava Sarzhan July 19, 2023 at 1:02 PM

@Jobin Augustine I can confirm that we have a bug with pause in 1.4.0, and we need to fix it. Please use the new version of the official upgrade instructions at https://docs.percona.com/percona-operator-for-postgresql/update.html to update the operator. That procedure upgrades without any downtime.

Jobin Augustine July 18, 2023 at 7:46 AM

Hi @natalia.marukovich
Regarding,
> But why do you use this procedure? We have other steps in the doc https://docs.percona.com/percona-operator-for-postgresql/update.html for version 1.4.0
> Steps from doc work for me.

Does it really work for you? I am surprised. @Lalit Choudhary has already created another ticket for the poor documentation: https://jira.percona.com/browse/K8SPG-403
That official document contains only one of the steps in the entire multi-step manual process for "Upgrading the Operator".
There are pre- and post-steps required. Please see the comments by Lalit and Ege on the ticket https://perconadev.atlassian.net/browse/K8SPG-403#icft=K8SPG-403
The single step mentioned in the official documentation is also part of the steps I documented here. Please see the section: Edit the `operator.yaml` file of 1.4 to modify `DEPLOY_ACTION` to `update`

natalia.marukovich July 17, 2023 at 9:28 AM
Edited

@Jobin Augustine
I'm rechecking your steps. But why do you use this procedure? We have other steps in the doc https://docs.percona.com/percona-operator-for-postgresql/update.html for version 1.4.0
Steps from doc work for me.

Lalit Choudhary July 17, 2023 at 6:04 AM

Workaround: After the upgrade, scale the replicas manually.

Example:

kubectl -n pgo scale --replicas=2 perconapgcluster/cluster1

Done

Details
Assignee
Unassigned
Reporter
Jobin Augustine
Needs QA
Yes
Components
Support Request
Fix versions
1.5.0
Affects versions
1.4.0
Priority
Medium

Created July 14, 2023 at 4:59 AM

Updated March 8, 2024 at 2:05 PM

Resolved August 31, 2023 at 7:45 AM
