Updating the PostgreSQL database from 11 to 14 may fail if supervisor starts PostgreSQL before the updater does

Description

I cannot reproduce this with docker-ce. To reproduce on CentOS 7, install the stock docker package:

yum -y install docker
systemctl enable docker

Install an old version of PMM and then upgrade it to the latest version. You may need to repeat this several times for the error to appear:

docker rm -f pmm-server pmm-data
docker create --volume /srv --name pmm-data percona/pmm-server:2.23.0
docker run --detach --restart always --publish 443:443 --publish 80:80 --volumes-from pmm-data --name pmm-server percona/pmm-server:2.23.0
sleep 30
docker stop pmm-server
docker rm pmm-server
docker run --detach --restart always --publish 443:443 --publish 80:80 --volumes-from pmm-data --name pmm-server percona/pmm-server:2.41.0-el7

The logs will show that PostgreSQL 14 failed to start during the upgrade because it had already been started by supervisor.

TASK [Stop pmm-managed and postgres before backup database | EL7] **************
changed: [localhost] => (item=pmm-managed)
ok: [localhost] => (item=postgresql)
TASK [Stop pmm-managed and postgres before backup database | EL9] **************
skipping: [localhost] => (item=pmm-managed)
skipping: [localhost] => (item=postgresql)
TASK [postgres : Run Postgres database without supervisor] *********************
changed: [localhost]
TASK [postgres : Dump pmm-managed database] ************************************
changed: [localhost]
TASK [postgres : Stop Postgres database without supervisor] ********************
changed: [localhost]
TASK [postgres : Start Postgres 14 database without supervisor] ****************
fatal: [localhost]: FAILED! => {"changed": true, "cmd": ["/usr/pgsql-14/bin/pg_ctl", "start", "-D", "/srv/postgres14", "-o", "-c logging_collector=off"], "delta": "0:00:00.540724", "end": "2024-01-09 07:40:41.022439", "msg": "non-zero return code", "rc": 1, "start": "2024-01-09 07:40:40.481715", "stderr": "pg_ctl: another server might be running; trying to start server anyway\npg_ctl: could not start server\nExamine the log output.", "stderr_lines": ["pg_ctl: another server might be running; trying to start server anyway", "pg_ctl: could not start server", "Examine the log output."], "stdout": "waiting for server to start....2024-01-09 07:40:40.923 UTC [1662] FATAL: lock file \"postmaster.pid\" already exists\n2024-01-09 07:40:40.923 UTC [1662] HINT: Is another postmaster (PID 1629) running in data directory \"/srv/postgres14\"?\n stopped waiting", "stdout_lines": ["waiting for server to start....2024-01-09 07:40:40.923 UTC [1662] FATAL: lock file \"postmaster.pid\" already exists", "2024-01-09 07:40:40.923 UTC [1662] HINT: Is another postmaster (PID 1629) running in data directory \"/srv/postgres14\"?", " stopped waiting"]}
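The postmaster.pid collision above could be detected before calling pg_ctl. A minimal sketch of such a check (the helper name and layout are my own, not part of PMM's playbooks):

```shell
#!/bin/sh
# pg_running DATADIR -- hypothetical helper: succeed only if the postmaster
# whose PID is recorded in DATADIR/postmaster.pid is still alive.
pg_running() {
    pidfile="$1/postmaster.pid"
    [ -f "$pidfile" ] || return 1        # no lock file at all
    pid=$(head -n 1 "$pidfile")          # first line of the lock file is the PID
    kill -0 "$pid" 2>/dev/null           # signal 0: liveness check, sends nothing
}
```

The playbook could call such a check and wait (or fail early with a clear message) instead of letting pg_ctl trip over the lock file held by the supervisor-spawned postmaster.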

Because /srv/postgres14 now exists, the database migration is skipped on the next attempt to update PMM:

# cat /srv/logs/pmm-update-perform-init.log | grep -A 1 -B 1 Restore
TASK [postgres : Restore pmm-managed database] *********************************
skipping: [localhost]
--
TASK [postgres : Restore pmm-managed database] *********************************
skipping: [localhost]
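The skip happens because the playbook keys the restore on the existence of /srv/postgres14. The guard boils down to something like this sketch (the function name is illustrative, not PMM's actual code):

```shell
#!/bin/sh
# migration_needed DATADIR -- succeed if the Postgres 14 data directory does
# not exist yet, i.e. the dump/restore migration still has to run.
migration_needed() {
    [ ! -d "$1" ]
}
```

Because the failed run above left /srv/postgres14 behind (created and initialized, but never restored into), this check fails on the next attempt and the restore tasks are skipped.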

You can see from supervisord.log that PostgreSQL is being restarted by supervisor:

cat /srv/logs/supervisord.log | grep postgres
2024-01-09 07:39:36,030 INFO spawned: 'postgresql' with pid 14
2024-01-09 07:39:37,046 INFO success: postgresql entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2024-01-09 07:40:04,952 INFO waiting for alertmanager, postgresql, pmm-agent, victoriametrics, grafana, cron, nginx, pmm-managed, qan-api2, vmalert, clickhouse to die
2024-01-09 07:40:08,040 INFO waiting for postgresql, clickhouse to die
2024-01-09 07:40:08,078 INFO stopped: postgresql (exit status 0)
2024-01-09 07:40:09,943 INFO spawned: 'postgresql' with pid 16
2024-01-09 07:40:10,179 INFO exited: postgresql (exit status 2; not expected)
2024-01-09 07:40:11,220 INFO spawned: 'postgresql' with pid 95
2024-01-09 07:40:11,276 INFO exited: postgresql (exit status 2; not expected)
2024-01-09 07:40:13,479 INFO spawned: 'postgresql' with pid 418
2024-01-09 07:40:13,516 INFO exited: postgresql (exit status 2; not expected)
2024-01-09 07:40:17,335 INFO spawned: 'postgresql' with pid 591
2024-01-09 07:40:17,374 INFO exited: postgresql (exit status 2; not expected)
2024-01-09 07:40:21,505 INFO spawned: 'postgresql' with pid 919
2024-01-09 07:40:21,536 INFO exited: postgresql (exit status 2; not expected)
2024-01-09 07:40:26,542 INFO spawned: 'postgresql' with pid 1098
2024-01-09 07:40:26,554 INFO exited: postgresql (exit status 2; not expected)
2024-01-09 07:40:32,894 INFO spawned: 'postgresql' with pid 1373
2024-01-09 07:40:32,909 INFO exited: postgresql (exit status 2; not expected)
2024-01-09 07:40:40,215 INFO spawned: 'postgresql' with pid 1629
2024-01-09 07:40:42,117 INFO success: postgresql entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)

However, note that at this point supervisor had not yet reached the FATAL state, after which it gives up autostarting postgresql for good. That only happened later, e.g.:

2024-01-09 07:51:22,037 INFO gave up: postgresql entered FATAL state, too many start retries too quickly

Perhaps the workaround is to add a delay before starting PostgreSQL without supervisor, giving supervisor time to mark postgresql as FATAL and ensuring that no other instance of Postgres is running:

- name: Pause for 1 minute before starting Postgres 11
  ansible.builtin.pause:
    minutes: 1
- name: Run Postgres database without supervisor
  command: /usr/pgsql-11/bin/pg_ctl start -D /srv/postgres
  become: true
  become_user: postgres
  become_method: su
  when: is_upgrade.stat.exists
- name: Dump pmm-managed database
  postgresql_db:
    name: pmm-managed
    state: dump
    target: /tmp/pmm-managed-upgrade.sql
  when: is_upgrade.stat.exists
- name: Stop Postgres database without supervisor
  command: /usr/pgsql-11/bin/pg_ctl stop -D /srv/postgres
  become: true
  become_user: postgres
  become_method: su
  when: is_upgrade.stat.exists
- name: Pause for 1 minute before starting Postgres 14
  ansible.builtin.pause:
    minutes: 1
- name: Start Postgres 14 database without supervisor
  command: /usr/pgsql-14/bin/pg_ctl start -D /srv/postgres14 -o "-c logging_collector=off"
  become: true
  become_user: postgres
  become_method: su
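How long that pause needs to be depends on supervisor's restart policy for the postgresql program: with autorestart enabled, supervisor respawns a failed process with an increasing backoff until startretries consecutive failed starts push it into FATAL. A supervisord program section along these lines controls that behavior (the values here are illustrative assumptions, not necessarily PMM's actual configuration):

```
[program:postgresql]
command = /usr/pgsql-14/bin/postmaster -D /srv/postgres14
autorestart = true
startretries = 3   ; consecutive failed starts before supervisor gives up (FATAL)
startsecs = 1      ; process must stay up this long to count as a successful start
```

Note that any start that survives startsecs resets the retry counter, which is why a long sequence of spawn/exit pairs can run for many minutes before FATAL is reached.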

How to test

None

How to document

None

AFFECTED CS IDs

CS0042644

Activity

Roma Novikov January 19, 2024 at 1:43 PM

For better visibility, details about such updates have been added to our documentation: https://docs.percona.com/percona-monitoring-and-management/how-to/upgrade.html#update-the-server

Jaime Sicam January 10, 2024 at 9:15 AM

I think I understand why PostgreSQL keeps restarting even after the container image has been changed. When the image is changed from, say, 2.23.0 to 2.41.0, /srv/postgres14 does not exist at first. So when the PMM server starts, supervisor attempts to restart PostgreSQL automatically several times, failing each time because the data directory does not exist. Meanwhile, the Ansible playbook tries to stop both pmm-managed and postgresql if /srv/postgres14 does not exist. The stop operation succeeds, but supervisor keeps attempting to restart PostgreSQL anyway. The playbook then creates /srv/postgres14 and initializes it. If supervisor manages to start PostgreSQL before the playbook does, either while taking the backup or while restoring it, the data migration fails, and on the next attempt to restart the update process it is skipped because /srv/postgres14 now exists.

To reproduce:

Create a playbook for stopping pmm-managed and postgresql:

cat stop.yaml
- hosts: localhost
  tasks:
    - name: Stop pmm-managed and postgres before backup database
      supervisorctl:
        name: "{{ item }}"
        state: stopped
      loop:
        - pmm-managed
        - postgresql

Stop postgresql and move postgres14 to another directory:

supervisorctl stop postgresql
mv /srv/postgres14 /srv/postgres14backup

Attempt to start PostgreSQL with supervisor (this simulates the container having just started):

supervisorctl start postgresql

Attempt to stop PostgreSQL from playbook:

ansible-playbook stop.yaml

Check /srv/logs/supervisord.log and verify that the playbook couldn’t stop supervisor from restarting PostgreSQL:

2024-01-10 09:09:58,205 INFO spawned: 'postgresql' with pid 13884
2024-01-10 09:09:58,224 INFO exited: postgresql (exit status 2; not expected)
2024-01-10 09:10:00,005 INFO spawned: 'postgresql' with pid 13885
2024-01-10 09:10:00,018 INFO exited: postgresql (exit status 2; not expected)
2024-01-10 09:10:02,800 INFO spawned: 'postgresql' with pid 13894
2024-01-10 09:10:02,808 INFO exited: postgresql (exit status 2; not expected)
2024-01-10 09:10:06,125 INFO spawned: 'postgresql' with pid 13902
2024-01-10 09:10:06,133 INFO exited: postgresql (exit status 2; not expected)
2024-01-10 09:10:10,436 INFO spawned: 'postgresql' with pid 13919
2024-01-10 09:10:10,444 INFO exited: postgresql (exit status 2; not expected)
2024-01-10 09:10:16,117 INFO spawned: 'postgresql' with pid 13929
2024-01-10 09:10:16,125 INFO exited: postgresql (exit status 2; not expected)
2024-01-10 09:10:22,685 INFO spawned: 'postgresql' with pid 13988
2024-01-10 09:10:22,694 INFO exited: postgresql (exit status 2; not expected)
2024-01-10 09:10:24,336 INFO waiting for pmm-managed to stop
2024-01-10 09:10:24,343 INFO stopped: pmm-managed (exit status 0) -- this is from the playbook
2024-01-10 09:10:24,343 INFO reaped unknown pid 14026
2024-01-10 09:10:30,002 INFO spawned: 'postgresql' with pid 14070 -- supervisor still attempts to restart postgres
2024-01-10 09:10:30,011 INFO exited: postgresql (exit status 2; not expected)
2024-01-10 09:10:38,109 INFO spawned: 'postgresql' with pid 14095
2024-01-10 09:10:38,118 INFO exited: postgresql (exit status 2; not expected)
2024-01-10 09:10:47,171 INFO spawned: 'postgresql' with pid 14120
2024-01-10 09:10:47,185 INFO exited: postgresql (exit status 2; not expected)
2024-01-10 09:10:57,864 INFO spawned: 'postgresql' with pid 14153
2024-01-10 09:10:57,873 INFO exited: postgresql (exit status 2; not expected)
2024-01-10 09:10:57,873 INFO gave up: postgresql entered FATAL state, too many start retries too quickly

Given this, it doesn't matter which Docker version is used; the upgrade can fail either way depending on timing. I think the workaround is to wait until supervisor reaches the point where it gives up restarting PostgreSQL, and only then let PMM start updating the database.
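Rather than a fixed pause, the updater could poll supervisorctl until postgresql actually reports FATAL. A rough sketch of that wait, assuming supervisorctl's usual "name STATE ..." status column layout (the function names are mine, not part of PMM):

```shell
#!/bin/sh
# state_of -- parse the state column (RUNNING, FATAL, STOPPED, ...) from a
# `supervisorctl status` line fed on stdin.
state_of() {
    awk '{print $2}'
}

# wait_for_state SERVICE STATE [TRIES] -- poll supervisorctl every 2 seconds
# until SERVICE reports STATE, or give up after TRIES polls (default 60).
wait_for_state() {
    service="$1"; want="$2"; tries="${3:-60}"
    while [ "$tries" -gt 0 ]; do
        state=$(supervisorctl status "$service" | state_of)
        [ "$state" = "$want" ] && return 0
        tries=$((tries - 1))
        sleep 2
    done
    return 1
}
```

Usage would be something like: `wait_for_state postgresql FATAL && /usr/pgsql-14/bin/pg_ctl start -D /srv/postgres14`, which removes the timing guesswork that a fixed one-minute pause leaves in place.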

Done

Details

Needs QA

Yes

Needs Doc

Yes

Created January 9, 2024 at 9:25 AM
Updated March 5, 2024 at 9:53 PM
Resolved February 1, 2024 at 11:26 AM
