Updating PostgreSQL database from 11 to 14 may fail if supervisor is able to start PostgreSQL first before the updater
Description
How to test
How to document
AFFECTED CS IDs
Activity
Roma Novikov January 19, 2024 at 1:43 PM
For better visibility the details about such updates added to our documentation: https://docs.percona.com/percona-monitoring-and-management/how-to/upgrade.html#update-the-server
Jaime Sicam January 10, 2024 at 9:15 AM
I think I understand why PostgreSQL keeps on restarting even after the container image has been changed. When the container image is changed from say 2.23.0 to 2.41.0, at the beginning, /srv/postgres14 doesn’t exist. So when PMM server is started, supervisor will attempt to restart PostgreSQL automatically several times until it gives up because the directory does not exist. Meanwhile, the Ansible playbook will try to stop both pmm-managed and postgresql if /srv/postgresql14 does not exist. This operation will be successful but still supervisor will keep on attempting to restart PostgreSQL. Then, the playbook will attempt to create /srv/postgres14 and initialize it. If supervisor is able to start PostgreSQL before the Ansible playbook does either when performing the backup or restoring the backup then the data migration will fail and will be skipped when PMM attempts to restart the update process again because /srv/postgres14 now exists.
To reproduce:
Create a playbook for stopping pmm-managed and postgresql:
cat stop.yaml
-
hosts: localhost
tasks:
- name: Stop pmm-managed and postgres before backup database
supervisorctl:
name: "{{ item }}"
state: stopped
loop:
- pmm-managed
- postgresql
Stop postgresql and move postgres14 to another directory
supervisorctl stop postgresql
mv /srv/postgres14 /srv/postgres14backup
Attempt to start PostgreSQL with supervisor (It is a simulation as if the container just started running)
supervisorctl start postgresql
Attempt to stop PostgreSQL from playbook:
ansible-playbook stop.yaml
Check /srv/logs/supervisord.log and verify that the playbook couldn’t stop supervisor from restarting PostgreSQL:
2024-01-10 09:09:58,205 INFO spawned: 'postgresql' with pid 13884
2024-01-10 09:09:58,224 INFO exited: postgresql (exit status 2; not expected)
2024-01-10 09:10:00,005 INFO spawned: 'postgresql' with pid 13885
2024-01-10 09:10:00,018 INFO exited: postgresql (exit status 2; not expected)
2024-01-10 09:10:02,800 INFO spawned: 'postgresql' with pid 13894
2024-01-10 09:10:02,808 INFO exited: postgresql (exit status 2; not expected)
2024-01-10 09:10:06,125 INFO spawned: 'postgresql' with pid 13902
2024-01-10 09:10:06,133 INFO exited: postgresql (exit status 2; not expected)
2024-01-10 09:10:10,436 INFO spawned: 'postgresql' with pid 13919
2024-01-10 09:10:10,444 INFO exited: postgresql (exit status 2; not expected)
2024-01-10 09:10:16,117 INFO spawned: 'postgresql' with pid 13929
2024-01-10 09:10:16,125 INFO exited: postgresql (exit status 2; not expected)
2024-01-10 09:10:22,685 INFO spawned: 'postgresql' with pid 13988
2024-01-10 09:10:22,694 INFO exited: postgresql (exit status 2; not expected)
2024-01-10 09:10:24,336 INFO waiting for pmm-managed to stop
2024-01-10 09:10:24,343 INFO stopped: pmm-managed (exit status 0) -- this is from the playbook
2024-01-10 09:10:24,343 INFO reaped unknown pid 14026
2024-01-10 09:10:30,002 INFO spawned: 'postgresql' with pid 14070 -- still supervisor still attempts to restart postgres
2024-01-10 09:10:30,011 INFO exited: postgresql (exit status 2; not expected)
2024-01-10 09:10:38,109 INFO spawned: 'postgresql' with pid 14095
2024-01-10 09:10:38,118 INFO exited: postgresql (exit status 2; not expected)
2024-01-10 09:10:47,171 INFO spawned: 'postgresql' with pid 14120
2024-01-10 09:10:47,185 INFO exited: postgresql (exit status 2; not expected)
2024-01-10 09:10:57,864 INFO spawned: 'postgresql' with pid 14153
2024-01-10 09:10:57,873 INFO exited: postgresql (exit status 2; not expected)
2024-01-10 09:10:57,873 INFO gave up: postgresql entered FATAL state, too many start retries too quickly
Given this, it doesn’t matter what docker version is used as upgrade could potentially fail based on timing. I think the workaround is to wait until supervisor reaches the point where it will give up restarting PostgreSQL and then PMM can start updating the database.
Details
Assignee
UnassignedUnassignedReporter
Jaime SicamJaime SicamPriority
MediumComponents
Labels
Needs QA
YesNeeds Doc
Yes
Details
Details
Assignee
Reporter
Priority
Components
Labels
Needs QA
Needs Doc
Smart Checklist
Open Smart Checklist
Smart Checklist
Open Smart Checklist
Smart Checklist

I cannot reproduce this with docker-ce. To reproduce in CentOS 7, please install stock docker:
yum -y install docker systemctl enable docker
Install old version of PMM and then upgrade it to the latest version. You may need to perform this several times for the error to appear:
docker rm -f pmm-server pmm-data docker create --volume /srv --name pmm-data percona/pmm-server:2.23.0 docker run --detach --restart always --publish 443:443 --publish 80:80 --volumes-from pmm-data --name pmm-server percona/pmm-server:2.23.0 sleep 30 docker stop pmm-server docker rm pmm-server docker run --detach --restart always --publish 443:443 --publish 80:80 --volumes-from pmm-data --name pmm-server percona/pmm-server:2.41.0-el7
The logs will show that PostgreSQL 14 failed to start during the upgrade because it was started by supervisor.
TASK [Stop pmm-managed and postgres before backup database | EL7] ************** changed: [localhost] => (item=pmm-managed) ok: [localhost] => (item=postgresql) TASK [Stop pmm-managed and postgres before backup database | EL9] ************** skipping: [localhost] => (item=pmm-managed) skipping: [localhost] => (item=postgresql) TASK [postgres : Run Postgres database without supervisor] ********************* changed: [localhost] TASK [postgres : Dump pmm-managed database] ************************************ changed: [localhost] TASK [postgres : Stop Postgres database without supervisor] ******************** changed: [localhost] TASK [postgres : Start Postgres 14 database without supervisor] **************** fatal: [localhost]: FAILED! => {"changed": true, "cmd": ["/usr/pgsql-14/bin/pg_ctl", "start", "-D", "/srv/postgres14", "-o", "-c logging_collector=off"], "delta": "0:00:00.540724", "end": "2024-01-09 07:40:41.022439", "msg": "non-zero return code", "rc": 1, "start": "2024-01-09 07:40:40.481715", "stderr": "pg_ctl: another server might be running; trying to start server anyway\npg_ctl: could not start server\nExamine the log output.", "stderr_lines": ["pg_ctl: another server might be running; trying to start server anyway", "pg_ctl: could not start server", "Examine the log output."], "stdout": "waiting for server to start....2024-01-09 07:40:40.923 UTC [1662] FATAL: lock file \"postmaster.pid\" already exists\n2024-01-09 07:40:40.923 UTC [1662] HINT: Is another postmaster (PID 1629) running in data directory \"/srv/postgres14\"?\n stopped waiting", "stdout_lines": ["waiting for server to start....2024-01-09 07:40:40.923 UTC [1662] FATAL: lock file \"postmaster.pid\" already exists", "2024-01-09 07:40:40.923 UTC [1662] HINT: Is another postmaster (PID 1629) running in data directory \"/srv/postgres14\"?", " stopped waiting"]}
Because /srv/postgres14 exists, the database migration process will be skipped on the next iteration of updating PMM:
# cat /srv/logs/pmm-update-perform-init.log|grep -A 1 -B 1 Restore TASK [postgres : Restore pmm-managed database] ********************************* skipping: [localhost] -- TASK [postgres : Restore pmm-managed database] ********************************* skipping: [localhost]
You can see from supervisord.log that PostgreSQL is being restarted by supervisor:
cat /srv/logs/supervisord.log|grep postgres 2024-01-09 07:39:36,030 INFO spawned: 'postgresql' with pid 14 2024-01-09 07:39:37,046 INFO success: postgresql entered RUNNING state, process has stayed up for > than 1 seconds (startsecs) 2024-01-09 07:40:04,952 INFO waiting for alertmanager, postgresql, pmm-agent, victoriametrics, grafana, cron, nginx, pmm-managed, qan-api2, vmalert, clickhouse to die 2024-01-09 07:40:08,040 INFO waiting for postgresql, clickhouse to die 2024-01-09 07:40:08,078 INFO stopped: postgresql (exit status 0) 2024-01-09 07:40:09,943 INFO spawned: 'postgresql' with pid 16 2024-01-09 07:40:10,179 INFO exited: postgresql (exit status 2; not expected) 2024-01-09 07:40:11,220 INFO spawned: 'postgresql' with pid 95 2024-01-09 07:40:11,276 INFO exited: postgresql (exit status 2; not expected) 2024-01-09 07:40:13,479 INFO spawned: 'postgresql' with pid 418 2024-01-09 07:40:13,516 INFO exited: postgresql (exit status 2; not expected) 2024-01-09 07:40:17,335 INFO spawned: 'postgresql' with pid 591 2024-01-09 07:40:17,374 INFO exited: postgresql (exit status 2; not expected) 2024-01-09 07:40:21,505 INFO spawned: 'postgresql' with pid 919 2024-01-09 07:40:21,536 INFO exited: postgresql (exit status 2; not expected) 2024-01-09 07:40:26,542 INFO spawned: 'postgresql' with pid 1098 2024-01-09 07:40:26,554 INFO exited: postgresql (exit status 2; not expected) 2024-01-09 07:40:32,894 INFO spawned: 'postgresql' with pid 1373 2024-01-09 07:40:32,909 INFO exited: postgresql (exit status 2; not expected) 2024-01-09 07:40:40,215 INFO spawned: 'postgresql' with pid 1629 2024-01-09 07:40:42,117 INFO success: postgresql entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
However, notice that it has not reached fatal state before not autostarting Postgresql for good.
Eg.
2024-01-09 07:51:22,037 INFO gave up: postgresql entered FATAL state, too many start retries too quickly
Perhaps the workaround is to add a delay before starting PostgreSQL without supervisor to allow supervisor to mark the postgresql as FATAL and ensure that no other instance of Postgres is running:
- name: Pause for 1 minute before starting Postgres 11 ansible.builtin.pause minutes: 1 - name: Run Postgres database without supervisor command: /usr/pgsql-11/bin/pg_ctl start -D /srv/postgres become: true become_user: postgres become_method: su when: is_upgrade.stat.exists - name: Dump pmm-managed database postgresql_db: name: pmm-managed state: dump target: /tmp/pmm-managed-upgrade.sql when: is_upgrade.stat.exists - name: Stop Postgres database without supervisor command: /usr/pgsql-11/bin/pg_ctl stop -D /srv/postgres become: true become_user: postgres become_method: su when: is_upgrade.stat.exists - name: Pause for 1 minute before starting Postgres 14 ansible.builtin.pause minutes: 1 - name: Start Postgres 14 database without supervisor command: /usr/pgsql-14/bin/pg_ctl start -D /srv/postgres14 -o "-c logging_collector=off" become: true become_user: postgres become_method: su