Updating the PostgreSQL database from 11 to 14 may fail if supervisor starts PostgreSQL before the updater does

Description

I cannot reproduce this with docker-ce. To reproduce on CentOS 7, install the stock docker package:

yum -y install docker
systemctl enable docker

Install an old version of PMM and then upgrade it to the latest version. You may need to repeat this several times for the error to appear:

docker rm -f pmm-server pmm-data
docker create --volume /srv --name pmm-data percona/pmm-server:2.23.0
docker run --detach --restart always --publish 443:443 --publish 80:80 --volumes-from pmm-data --name pmm-server percona/pmm-server:2.23.0
sleep 30
docker stop pmm-server
docker rm pmm-server
docker run --detach --restart always --publish 443:443 --publish 80:80 --volumes-from pmm-data --name pmm-server percona/pmm-server:2.41.0-el7

The logs will show that PostgreSQL 14 failed to start during the upgrade because it had already been started by supervisor.

TASK [Stop pmm-managed and postgres before backup database | EL7] **************
changed: [localhost] => (item=pmm-managed)
ok: [localhost] => (item=postgresql)
TASK [Stop pmm-managed and postgres before backup database | EL9] **************
skipping: [localhost] => (item=pmm-managed)
skipping: [localhost] => (item=postgresql)
TASK [postgres : Run Postgres database without supervisor] *********************
changed: [localhost]
TASK [postgres : Dump pmm-managed database] ************************************
changed: [localhost]
TASK [postgres : Stop Postgres database without supervisor] ********************
changed: [localhost]
TASK [postgres : Start Postgres 14 database without supervisor] ****************
fatal: [localhost]: FAILED! => {"changed": true, "cmd": ["/usr/pgsql-14/bin/pg_ctl", "start", "-D", "/srv/postgres14", "-o", "-c logging_collector=off"], "delta": "0:00:00.540724", "end": "2024-01-09 07:40:41.022439", "msg": "non-zero return code", "rc": 1, "start": "2024-01-09 07:40:40.481715", "stderr": "pg_ctl: another server might be running; trying to start server anyway\npg_ctl: could not start server\nExamine the log output.", "stderr_lines": ["pg_ctl: another server might be running; trying to start server anyway", "pg_ctl: could not start server", "Examine the log output."], "stdout": "waiting for server to start....2024-01-09 07:40:40.923 UTC [1662] FATAL: lock file \"postmaster.pid\" already exists\n2024-01-09 07:40:40.923 UTC [1662] HINT: Is another postmaster (PID 1629) running in data directory \"/srv/postgres14\"?\n stopped waiting", "stdout_lines": ["waiting for server to start....2024-01-09 07:40:40.923 UTC [1662] FATAL: lock file \"postmaster.pid\" already exists", "2024-01-09 07:40:40.923 UTC [1662] HINT: Is another postmaster (PID 1629) running in data directory \"/srv/postgres14\"?", " stopped waiting"]}
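The postmaster.pid collision above could be detected before calling pg_ctl. A minimal sketch of such a check (the helper name and layout are my own, not part of PMM's playbooks):

```shell
#!/bin/sh
# pg_running DATADIR -- hypothetical helper: succeed only if the postmaster
# whose PID is recorded in DATADIR/postmaster.pid is still alive.
pg_running() {
    pidfile="$1/postmaster.pid"
    [ -f "$pidfile" ] || return 1        # no lock file at all
    pid=$(head -n 1 "$pidfile")          # first line of the lock file is the PID
    kill -0 "$pid" 2>/dev/null           # signal 0: liveness check, sends nothing
}
```

The playbook could call such a check and wait (or fail early with a clear message) instead of letting pg_ctl trip over the lock file held by the supervisor-spawned postmaster.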

Because /srv/postgres14 now exists, the database migration is skipped on the next attempt to update PMM:

# cat /srv/logs/pmm-update-perform-init.log | grep -A 1 -B 1 Restore
TASK [postgres : Restore pmm-managed database] *********************************
skipping: [localhost]
--
TASK [postgres : Restore pmm-managed database] *********************************
skipping: [localhost]
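The skip happens because the playbook keys the restore on the existence of /srv/postgres14. The guard boils down to something like this sketch (the function name is illustrative, not PMM's actual code):

```shell
#!/bin/sh
# migration_needed DATADIR -- succeed if the Postgres 14 data directory does
# not exist yet, i.e. the dump/restore migration still has to run.
migration_needed() {
    [ ! -d "$1" ]
}
```

Because the failed run above left /srv/postgres14 behind (created and initialized, but never restored into), this check fails on the next attempt and the restore tasks are skipped.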

You can see from supervisord.log that PostgreSQL is being restarted by supervisor:

cat /srv/logs/supervisord.log | grep postgres
2024-01-09 07:39:36,030 INFO spawned: 'postgresql' with pid 14
2024-01-09 07:39:37,046 INFO success: postgresql entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2024-01-09 07:40:04,952 INFO waiting for alertmanager, postgresql, pmm-agent, victoriametrics, grafana, cron, nginx, pmm-managed, qan-api2, vmalert, clickhouse to die
2024-01-09 07:40:08,040 INFO waiting for postgresql, clickhouse to die
2024-01-09 07:40:08,078 INFO stopped: postgresql (exit status 0)
2024-01-09 07:40:09,943 INFO spawned: 'postgresql' with pid 16
2024-01-09 07:40:10,179 INFO exited: postgresql (exit status 2; not expected)
2024-01-09 07:40:11,220 INFO spawned: 'postgresql' with pid 95
2024-01-09 07:40:11,276 INFO exited: postgresql (exit status 2; not expected)
2024-01-09 07:40:13,479 INFO spawned: 'postgresql' with pid 418
2024-01-09 07:40:13,516 INFO exited: postgresql (exit status 2; not expected)
2024-01-09 07:40:17,335 INFO spawned: 'postgresql' with pid 591
2024-01-09 07:40:17,374 INFO exited: postgresql (exit status 2; not expected)
2024-01-09 07:40:21,505 INFO spawned: 'postgresql' with pid 919
2024-01-09 07:40:21,536 INFO exited: postgresql (exit status 2; not expected)
2024-01-09 07:40:26,542 INFO spawned: 'postgresql' with pid 1098
2024-01-09 07:40:26,554 INFO exited: postgresql (exit status 2; not expected)
2024-01-09 07:40:32,894 INFO spawned: 'postgresql' with pid 1373
2024-01-09 07:40:32,909 INFO exited: postgresql (exit status 2; not expected)
2024-01-09 07:40:40,215 INFO spawned: 'postgresql' with pid 1629
2024-01-09 07:40:42,117 INFO success: postgresql entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)

However, note that at this point supervisor had not yet reached the FATAL state, after which it gives up autostarting postgresql for good. That only happened later, e.g.:

2024-01-09 07:51:22,037 INFO gave up: postgresql entered FATAL state, too many start retries too quickly

Perhaps the workaround is to add a delay before starting PostgreSQL without supervisor, giving supervisor time to mark postgresql as FATAL and ensuring that no other instance of Postgres is running:

- name: Pause for 1 minute before starting Postgres 11
  ansible.builtin.pause:
    minutes: 1
- name: Run Postgres database without supervisor
  command: /usr/pgsql-11/bin/pg_ctl start -D /srv/postgres
  become: true
  become_user: postgres
  become_method: su
  when: is_upgrade.stat.exists
- name: Dump pmm-managed database
  postgresql_db:
    name: pmm-managed
    state: dump
    target: /tmp/pmm-managed-upgrade.sql
  when: is_upgrade.stat.exists
- name: Stop Postgres database without supervisor
  command: /usr/pgsql-11/bin/pg_ctl stop -D /srv/postgres
  become: true
  become_user: postgres
  become_method: su
  when: is_upgrade.stat.exists
- name: Pause for 1 minute before starting Postgres 14
  ansible.builtin.pause:
    minutes: 1
- name: Start Postgres 14 database without supervisor
  command: /usr/pgsql-14/bin/pg_ctl start -D /srv/postgres14 -o "-c logging_collector=off"
  become: true
  become_user: postgres
  become_method: su
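How long that pause needs to be depends on supervisor's restart policy for the postgresql program: with autorestart enabled, supervisor respawns a failed process with an increasing backoff until startretries consecutive failed starts push it into FATAL. A supervisord program section along these lines controls that behavior (the values here are illustrative assumptions, not necessarily PMM's actual configuration):

```
[program:postgresql]
command = /usr/pgsql-14/bin/postmaster -D /srv/postgres14
autorestart = true
startretries = 3   ; consecutive failed starts before supervisor gives up (FATAL)
startsecs = 1      ; process must stay up this long to count as a successful start
```

Note that any start that survives startsecs resets the retry counter, which is why a long sequence of spawn/exit pairs can run for many minutes before FATAL is reached.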

How to test

None

How to document

None

AFFECTED CS IDs

CS0042644

Activity

Roma Novikov January 19, 2024 at 1:43 PM

For better visibility, details about such updates have been added to our documentation: https://docs.percona.com/percona-monitoring-and-management/how-to/upgrade.html#update-the-server

Jaime Sicam January 10, 2024 at 9:15 AM

I think I understand why PostgreSQL keeps restarting even after the container image has been changed. When the image is changed from, say, 2.23.0 to 2.41.0, /srv/postgres14 does not exist at first. So when the PMM server starts, supervisor attempts to restart PostgreSQL automatically several times, failing each time because the data directory does not exist. Meanwhile, the Ansible playbook tries to stop both pmm-managed and postgresql if /srv/postgres14 does not exist. The stop operation succeeds, but supervisor keeps attempting to restart PostgreSQL anyway. The playbook then creates /srv/postgres14 and initializes it. If supervisor manages to start PostgreSQL before the playbook does, either while taking the backup or while restoring it, the data migration fails, and on the next attempt to restart the update process it is skipped because /srv/postgres14 now exists.

To reproduce:

Create a playbook for stopping pmm-managed and postgresql:

cat stop.yaml
- hosts: localhost
  tasks:
    - name: Stop pmm-managed and postgres before backup database
      supervisorctl:
        name: "{{ item }}"
        state: stopped
      loop:
        - pmm-managed
        - postgresql

Stop postgresql and move postgres14 to another directory:

supervisorctl stop postgresql
mv /srv/postgres14 /srv/postgres14backup

Attempt to start PostgreSQL with supervisor (this simulates the container having just started):

supervisorctl start postgresql

Attempt to stop PostgreSQL from playbook:

ansible-playbook stop.yaml

Check /srv/logs/supervisord.log and verify that the playbook couldn’t stop supervisor from restarting PostgreSQL:

2024-01-10 09:09:58,205 INFO spawned: 'postgresql' with pid 13884
2024-01-10 09:09:58,224 INFO exited: postgresql (exit status 2; not expected)
2024-01-10 09:10:00,005 INFO spawned: 'postgresql' with pid 13885
2024-01-10 09:10:00,018 INFO exited: postgresql (exit status 2; not expected)
2024-01-10 09:10:02,800 INFO spawned: 'postgresql' with pid 13894
2024-01-10 09:10:02,808 INFO exited: postgresql (exit status 2; not expected)
2024-01-10 09:10:06,125 INFO spawned: 'postgresql' with pid 13902
2024-01-10 09:10:06,133 INFO exited: postgresql (exit status 2; not expected)
2024-01-10 09:10:10,436 INFO spawned: 'postgresql' with pid 13919
2024-01-10 09:10:10,444 INFO exited: postgresql (exit status 2; not expected)
2024-01-10 09:10:16,117 INFO spawned: 'postgresql' with pid 13929
2024-01-10 09:10:16,125 INFO exited: postgresql (exit status 2; not expected)
2024-01-10 09:10:22,685 INFO spawned: 'postgresql' with pid 13988
2024-01-10 09:10:22,694 INFO exited: postgresql (exit status 2; not expected)
2024-01-10 09:10:24,336 INFO waiting for pmm-managed to stop
2024-01-10 09:10:24,343 INFO stopped: pmm-managed (exit status 0) -- this is from the playbook
2024-01-10 09:10:24,343 INFO reaped unknown pid 14026
2024-01-10 09:10:30,002 INFO spawned: 'postgresql' with pid 14070 -- supervisor still attempts to restart postgres
2024-01-10 09:10:30,011 INFO exited: postgresql (exit status 2; not expected)
2024-01-10 09:10:38,109 INFO spawned: 'postgresql' with pid 14095
2024-01-10 09:10:38,118 INFO exited: postgresql (exit status 2; not expected)
2024-01-10 09:10:47,171 INFO spawned: 'postgresql' with pid 14120
2024-01-10 09:10:47,185 INFO exited: postgresql (exit status 2; not expected)
2024-01-10 09:10:57,864 INFO spawned: 'postgresql' with pid 14153
2024-01-10 09:10:57,873 INFO exited: postgresql (exit status 2; not expected)
2024-01-10 09:10:57,873 INFO gave up: postgresql entered FATAL state, too many start retries too quickly

Given this, it doesn't matter which Docker version is used; the upgrade can fail either way depending on timing. I think the workaround is to wait until supervisor reaches the point where it gives up restarting PostgreSQL, and only then let PMM start updating the database.
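Rather than a fixed pause, the updater could poll supervisorctl until postgresql actually reports FATAL. A rough sketch of that wait, assuming supervisorctl's usual "name STATE ..." status column layout (the function names are mine, not part of PMM):

```shell
#!/bin/sh
# state_of -- parse the state column (RUNNING, FATAL, STOPPED, ...) from a
# `supervisorctl status` line fed on stdin.
state_of() {
    awk '{print $2}'
}

# wait_for_state SERVICE STATE [TRIES] -- poll supervisorctl every 2 seconds
# until SERVICE reports STATE, or give up after TRIES polls (default 60).
wait_for_state() {
    service="$1"; want="$2"; tries="${3:-60}"
    while [ "$tries" -gt 0 ]; do
        state=$(supervisorctl status "$service" | state_of)
        [ "$state" = "$want" ] && return 0
        tries=$((tries - 1))
        sleep 2
    done
    return 1
}
```

Usage would be something like: `wait_for_state postgresql FATAL && /usr/pgsql-14/bin/pg_ctl start -D /srv/postgres14`, which removes the timing guesswork that a fixed one-minute pause leaves in place.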

Done

Details

Needs QA

Yes

Needs Doc

Yes

Created January 9, 2024 at 9:25 AM
Updated March 5, 2024 at 9:53 PM
Resolved February 1, 2024 at 11:26 AM
