I would like to report an interesting upgrade problem.
I can upgrade my PMM from 2.37.1 to 2.39.0; that upgrade completes successfully.
When I try to upgrade from 2.39.0 to 2.40.0 or 2.41.2, it fails with an HTTP 503 error.
Here are more details about my environment:
Kubernetes: v1.21.14
AWS instance type: c6a.2xlarge
OS: Ubuntu Focal 20.04.amd64-server
kubectl describe pod pmm
Events:
Type Reason Age From Message
Normal Scheduled 40s default-scheduler Successfully assigned pmm/percona-monitoring-pmm-0 to
Normal Pulling 39s kubelet Pulling image "percona/pmm-server:2.41.2"
Normal Pulled 18s kubelet Successfully pulled image "percona/pmm-server:2.41.2" in 20.72093687s
Normal Created 10s kubelet Created container pmm
Normal Started 9s kubelet Started container pmm
Warning Unhealthy 0s (x2 over 5s) kubelet Readiness probe failed: HTTP probe failed with statuscode: 500
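In case it is useful, the readiness check can also be reproduced by hand from inside the pod. The /v1/readyz path below is my assumption about what the chart's readiness probe targets (namespace and pod name are from my setup):
kubectl -n pmm exec -it percona-monitoring-pmm-0 -- curl -s -o /dev/null -w '%{http_code}\n' http://127.0.0.1/v1/readyz
# a non-2xx code here matches the failing readiness probe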
I attached to the pod to check which services are running on it:
supervisorctl status
alertmanager RUNNING pid 45, uptime 0:01:03
clickhouse RUNNING pid 28, uptime 0:01:03
dbaas-controller STOPPED Not started
grafana BACKOFF Exited too quickly (process log may have details)
nginx RUNNING pid 30, uptime 0:01:03
pmm-agent RUNNING pid 568, uptime 0:00:58
pmm-managed RUNNING pid 67, uptime 0:01:03
pmm-update-perform STOPPED Not started
pmm-update-perform-init EXITED Jun 10 12:24 PM
postgresql RUNNING pid 18, uptime 0:01:03
prometheus STOPPED Not started
qan-api2 RUNNING pid 735, uptime 0:00:56
victoriametrics BACKOFF Exited too quickly (process log may have details)
vmalert RUNNING pid 42, uptime 0:01:03
vmproxy RUNNING pid 51, uptime 0:01:03
As you can see, the grafana and victoriametrics services are not running.
tail -f /srv/logs/grafana.log
logger=settings t=2024-06-10T12:25:04.216094091Z level=info msg="Path Data" path=/srv/grafana
logger=settings t=2024-06-10T12:25:04.216096111Z level=info msg="Path Logs" path=/srv/logs
logger=settings t=2024-06-10T12:25:04.216098021Z level=info msg="Path Plugins" path=/srv/grafana/plugins
logger=settings t=2024-06-10T12:25:04.216099911Z level=info msg="Path Provisioning" path=/usr/share/grafana/conf/provisioning
logger=settings t=2024-06-10T12:25:04.216101931Z level=info msg="App mode production"
logger=sqlstore t=2024-06-10T12:25:04.216148452Z level=info msg="Connecting to DB" dbtype=postgres
logger=migrator t=2024-06-10T12:25:04.241617879Z level=info msg="Starting DB migrations"
logger=migrator t=2024-06-10T12:25:04.243740077Z level=info msg="Executing migration" id="Add OAuth ID token to user_auth"
Failed to start grafana. error: migration failed (id = Add OAuth ID token to user_auth): pq: invalid input syntax for type integer: "true"
migration failed (id = Add OAuth ID token to user_auth): pq: invalid input syntax for type integer: "true"
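To dig into the failed Grafana migration, I think the migration history can be checked directly in the built-in PostgreSQL. This is only a sketch; it assumes Grafana's schema lives in a database named grafana and that local connections as the postgres user are allowed (adjust for how auth is set up in the container):
su - postgres -c "psql -d grafana -c 'SELECT migration_id, success, error FROM migration_log ORDER BY id DESC LIMIT 5;'"
# the most recent rows should show the "Add OAuth ID token to user_auth" migration and the recorded error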
tail -n 100 /srv/logs/victoriametrics.log
2024-06-10T12:25:10.402Z info /home/builder/rpm/BUILD/VictoriaMetrics-pmm-6401-v1.93.4/app/victoria-metrics/main.go:78 starting VictoriaMetrics at "127.0.0.1:9090"...
2024-06-10T12:25:10.402Z info /home/builder/rpm/BUILD/VictoriaMetrics-pmm-6401-v1.93.4/app/vmstorage/main.go:109 opening storage at "/srv/victoriametrics/data" with -retentionPeriod=30d
2024-06-10T12:25:10.409Z info /home/builder/rpm/BUILD/VictoriaMetrics-pmm-6401-v1.93.4/lib/memory/memory.go:42 limiting caches to 1932735283 bytes, leaving 1288490189 bytes to the OS according to -memory.allowedPercent=60
2024-06-10T12:25:10.706Z info /home/builder/rpm/BUILD/VictoriaMetrics-pmm-6401-v1.93.4/lib/storage/storage.go:889 discarding /srv/victoriametrics/data/cache/next_day_metric_ids_v2, since it contains data for stale generation; got 1717593008813727069; want 1717747268211120789
2024-06-10T12:25:10.707Z info /home/builder/rpm/BUILD/VictoriaMetrics-pmm-6401-v1.93.4/lib/storage/storage.go:894 discarding /srv/victoriametrics/data/cache/next_day_metric_ids_v2, since it contains data for stale date; got 19879; want 19884
2024-06-10T12:25:10.729Z panic /home/builder/rpm/BUILD/VictoriaMetrics-pmm-6401-v1.93.4/lib/storage/part_header.go:142 FATAL: cannot parse metadata from "/srv/victoriametrics/data/data/small/2024_06/17D61DBDF650877A": unexpected number of substrings in the part name "17D61DBDF650877A": got 1; want 5
panic: FATAL: cannot parse metadata from "/srv/victoriametrics/data/data/small/2024_06/17D61DBDF650877A": unexpected number of substrings in the part name "17D61DBDF650877A": got 1; want 5
goroutine 1 [running]:
github.com/VictoriaMetrics/VictoriaMetrics/lib/logger.logMessage({0x110dd35, 0x5}, {0xc0004ec0c0, 0xb5}, 0x2?)
/home/builder/rpm/BUILD/VictoriaMetrics-pmm-6401-v1.93.4/lib/logger/logger.go:309 +0xa91
github.com/VictoriaMetrics/VictoriaMetrics/lib/logger.logLevelSkipframes(0x1, {0x110dd35, 0x5}, {0x114a450?, 0x0?}, {0xc00068eca0?, 0x445d71?, 0xc0000061a0?})
/home/builder/rpm/BUILD/VictoriaMetrics-pmm-6401-v1.93.4/lib/logger/logger.go:138 +0x199
github.com/VictoriaMetrics/VictoriaMetrics/lib/logger.logLevel(...)
/home/builder/rpm/BUILD/VictoriaMetrics-pmm-6401-v1.93.4/lib/logger/logger.go:130
github.com/VictoriaMetrics/VictoriaMetrics/lib/logger.Panicf(...)
/home/builder/rpm/BUILD/VictoriaMetrics-pmm-6401-v1.93.4/lib/logger/logger.go:126
github.com/VictoriaMetrics/VictoriaMetrics/lib/storage.(*partHeader).MustReadMetadata(0xc0001c86c0, {0xc00078c400, 0x3d})
/home/builder/rpm/BUILD/VictoriaMetrics-pmm-6401-v1.93.4/lib/storage/part_header.go:142 +0x345
github.com/VictoriaMetrics/VictoriaMetrics/lib/storage.mustOpenFilePart({0xc00078c400?, 0x2c?})
/home/builder/rpm/BUILD/VictoriaMetrics-pmm-6401-v1.93.4/lib/storage/part.go:53 +0x65
github.com/VictoriaMetrics/VictoriaMetrics/lib/storage.mustOpenParts({0xc0001c8450, 0x2c}, {0xc000474800, 0x13, 0x4e9ba0?})
/home/builder/rpm/BUILD/VictoriaMetrics-pmm-6401-v1.93.4/lib/storage/partition.go:1805 +0x433
github.com/VictoriaMetrics/VictoriaMetrics/lib/storage.mustOpenPartition({0xc0001c8450?, 0xc00068f248?}, {0xc0001c8480, 0x2a}, 0xc0009f8780?)
/home/builder/rpm/BUILD/VictoriaMetrics-pmm-6401-v1.93.4/lib/storage/partition.go:267 +0x246
github.com/VictoriaMetrics/VictoriaMetrics/lib/storage.mustOpenPartitions({0xc0001c80c0, 0x24}, {0xc0001c8270, 0x22}, 0x0?)
/home/builder/rpm/BUILD/VictoriaMetrics-pmm-6401-v1.93.4/lib/storage/table.go:480 +0x275
github.com/VictoriaMetrics/VictoriaMetrics/lib/storage.mustOpenTable({0xc00003a060?, 0x17d6aa0a179f1a95?}, 0xc000103a00)
/home/builder/rpm/BUILD/VictoriaMetrics-pmm-6401-v1.93.4/lib/storage/table.go:103 +0x269
github.com/VictoriaMetrics/VictoriaMetrics/lib/storage.MustOpenStorage({0x7ffc8567390b?, 0x110d2d9?}, 0x9356907420000, 0x0, 0x0)
/home/builder/rpm/BUILD/VictoriaMetrics-pmm-6401-v1.93.4/lib/storage/storage.go:273 +0xfc5
github.com/VictoriaMetrics/VictoriaMetrics/app/vmstorage.Init(0x12f6420)
/home/builder/rpm/BUILD/VictoriaMetrics-pmm-6401-v1.93.4/app/vmstorage/main.go:112 +0x51e
main.main()
/home/builder/rpm/BUILD/VictoriaMetrics-pmm-6401-v1.93.4/app/victoria-metrics/main.go:85 +0x373
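If I read the VictoriaMetrics error correctly, newer builds keep per-part metadata in a metadata.json file inside each part directory and only fall back to parsing the directory name (the old five-field naming) when that file cannot be read, so a quick check on the part it complains about would be (path taken from the log above):
ls -la /srv/victoriametrics/data/data/small/2024_06/17D61DBDF650877A/
# if metadata.json is missing or zero-length here, that would explain the parse fallback and the panic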
The storage type I use in AWS is EFS.
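As far as I know, the VictoriaMetrics documentation recommends against NFS-based filesystems such as EFS, so one thing I am considering is moving the PVC to EBS. A rough sketch of what that might look like, assuming the aws-ebs-csi-driver is installed (the StorageClass name is just my choice):
kubectl apply -f - <<'EOF'
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: pmm-gp3
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
volumeBindingMode: WaitForFirstConsumer
EOF
# then reference this StorageClass from the chart's storage settings (the exact value name depends on the chart version)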
I installed minikube with the same Kubernetes version (v1.21.14) on my local computer, and there the upgrade works correctly, using minikube's default storage type. I used the same Helm chart in both minikube and AWS.
Previously I upgraded from 2.33 to 2.37.1 and it worked without any issue.
Here are the CPU flags from my AWS EC2 instance:
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext invpcid_single ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save vaes vpclmulqdq rdpid
bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass
On my dev environment I hit this same issue on Proxmox, but after I changed the CPU type there, the same upgrade path worked correctly.
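My guess is that the Proxmox case was about missing CPU features (a conservative CPU type such as kvm64 hides things like SSE4.2/AVX that some of the bundled components appear to require), which would explain why changing the CPU type helped. A quick way to check a host or VM, assuming a reasonably recent glibc for the second command:
grep -o -w -E 'sse4_2|avx2' /proc/cpuinfo | sort | uniq -c
/lib64/ld-linux-x86-64.so.2 --help | grep 'x86-64-v'
# the second command lists which x86-64 microarchitecture levels the CPU supports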
In AWS, when the upgrade to 2.40.0 failed, I reverted to 2.39.0 and the pod runs without any issue.
I think I am hitting some limitation, but I'm not sure.
Please advise. Thank you in advance!