Upgrade fails from 2.39.0 to 2.40.0

Description

I would like to report an interesting upgrade problem.

I was able to upgrade my PMM from 2.37.1 to 2.39.0; that upgrade completed successfully.

When I tried to upgrade from 2.39.0 to 2.40.0 or 2.41.2, it failed with error 503.

Here are more details about my environment:

Kubernetes: v1.21.14

AWS instance type: c6a.2xlarge

OS: Ubuntu Focal 20.04.amd64-server

 

kubectl describe pod pmm

Events:
Type     Reason     Age              From               Message
----     ------     ----             ----               -------
Normal   Scheduled  40s              default-scheduler  Successfully assigned pmm/percona-monitoring-pmm-0 to
Normal   Pulling    39s              kubelet            Pulling image "percona/pmm-server:2.41.2"
Normal   Pulled     18s              kubelet            Successfully pulled image "percona/pmm-server:2.41.2" in 20.72093687s
Normal   Created    10s              kubelet            Created container pmm
Normal   Started    9s               kubelet            Started container pmm
Warning  Unhealthy  0s (x2 over 5s)  kubelet            Readiness probe failed: HTTP probe failed with statuscode: 500
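
For what it's worth, the readiness endpoint can also be hit manually from inside the pod. This is only a sketch and assumes the chart's default probe path /v1/readyz (check your values if it is overridden):

# exec into the PMM pod (namespace and pod name taken from the events above)
kubectl -n pmm exec -it percona-monitoring-pmm-0 -- bash

# call the readiness endpoint locally; a 500 here matches the failed probe
curl -s -o /dev/null -w "%{http_code}\n" http://127.0.0.1/v1/readyz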

I attached to the pod to check the running services:

supervisorctl status
alertmanager RUNNING pid 45, uptime 0:01:03
clickhouse RUNNING pid 28, uptime 0:01:03
dbaas-controller STOPPED Not started
grafana BACKOFF Exited too quickly (process log may have details)
nginx RUNNING pid 30, uptime 0:01:03
pmm-agent RUNNING pid 568, uptime 0:00:58
pmm-managed RUNNING pid 67, uptime 0:01:03
pmm-update-perform STOPPED Not started
pmm-update-perform-init EXITED Jun 10 12:24 PM
postgresql RUNNING pid 18, uptime 0:01:03
prometheus STOPPED Not started
qan-api2 RUNNING pid 735, uptime 0:00:56
victoriametrics BACKOFF Exited too quickly (process log may have details)
vmalert RUNNING pid 42, uptime 0:01:03
vmproxy RUNNING pid 51, uptime 0:01:03

As you can see, the grafana and victoriametrics services are failing to start.

tail -f /srv/logs/grafana.log
logger=settings t=2024-06-10T12:25:04.216094091Z level=info msg="Path Data" path=/srv/grafana
logger=settings t=2024-06-10T12:25:04.216096111Z level=info msg="Path Logs" path=/srv/logs
logger=settings t=2024-06-10T12:25:04.216098021Z level=info msg="Path Plugins" path=/srv/grafana/plugins
logger=settings t=2024-06-10T12:25:04.216099911Z level=info msg="Path Provisioning" path=/usr/share/grafana/conf/provisioning
logger=settings t=2024-06-10T12:25:04.216101931Z level=info msg="App mode production"
logger=sqlstore t=2024-06-10T12:25:04.216148452Z level=info msg="Connecting to DB" dbtype=postgres
logger=migrator t=2024-06-10T12:25:04.241617879Z level=info msg="Starting DB migrations"
logger=migrator t=2024-06-10T12:25:04.243740077Z level=info msg="Executing migration" id="Add OAuth ID token to user_auth"
Failed to start grafana. error: migration failed (id = Add OAuth ID token to user_auth): pq: invalid input syntax for type integer: "true"
migration failed (id = Add OAuth ID token to user_auth): pq: invalid input syntax for type integer: "true"
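
To see what the migration is tripping on, the user_auth table can be inspected from inside the pod. This is only a sketch and assumes PMM's bundled PostgreSQL with a "grafana" database reachable as the postgres superuser (names may differ in your setup):

# describe the user_auth table the failing migration touches
su -l postgres -c 'psql -d grafana -c "\d user_auth"'

# peek at a few rows to spot a boolean-looking value where an integer is expected
su -l postgres -c 'psql -d grafana -c "SELECT id, user_id, auth_module FROM user_auth LIMIT 5;"'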

 

tail -n 100 /srv/logs/victoriametrics.log

2024-06-10T12:25:10.402Z info /home/builder/rpm/BUILD/VictoriaMetrics-pmm-6401-v1.93.4/app/victoria-metrics/main.go:78 starting VictoriaMetrics at "127.0.0.1:9090"...
2024-06-10T12:25:10.402Z info /home/builder/rpm/BUILD/VictoriaMetrics-pmm-6401-v1.93.4/app/vmstorage/main.go:109 opening storage at "/srv/victoriametrics/data" with -retentionPeriod=30d
2024-06-10T12:25:10.409Z info /home/builder/rpm/BUILD/VictoriaMetrics-pmm-6401-v1.93.4/lib/memory/memory.go:42 limiting caches to 1932735283 bytes, leaving 1288490189 bytes to the OS according to -memory.allowedPercent=60
2024-06-10T12:25:10.706Z info /home/builder/rpm/BUILD/VictoriaMetrics-pmm-6401-v1.93.4/lib/storage/storage.go:889 discarding /srv/victoriametrics/data/cache/next_day_metric_ids_v2, since it contains data for stale generation; got 1717593008813727069; want 1717747268211120789
2024-06-10T12:25:10.707Z info /home/builder/rpm/BUILD/VictoriaMetrics-pmm-6401-v1.93.4/lib/storage/storage.go:894 discarding /srv/victoriametrics/data/cache/next_day_metric_ids_v2, since it contains data for stale date; got 19879; want 19884
2024-06-10T12:25:10.729Z panic /home/builder/rpm/BUILD/VictoriaMetrics-pmm-6401-v1.93.4/lib/storage/part_header.go:142 FATAL: cannot parse metadata from "/srv/victoriametrics/data/data/small/2024_06/17D61DBDF650877A": unexpected number of substrings in the part name "17D61DBDF650877A": got 1; want 5
panic: FATAL: cannot parse metadata from "/srv/victoriametrics/data/data/small/2024_06/17D61DBDF650877A": unexpected number of substrings in the part name "17D61DBDF650877A": got 1; want 5

goroutine 1 [running]:
github.com/VictoriaMetrics/VictoriaMetrics/lib/logger.logMessage({0x110dd35, 0x5}, {0xc0004ec0c0, 0xb5}, 0x2?)
/home/builder/rpm/BUILD/VictoriaMetrics-pmm-6401-v1.93.4/lib/logger/logger.go:309 +0xa91
github.com/VictoriaMetrics/VictoriaMetrics/lib/logger.logLevelSkipframes(0x1, {0x110dd35, 0x5}, {0x114a450?, 0x0?}, {0xc00068eca0?, 0x445d71?, 0xc0000061a0?})
/home/builder/rpm/BUILD/VictoriaMetrics-pmm-6401-v1.93.4/lib/logger/logger.go:138 +0x199
github.com/VictoriaMetrics/VictoriaMetrics/lib/logger.logLevel(...)
/home/builder/rpm/BUILD/VictoriaMetrics-pmm-6401-v1.93.4/lib/logger/logger.go:130
github.com/VictoriaMetrics/VictoriaMetrics/lib/logger.Panicf(...)
/home/builder/rpm/BUILD/VictoriaMetrics-pmm-6401-v1.93.4/lib/logger/logger.go:126
github.com/VictoriaMetrics/VictoriaMetrics/lib/storage.(*partHeader).MustReadMetadata(0xc0001c86c0, {0xc00078c400, 0x3d})
/home/builder/rpm/BUILD/VictoriaMetrics-pmm-6401-v1.93.4/lib/storage/part_header.go:142 +0x345
github.com/VictoriaMetrics/VictoriaMetrics/lib/storage.mustOpenFilePart({0xc00078c400?, 0x2c?})
/home/builder/rpm/BUILD/VictoriaMetrics-pmm-6401-v1.93.4/lib/storage/part.go:53 +0x65
github.com/VictoriaMetrics/VictoriaMetrics/lib/storage.mustOpenParts({0xc0001c8450, 0x2c}, {0xc000474800, 0x13, 0x4e9ba0?})
/home/builder/rpm/BUILD/VictoriaMetrics-pmm-6401-v1.93.4/lib/storage/partition.go:1805 +0x433
github.com/VictoriaMetrics/VictoriaMetrics/lib/storage.mustOpenPartition({0xc0001c8450?, 0xc00068f248?}, {0xc0001c8480, 0x2a}, 0xc0009f8780?)
/home/builder/rpm/BUILD/VictoriaMetrics-pmm-6401-v1.93.4/lib/storage/partition.go:267 +0x246
github.com/VictoriaMetrics/VictoriaMetrics/lib/storage.mustOpenPartitions({0xc0001c80c0, 0x24}, {0xc0001c8270, 0x22}, 0x0?)
/home/builder/rpm/BUILD/VictoriaMetrics-pmm-6401-v1.93.4/lib/storage/table.go:480 +0x275
github.com/VictoriaMetrics/VictoriaMetrics/lib/storage.mustOpenTable({0xc00003a060?, 0x17d6aa0a179f1a95?}, 0xc000103a00)
/home/builder/rpm/BUILD/VictoriaMetrics-pmm-6401-v1.93.4/lib/storage/table.go:103 +0x269
github.com/VictoriaMetrics/VictoriaMetrics/lib/storage.MustOpenStorage({0x7ffc8567390b?, 0x110d2d9?}, 0x9356907420000, 0x0, 0x0)
/home/builder/rpm/BUILD/VictoriaMetrics-pmm-6401-v1.93.4/lib/storage/storage.go:273 +0xfc5
github.com/VictoriaMetrics/VictoriaMetrics/app/vmstorage.Init(0x12f6420)
/home/builder/rpm/BUILD/VictoriaMetrics-pmm-6401-v1.93.4/app/vmstorage/main.go:112 +0x51e
main.main()
/home/builder/rpm/BUILD/VictoriaMetrics-pmm-6401-v1.93.4/app/victoria-metrics/main.go:85 +0x373
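
The part the panic points at can be inspected directly on the volume; this is just a rough check, with the paths taken from the panic message above:

# list the part directory VictoriaMetrics refuses to open
ls -la /srv/victoriametrics/data/data/small/2024_06/17D61DBDF650877A

# compare against the other parts in the same monthly partition
ls /srv/victoriametrics/data/data/small/2024_06/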

 

The storage type I use in AWS is EFS.
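
For completeness, this is how I check the storage side from Kubernetes; the PVC name below is an assumption based on the chart's default volume claim template:

# StorageClasses and the PVC bound to the PMM StatefulSet
kubectl get storageclass
kubectl -n pmm get pvc

# confirm the claim is backed by the EFS provisioner
kubectl -n pmm describe pvc pmm-storage-percona-monitoring-pmm-0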

I installed minikube with the same Kubernetes version (v1.21.14) on my local computer, and there the upgrade works correctly using minikube's default storage type. I used the same Helm chart on both minikube and AWS.
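
For reference, an upgrade with this chart looks roughly like this (a sketch only; the release name is inferred from the pod name above and the image.tag values key is assumed from the chart defaults):

# point the existing release at the target PMM version, keeping all other values
helm repo update
helm upgrade percona-monitoring percona/pmm \
  --namespace pmm \
  --set image.tag=2.41.2 \
  --reuse-values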

Previously I upgraded from 2.33 to 2.37.1 and it worked without any issue.

 

Here are the CPU flags from my AWS EC2 instance:

flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext invpcid_single ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save vaes vpclmulqdq rdpid
bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass
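
For reference, the two lines above can be reproduced on the instance with:

# first CPU's flags and kernel-reported bug workarounds
grep -m1 '^flags' /proc/cpuinfo
grep -m1 '^bugs' /proc/cpuinfo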

On my DEV environment I hit this issue on Proxmox as well, but after I changed the CPU type and tested the same upgrade path on that Proxmox environment, it works correctly.

On AWS, when the upgrade to 2.40.0 failed, I reverted to 2.39.0 and the pod runs without any issue.
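
The revert is straightforward under the same assumptions as the upgrade sketch above:

# set the image tag back to the last working version
helm upgrade percona-monitoring percona/pmm \
  --namespace pmm \
  --set image.tag=2.39.0 \
  --reuse-values

# or roll back to the previous helm revision
helm -n pmm history percona-monitoring
helm -n pmm rollback percona-monitoring <REVISION>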

I think I hit some limitation, but I'm not sure.

Please advise. Thank you in advance!

How to test

None

How to document

None

Environment

Kubernetes: v1.21.14

AWS instance type: c6a.2xlarge

OS: Ubuntu Focal 20.04.amd64-server

Created June 10, 2024 at 1:21 PM
Updated August 12, 2024 at 10:06 AM