Dead Tuples graph on PostgreSQL Vacuum Monitoring Dashboard displays invalid percentage

Description

The issue can be seen on the pmmdemo website. The Dead Tuples graph displays 10000% dead tuples, which is invalid. 

Here is the link: https://pmmdemo.percona.com/graph/d/postgres_vacuum_monitoring/postgresql-vacuum-monitoring?from=now-12h&to=now&var-interval=$__auto_interval_interval&var-environment=All&var-node_name=pg12&var-service_name=pg12-postgresql&var-database=sbtest&var-work_mem=4194304&var-version=14.7&var-max_connections=100&var-shared_buffers=134217728&var-wal_buffers=4194304&var-wal_segment_size=16777216&var-maintenance_work_mem=67108864&var-block_size=8192&var-checkpoint_segments=&var-checkpoint_timeout=300&var-default_statistics_target=100&var-seq_page_cost=1&var-random_page_cost=4&var-effective_cache_size=4294967296&var-effective_io_concurrency=1&var-fsync=1&var-autovacuum=1&var-autovacuum_analyze_scale_factor=0.1&var-autovacuum_analyze_threshold=50&var-autovacuum_vacuum_scale_factor=0.2&var-autovacuum_vacuum_threshold=50&var-autovacuum_vacuum_cost_limit=1684514664000&var-autovacuum_vacuum_cost_delay=0.002&var-autovacuum_max_workers=3&var-autovacuum_naptime=60&var-autovacuum_freeze_max_age=200000000&var-logging_collector=1&var-log_min_duration_statement=1684513058000&var-log_duration=0&var-log_lock_waits=0&var-max_wal_senders=10&var-max_wal_size=1073741824&var-min_wal_size=83886080&var-wal_compression=0&var-max_worker_processes=8&var-max_parallel_workers_per_gather=2&var-max_parallel_workers=2&var-autovacuum_work_mem=1684514664000&var-autovacuum_multixact_freeze_max_age=400000000&var-cluster=All&var-replication_set=All&var-node_type=All&var-service_type=All&var-username=All&var-schema=All&orgId=1&refresh=1m&var-tablename=sbtest1

 

How to test

Set up PMM, add a PostgreSQL service, and check the PostgreSQL Vacuum Monitoring dashboard.
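
A minimal sketch for generating dead tuples on the monitored database so the graph has something to show (on pmmdemo this role is played by the sysbench load). The table name `bloat_test` and the row count are arbitrary, and the assumption that the panel is derived from `pg_stat_user_tables`-style counters is mine, not taken from the dashboard definition.

```sql
-- Sketch: churn rows on a throwaway table so dead tuples accumulate,
-- then read the raw counters to compare against the dashboard value.
-- Table name and row count are arbitrary.
CREATE TABLE IF NOT EXISTS bloat_test (id int PRIMARY KEY, payload text);

INSERT INTO bloat_test
SELECT g, repeat('x', 100)
FROM generate_series(1, 100000) AS g
ON CONFLICT (id) DO NOTHING;

-- Every UPDATE leaves the previous row version behind as a dead tuple
-- until (auto)vacuum cleans it up.
UPDATE bloat_test SET payload = repeat('y', 100);

SELECT relname, n_live_tup, n_dead_tup, last_autovacuum
FROM pg_stat_user_tables
WHERE relname = 'bloat_test';
```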

How to document

-

AFFECTED CS IDs

CS0036307

Activity

saikumar.vs September 1, 2023 at 1:37 PM

Verified, please see the test case section for further details.

Leonardo Bacchi Fernandes May 26, 2023 at 6:06 PM

Hello Aleksandar, 

Sorry about that, and thank you for explaining! I saw the bug report marked as Invalid after your comment and thought you had marked it as such. 

Could you please consider revisiting this bug? The problem is in the query being used, not in the PMM display/configuration itself.

Best,

Leo 

aleksandar.yanakiev May 26, 2023 at 11:59 AM

Hey Leonardo

All of this is up to Anton to figure out and improve; my task was just to check whether the configuration and behavior of the pg12 instance are up to spec, which is why I added that in the comment. IMO the graph definitely needs either a change to the calculation or a better explanation so users understand what metric it is actually showing.

Leonardo Bacchi Fernandes May 25, 2023 at 4:58 PM

Hello,

The graph is named dead tuples, and the percentage is 9998-10000%. The name "dead tuples" implies this is a percentage of the entire table that contains dead tuples. So, you should only be able to have 100% of the table as dead tuples (all tuples are marked as dead), right?

I guess the other option is this being the percentage of dead tuples over live tuples, which is confusing and unintuitive. In this case, would it mean there are 100x more dead tuples than live tuples in the table?

I think calculating this over the total number of tuples in the table (live plus dead) would be best. For example, if you see 100% dead tuples in that graph, you'd assume the table is composed entirely of dead tuples; but if the calculation is based on live tuples only, 100% actually means half the table is dead tuples, which is far less intuitive (seeing 100% and having to understand that it means half the table).

What do you think? 
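
For reference, here is a rough sketch of the two candidate calculations against `pg_stat_user_tables`. Which expression the dashboard actually uses is not confirmed here; the point is only that a dead-over-live ratio is unbounded (100x more dead than live tuples reads as 10000%), while a dead-over-total ratio is capped at 100%.

```sql
-- Sketch of the two candidate formulas; the dashboard's actual expression
-- is an assumption, this only illustrates how the two numbers differ.
SELECT relname,
       n_live_tup,
       n_dead_tup,
       -- dead over live: unbounded, 100x more dead than live shows as 10000%
       round(100.0 * n_dead_tup / NULLIF(n_live_tup, 0), 1) AS dead_pct_of_live,
       -- dead over all tuples: capped at 100%, matches the intuitive reading
       round(100.0 * n_dead_tup / NULLIF(n_live_tup + n_dead_tup, 0), 1) AS dead_pct_of_total
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 20;
```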

aleksandar.yanakiev May 24, 2023 at 2:58 PM

I double-checked the pg12 server on the pmmdemo side:

  • re-installed the `pmm-agent` and `percona-pg_stat_monitor12` packages (pmm-agent had been updated, but pg_stat_monitor had not)

  • dropped the contents of /var/lib/pgsql and re-initialized the DB

  • reconfigured the extension based on the latest config from https://docs.percona.com/pg-stat-monitor/

  • ran `DROP/CREATE EXTENSION pg_stat_monitor;` (see the sketch after this list)

  • double-checked the permissions of the dedicated monitoring user and made sure it’s working

  • deleted and re-added the PMM service - `pmm-admin remove/add postgresql [SERVICE_NAME]`
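
A minimal sketch of the extension reset and permission check from the list above; the monitoring role name `pmm` is an assumption, substitute whatever user pmm-agent actually connects as.

```sql
-- Minimal sketch of the extension reset and permission check above.
-- The role name 'pmm' is an assumption; use the real monitoring user.
DROP EXTENSION IF EXISTS pg_stat_monitor;
CREATE EXTENSION pg_stat_monitor;

-- Verify the monitoring role can read the views the agent queries.
SELECT has_table_privilege('pmm', 'pg_stat_monitor', 'SELECT')     AS can_read_pg_stat_monitor,
       has_table_privilege('pmm', 'pg_stat_user_tables', 'SELECT') AS can_read_pg_stat_user_tables;
```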

There were a couple of errors in the pmm-agent logs prior to the above, all of which are now resolved:

May 24 10:24:21 pg12 pmm-agent[30325]: ERRO[2023-05-24T10:24:21.369+00:00] non default bucket time value is not supported, status changed to WAITING  agentID=/agent_id/ca1876b6-4fc2-41dc-ab0b-62ead2bbf0ee component=agent-builtin type=qan_postgresql_pgstatmonitor_agent
May 24 11:13:10 pg12 pmm-agent[18128]: ERRO[2023-05-24T11:13:10.600+00:00] failed to get extended monitor status: failed to query pg_stat_monitor: pq: column pg_stat_monitor.rows does not exist  agentID=/agent_id/ca1876b6-4fc2-41dc-ab0b-62ead2bbf0ee component=agent-builtin type=qan_postgresql_pgstatmonitor_agent
May 24 11:13:10 pg12 pmm-agent[18128]: WARN[2023-05-24T11:13:10.600+00:00] failed to get pgsm_bucket_time, wait time set on 60 seconds  agentID=/agent_id/ca1876b6-4fc2-41dc-ab0b-62ead2bbf0ee component=agent-builtin type=qan_postgresql_pgstatmonitor_agent
May 24 11:13:10 pg12 pmm-agent[18128]: INFO[2023-05-24T11:13:10.600+00:00] Sending status: WAITING.  agentID=/agent_id/ca1876b6-4fc2-41dc-ab0b-62ead2bbf0ee component=agent-builtin type=qan_postgresql_pgstatmonitor_agent

The most promising find was that after stopping the sysbench load, the graph in PMM starts showing `No data`; the moment I start sysbench back up, the graph starts showing numbers in the range 9998-10000%. Most probably the artificial load from sysbench is generating these absurd (or maybe not that absurd) values 😉

Hope all of the above is somewhat helpful. 

Done

Details

Assignee

Reporter

Priority

Labels

Needs QA

Yes

Needs Doc

No

Planned Version/s

Fix versions

Story Points

Affects versions

Smart Checklist

Created May 19, 2023 at 4:47 PM
Updated October 9, 2024 at 8:37 AM
Resolved September 8, 2023 at 9:52 AM
