MongoDB ReplSet Summary dashboard incorrect graph

Description

Hi PMM Team,

I had to fix the Replication Lag graph by replacing service_name with name in the aggregation formula.

https://github.com/percona/grafana-dashboards/blob/main/dashboards/MongoDB_ReplSet_Summary.json#L862

 

Before:

avg by (service_name) (max(max_over_time(mongodb_mongod_replset_member_replication_lag{set="$replset",service_name="$secondary"}[$interval]) > 0) by (service_name,set) or max(max_over_time(mongodb_mongod_replset_member_replication_lag{set="$replset",service_name="$secondary"}[5m]) > 0) by (service_name,set))

After:

avg by (name) (max(max_over_time(mongodb_mongod_replset_member_replication_lag{set="$replset",service_name="$secondary"}[$interval]) > 0) by (name,set) or max(max_over_time(mongodb_mongod_replset_member_replication_lag{set="$replset",service_name="$secondary"}[5m]) > 0) by (name,set))

If aggregated by `service_name`, the mongodb_mongod_replset_member_replication_lag metric returns lag info for all replica set members from each replica, but aggregating by `name` gives the correct per-replica info.
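
For reference, the difference can be checked directly in Grafana Explore. A minimal sketch, assuming a replica set named rs0 (substitute your own $replset value):

# Per-member lag, averaged across the replicas that report it (the "After" form):
avg by (name) (mongodb_mongod_replset_member_replication_lag{set="rs0"})

# Per-reporter average over all members each replica sees (the "Before" behaviour):
avg by (service_name) (mongodb_mongod_replset_member_replication_lag{set="rs0"})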

 

Could you please take a look and validate it?

How to test

None

How to document

None

Attachments: 5


Activity


Taras Kozub November 17, 2021 at 10:36 AM

Flag added: Waiting for the reply from reporter

Denys Kondratenko November 12, 2021 at 7:26 AM

Hm, if you pass different parameters with pmm-admin (the second one after mongodb), it should indeed be different.

 

Those additional parameters have nothing to do with the exporter. They are handled by PMM: first, pmm-admin gathers them:

https://github.com/percona/pmm-admin/blob/2af1c987f304a3fa375af0986b6af6a0390d41bd/commands/management/add_mongodb.go#L179

then pmm-managed creates the vmagentscrapecfg, and vmagent additionally adds those to the metrics.
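
For illustration only, this is roughly how such labels end up on the metrics: a generic vmagent/Prometheus-style scrape job with static labels of the kind pmm-managed generates. The values below are made up, and the exact layout of PMM's vmagentscrapecfg may differ:

scrape_configs:
  - job_name: mongodb_exporter_example      # one job per exporter agent (name is illustrative)
    static_configs:
      - targets: ["127.0.0.1:42002"]        # local mongodb_exporter port, example value
        labels:
          service_name: mongo-rs0-1         # taken from the pmm-admin parameters
          node_name: mongo-rs0-1.example.com
          cluster: mycluster
          replication_set: rs0
          environment: production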

 

Check out your vmagentscrapecfg for each node (your agent_id will be different):

/tmp/vm_agent/agent_id/e301bb7f-c1b5-417e-b24c-07ecde0429d5/vmagentscrapecfg
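
A minimal shell sketch for pulling those files from a node, assuming they live under /tmp/vm_agent as in the path above (agent IDs differ per host):

# Print the path and contents of every vmagent scrape config found under the agent temp dir.
find /tmp/vm_agent -name vmagentscrapecfg -print -exec cat {} \;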

 

Could you please attach the `pmm-admin summary` output from each node, and all vmagentscrapecfg files from all nodes and agents?

Artem Meshcheryakov November 12, 2021 at 12:41 AM

Denys, my facts are:

1) The environment/hostnames did not change.

2) Replicas are running on different hosts.

3) Each host is registered separately with the pmm-admin command, providing the unique service name and host name settings.

4) The dashboard worked fine in the past without changing the environment. It started to misbehave after a series of mongodb + pmm upgrades.

I am trying to drill into it, and I think this is somewhat related to mongodb_exporter changes and the way it deals with labels.

I see that in my situation each replica provides a mongodb_mongod_replset_member_replication_lag series for each service_name that exists in the replica set, while "node" and "node_name" are the labels unique to it.

The change in the dashboard appeared here: https://github.com/percona/grafana-dashboards/pull/717/files - this is when aggregation by service_name was introduced. 

So can you check the mongodb_mongod_replset_member_replication_lag metrics in Grafana Explore - do you have a single series for a single replica? Or multiple series for a single replica with all possible service_names?
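
A hedged way to run that check in Grafana Explore, assuming the replica set is named rs0 (replace with your own set label):

# Lists every (node, reporter, member) label combination behind the metric:
count by (node_name, service_name, name) (mongodb_mongod_replset_member_replication_lag{set="rs0"})

# Counts distinct service_name values per node; more than one per node_name
# indicates the fan-out described above:
count by (node_name) (count by (node_name, service_name) (mongodb_mongod_replset_member_replication_lag{set="rs0"}))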

Denys Kondratenko November 11, 2021 at 1:12 PM

Did that dashboard work before? Maybe your environment changed so that the hostname is always the same? If the default doesn't work (replicas are running on the same host), an additional `--service-name` makes sense.

The problem with changing the default field is that it could break things for others. For example, I attached a screenshot earlier where I can see that name is not unique for rs0.

Artem Meshcheryakov November 8, 2021 at 12:54 AM

The exact command being executed is:

pmm-admin add mongodb --cluster=$MONGODB_CLUSTER --replication-set=$MONGODB_REPLICASET --environment=$MONGODB_ENVIRONMENT --username=$MONGODB_EXPORTER_USER --password=$MONGODB_EXPORTER_PASSWORD --tls-skip-verify $MONGODB_SERVICE_NAME $MONGODB_HOST:27017

and variables "$MONGODB_SERVICE_NAME" and "$MONGODB_HOST" are unique for each monitored instance.
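
For illustration only, the same command with hypothetical placeholder values substituted for the variables (hostnames and credentials are made up):

# Example per-host registration; the service name and address are unique per instance.
pmm-admin add mongodb --cluster=mycluster --replication-set=rs0 --environment=production \
  --username=pmm_user --password='***' --tls-skip-verify \
  mongo-rs0-1 mongo-rs0-1.example.com:27017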

Also, we have not changed the way pmm-agents register themselves in PMM2 for 2 years, but the issue started to appear only recently.

Anyway, if you do not confirm the bug, thank you for your help; if I find the exact cause, I will let you know.

 


Details

Needs QA: Yes
Needs Doc: Yes
Created: October 18, 2021 at 1:57 AM
Updated: March 6, 2024 at 2:00 AM