MongoDB ReplSet Summary dashboard incorrect graph
Activity

Taras Kozub November 17, 2021 at 10:36 AM
Flag added
Waiting for the reply from reporter

Denys Kondratenko November 12, 2021 at 7:26 AM
Hm, if you pass different parameters to pmm-admin (the second one after mongodb), it should indeed be different.
Those additional parameters have nothing to do with the exporter. They are handled by PMM: first, pmm-admin gathers them;
second, pmm-managed creates vmagentscrapecfg, and vmagent adds those metrics additionally.
Check out your vmagentscrapecfg for each node (your agent_id will be different):
/tmp/vm_agent/agent_id/e301bb7f-c1b5-417e-b24c-07ecde0429d5/vmagentscrapecfg
Could you please attach `pmm-admin summary` from each node, and all vmagentscrapecfg from all nodes and agents?
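A minimal sketch for collecting the requested files on one node, assuming every agent keeps its config at the path layout shown above (`/tmp/vm_agent/agent_id/<agent_id>/vmagentscrapecfg`). The function name `find_scrape_configs` is hypothetical, and the base directory may differ on your installation:

```python
from pathlib import Path


def find_scrape_configs(base_dir="/tmp/vm_agent/agent_id"):
    """Return every vmagentscrapecfg file found one level below base_dir."""
    base = Path(base_dir)
    if not base.is_dir():
        # Assumed layout is not present on this node: nothing to collect.
        return []
    return sorted(p for p in base.glob("*/vmagentscrapecfg") if p.is_file())


if __name__ == "__main__":
    # Print one path per agent so the files can be attached to the ticket.
    for cfg in find_scrape_configs():
        print(cfg)
```

Running it on each node prints the config paths for all agents registered there.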

Artem Meshcheryakov November 12, 2021 at 12:41 AM
Denys, my facts are:
1) The environment/hostnames did not change.
2) Replicas are running on different hosts.
3) Each host is registered separately with the pmm-admin command, providing the unique service name and host name settings.
4) The dashboard worked fine in the past without changing the environment. It started to misbehave after a series of mongodb + pmm upgrades.
I am trying to drill into it, and I think this is somewhat related to mongodb_exporter changes and the way it deals with labels.
I see that in my situation each replica provides mongodb_mongod_replset_member_replication_lag metric series for each service_name that exists in the replica set, while "node" and "node_name" are unique labels for it.
The change in the dashboard appeared here: https://github.com/percona/grafana-dashboards/pull/717/files - this is when aggregation by service_name was introduced.
So can you check the mongodb_mongod_replset_member_replication_lag metrics in Grafana Explore: do you have a single series per replica, or multiple series per replica with all possible service_names?
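The check above can be sketched offline once the label sets are copied out of Grafana Explore. This is only an illustration: the host and service names below are hypothetical sample data, and `service_names_per_node` is a helper invented for this sketch. It relies only on the `node_name` and `service_name` labels mentioned above.

```python
from collections import defaultdict


def service_names_per_node(series_labels):
    """Map each node_name to the sorted service_name values seen on its series."""
    seen = defaultdict(set)
    for labels in series_labels:
        seen[labels["node_name"]].add(labels["service_name"])
    return {node: sorted(names) for node, names in seen.items()}


# Hypothetical label sets, as they might be copied from Grafana Explore.
sample = [
    {"node_name": "host-a", "service_name": "mongo-a"},
    {"node_name": "host-a", "service_name": "mongo-b"},
    {"node_name": "host-b", "service_name": "mongo-b"},
]

result = service_names_per_node(sample)
# Nodes that report series for more than one service_name.
duplicated = {node: names for node, names in result.items() if len(names) > 1}
print(duplicated)  # {'host-a': ['mongo-a', 'mongo-b']}
```

A non-empty result would mean a single replica exposes series for several service_names, which is the situation described above.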

Denys Kondratenko November 11, 2021 at 1:12 PM
Did that dashboard work before? Maybe your environment changed so that the hostname is always the same? If the default doesn't work (replicas are running on the same host), an additional `--service-name` makes sense.
The problem with changing the default field is that it could break things for others. For example, I attached a screenshot earlier where I see that name is not unique for rs0.

Artem Meshcheryakov November 8, 2021 at 12:54 AM
The exact command getting executed is:
pmm-admin add mongodb --cluster=$MONGODB_CLUSTER --replication-set=$MONGODB_REPLICASET --environment=$MONGODB_ENVIRONMENT --username=$MONGODB_EXPORTER_USER --password=$MONGODB_EXPORTER_PASSWORD --tls-skip-verify $MONGODB_SERVICE_NAME $MONGODB_HOST:27017
and variables "$MONGODB_SERVICE_NAME" and "$MONGODB_HOST" are unique for each monitored instance.
Also, we have not changed the way pmm-agents register themselves in PMM2 in 2 years. But recently the issue started to appear.
Anyway, even if you cannot confirm the bug, thank you for your help. If I find the exact cause, I will let you know.
Details
Assignee: Unassigned
Reporter: Artem Meshcheryakov (Deactivated)
Priority: Medium
Needs QA: Yes
Needs Doc: Yes
Story Points: 2

Hi PMM Team,
I had to fix the Replication Lag graph by replacing `service_name` with `name` in the aggregation clauses of the formula.
https://github.com/percona/grafana-dashboards/blob/main/dashboards/MongoDB_ReplSet_Summary.json#L862
Before:
avg by (service_name) (max(max_over_time(mongodb_mongod_replset_member_replication_lag{set="$replset",service_name="$secondary"}[$interval]) > 0) by (service_name,set) or max(max_over_time(mongodb_mongod_replset_member_replication_lag{set="$replset",service_name="$secondary"}[5m]) > 0) by (service_name,set))
After:
avg by (name) (max(max_over_time(mongodb_mongod_replset_member_replication_lag{set="$replset",service_name="$secondary"}[$interval]) > 0) by (name,set) or max(max_over_time(mongodb_mongod_replset_member_replication_lag{set="$replset",service_name="$secondary"}[5m]) > 0) by (name,set))
If aggregated by `service_name`, the mongodb_mongod_replset_member_replication_lag metric returns the lag info for all replica set members from each replica. Aggregating by `name` gives the correct per-replica info.
Could you please take a look and validate it?
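
To illustrate the difference between the two groupings, here is a rough Python sketch that emulates only the inner `max ... by (...)` step of the formulas. The outer `avg`, the `> 0` filter and the `or` fallback are left out. The sample values, the `mongo-b` service name and the `host:port` values in the `name` label are hypothetical, and `max_by` is a helper written for this sketch, not part of any PMM or Prometheus API.

```python
from collections import defaultdict


def max_by(series, keys):
    """Emulate PromQL `max by (keys)`: group on the given labels, keep the maximum."""
    groups = defaultdict(list)
    for labels, value in series:
        groups[tuple(labels[k] for k in keys)].append(value)
    return {group: max(values) for group, values in groups.items()}


# Hypothetical samples, already filtered to set="rs0" and service_name="mongo-b":
# one exporter reporting a lag value for every replica set member.
series = [
    ({"service_name": "mongo-b", "set": "rs0", "name": "host-a:27017"}, 2.0),
    ({"service_name": "mongo-b", "set": "rs0", "name": "host-b:27017"}, 15.0),
    ({"service_name": "mongo-b", "set": "rs0", "name": "host-c:27017"}, 4.0),
]

# Grouping by service_name collapses all members into one value.
by_service = max_by(series, ["service_name", "set"])
# Grouping by name keeps one value per member.
by_name = max_by(series, ["name", "set"])

print(by_service)
print(by_name)
```

With this sample data, grouping by `service_name` leaves a single value (the largest lag among all members), while grouping by `name` keeps each member's lag separate.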