Status
Done
Details
Assignee
Nurlan Moldomurov
Reporter
Mukesh Kumar
Priority
Medium
Components
Labels
Needs QA
Yes
Needs Doc
No
Planned Version/s
Fix versions
Story Points
3
Affects versions
Smart Checklist Progress
1/1 - Done
Created January 3, 2022 at 8:42 AM
Updated August 8, 2024 at 5:02 AM
Resolved June 12, 2024 at 7:57 AM
User Impact: Replication lag for the instances in the MongoDB cluster is presented with incorrect values, which confuses users.
Actual Behaviour:
Wrong values for max replication lag and avg replication lag. Same replication lag value for all nodes.
Expected Behavior:
A different value per node
Details:
The likely reason for this behaviour is that mongodb_exporter generates multiple
mongodb_mongod_replset_member_replication_lag
metrics, one per node in the replica set, as in the example below. So we can take one of these approaches:
Create another metric that will show replication lag only for the current instance
Update current metric to return metric only for the current instance
Update current metric to add extra label to differentiate current instance.
# HELP mongodb_mongod_replset_member_replication_lag The replication lag that this member has with the primary.
# TYPE mongodb_mongod_replset_member_replication_lag gauge
mongodb_mongod_replset_member_replication_lag{name="172.22.0.2:27017",set="rs1",state="ARBITER"} 1.708949012e+09
mongodb_mongod_replset_member_replication_lag{name="172.22.0.3:27017",set="rs1",state="SECONDARY"} 0
mongodb_mongod_replset_member_replication_lag{name="172.22.0.5:27017",set="rs1",state="SECONDARY"} 0
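The third option above could be sketched roughly as follows. This is a hypothetical illustration in Python (not the exporter's actual Go code): it renders the metric in Prometheus text format with an extra `self` label, so a dashboard can select only the scraped instance's own lag with `{self="1"}`. The member list, lag values, and the label name `self` are all assumptions for the sketch.

```python
# Hypothetical sketch of "add an extra label to differentiate the current
# instance". Member data mimics what rs.status() reports; lag values are
# illustrative only.

def render_lag_metrics(members, current_name):
    """Render Prometheus text-format lines, one per replica-set member,
    tagging the member this exporter runs on with self="1"."""
    lines = [
        "# HELP mongodb_mongod_replset_member_replication_lag "
        "The replication lag that this member has with the primary.",
        "# TYPE mongodb_mongod_replset_member_replication_lag gauge",
    ]
    for m in members:
        is_self = "1" if m["name"] == current_name else "0"
        lines.append(
            'mongodb_mongod_replset_member_replication_lag'
            '{{name="{name}",set="{set}",state="{state}",self="{self_}"}} {lag}'
            .format(name=m["name"], set=m["set"], state=m["state"],
                    self_=is_self, lag=m["lag"])
        )
    return "\n".join(lines)

members = [
    {"name": "172.22.0.3:27017", "set": "rs1", "state": "SECONDARY", "lag": 0},
    {"name": "172.22.0.5:27017", "set": "rs1", "state": "SECONDARY", "lag": 188},
]
print(render_lag_metrics(members, current_name="172.22.0.5:27017"))
```

With this shape, existing queries keep working, while per-node panels can filter on `self="1"` instead of getting every member's series from every instance.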
Original report:
Hi PMM team,
We have pmm-agent running but upon checking the graph of Replication lag, it is showing incorrect info. For example, we could see there is a replication lag in some nodes from Mongo shell but in PMM, lag is showing for all nodes. In PMM, lag is showing almost the same for all nodes.
From DB replication lag:
MOJ-Replica:SECONDARY> new Date()
ISODate("2022-01-03T08:31:01.271Z")
MOJ-Replica:SECONDARY> rs.printSecondaryReplicationInfo()
source: vm-prod-mongodb-ruh-01.red.sa:27017
	syncedTo: Mon Jan 03 2022 08:31:02 GMT+0000 (UTC)
	1 secs (0 hrs) behind the primary
source: vm-prod-mongodb-ruh-02.red.sa:27017
	syncedTo: Mon Jan 03 2022 08:31:04 GMT+0000 (UTC)
	-1 secs (0 hrs) behind the primary
source: vm-prod-mongodb-ruh-04.red.sa:27017
	syncedTo: Mon Jan 03 2022 08:27:55 GMT+0000 (UTC)
	188 secs (0.05 hrs) behind the primary
source: vm-prod-mongodb-ruh-03.red.sa:27017
	syncedTo: Mon Jan 03 2022 08:31:02 GMT+0000 (UTC)
	1 secs (0 hrs) behind the primary
source: vm-prod-mongodb-jed-02.red.sa:27017
	syncedTo: Mon Jan 03 2022 08:31:03 GMT+0000 (UTC)
	0 secs (0 hrs) behind the primary
source: vm-prod-mongodb-jed-01.red.sa:27017
	syncedTo: Mon Jan 03 2022 08:31:03 GMT+0000 (UTC)
	0 secs (0 hrs) behind the primary
source: vm-prod-mongodb-percona-bkup:27017
	syncedTo: Mon Jan 03 2022 08:31:04 GMT+0000 (UTC)
	-1 secs (0 hrs) behind the primary
MOJ-Replica:SECONDARY> new Date()
ISODate("2022-01-03T08:31:06.837Z")
MOJ-Replica:SECONDARY>
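The "secs behind the primary" figures in the shell output are just the member's syncedTo optime subtracted from the most recent optime in the set; a small negative value simply means the member's recorded optime is a moment ahead of the reference point sampled. A minimal sketch of that arithmetic (a hypothetical helper, not PMM or MongoDB code), using the vm-prod-mongodb-ruh-04 timestamps from above:

```python
from datetime import datetime, timezone

def seconds_behind(primary_synced_to, member_synced_to):
    """Replication lag in seconds: reference (primary) optime minus the
    member's optime, as rs.printSecondaryReplicationInfo() reports it."""
    return int((primary_synced_to - member_synced_to).total_seconds())

# Timestamps taken from the shell output above.
primary = datetime(2022, 1, 3, 8, 31, 3, tzinfo=timezone.utc)
ruh_04 = datetime(2022, 1, 3, 8, 27, 55, tzinfo=timezone.utc)
print(seconds_behind(primary, ruh_04))  # 188, matching "188 secs behind"
```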
In PMM, however, lag is shown for all nodes, while it actually exists on the vm-prod-mongodb-ruh-04 node only (for the same time period mentioned above).
We also restarted the PMM agent, but it still showed the same lag info.
We couldn't get much useful log info from the PMM-server container; however, we are sharing its logs below.
pmm-server container logs:
podman logs --syslog pmm-server
. . .
Params defined as:
WARN[2022-01-03T08:08:07.897+00:00] Configuration warning: unknown environment variable "container=podman".
2022-01-03 08:08:08,445 INFO Included extra file "/etc/supervisord.d/alertmanager.ini" during parsing
2022-01-03 08:08:08,445 INFO Included extra file "/etc/supervisord.d/dbaas-controller.ini" during parsing
2022-01-03 08:08:08,445 INFO Included extra file "/etc/supervisord.d/pmm.ini" during parsing
2022-01-03 08:08:08,445 INFO Included extra file "/etc/supervisord.d/prometheus.ini" during parsing
2022-01-03 08:08:08,445 INFO Included extra file "/etc/supervisord.d/qan-api2.ini" during parsing
2022-01-03 08:08:08,445 INFO Included extra file "/etc/supervisord.d/victoriametrics.ini" during parsing
2022-01-03 08:08:08,446 INFO Included extra file "/etc/supervisord.d/vmalert.ini" during parsing
2022-01-03 08:08:08,446 INFO Set uid to user 0 succeeded
2022-01-03 08:08:08,475 INFO RPC interface 'supervisor' initialized
2022-01-03 08:08:08,476 INFO supervisord started with pid 1
2022-01-03 08:08:09,481 INFO spawned: 'postgresql' with pid 15
2022-01-03 08:08:09,483 INFO spawned: 'clickhouse' with pid 16
2022-01-03 08:08:09,486 INFO spawned: 'grafana' with pid 17
2022-01-03 08:08:09,494 INFO spawned: 'nginx' with pid 18
2022-01-03 08:08:09,497 INFO spawned: 'cron' with pid 19
2022-01-03 08:08:09,502 INFO spawned: 'victoriametrics' with pid 20
2022-01-03 08:08:09,513 INFO spawned: 'vmalert' with pid 21
2022-01-03 08:08:09,519 INFO spawned: 'alertmanager' with pid 22
2022-01-03 08:08:09,530 INFO spawned: 'dashboard-upgrade' with pid 23
2022-01-03 08:08:09,538 INFO spawned: 'qan-api2' with pid 24
2022-01-03 08:08:09,551 INFO spawned: 'pmm-managed' with pid 25
2022-01-03 08:08:09,593 INFO spawned: 'pmm-agent' with pid 26
2022-01-03 08:08:10,613 INFO success: postgresql entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2022-01-03 08:08:10,682 INFO success: clickhouse entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2022-01-03 08:08:10,687 INFO success: grafana entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2022-01-03 08:08:10,715 INFO success: nginx entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2022-01-03 08:08:10,763 INFO success: cron entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2022-01-03 08:08:10,767 INFO success: victoriametrics entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2022-01-03 08:08:10,769 INFO success: vmalert entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2022-01-03 08:08:10,772 INFO success: alertmanager entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2022-01-03 08:08:10,825 INFO success: dashboard-upgrade entered RUNNING state, process has stayed up for > than 0 seconds (startsecs)
2022-01-03 08:08:10,827 INFO success: qan-api2 entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2022-01-03 08:08:10,828 INFO success: pmm-managed entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2022-01-03 08:08:10,855 INFO success: pmm-agent entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2022-01-03 08:08:12,873 INFO exited: qan-api2 (exit status 1; not expected)
2022-01-03 08:08:13,096 INFO spawned: 'qan-api2' with pid 84
2022-01-03 08:08:13,424 INFO exited: qan-api2 (exit status 1; not expected)
2022-01-03 08:08:14,039 INFO exited: pmm-managed (exit status 1; not expected)
2022-01-03 08:08:14,046 INFO spawned: 'pmm-managed' with pid 136
2022-01-03 08:08:14,453 INFO spawned: 'qan-api2' with pid 171
2022-01-03 08:08:14,732 INFO exited: dashboard-upgrade (exit status 0; expected)
2022-01-03 08:08:15,048 INFO success: pmm-managed entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2022-01-03 08:08:15,684 INFO success: qan-api2 entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
[AESSCO] percona@monitor: ~ $ date
Mon Jan 3 08:37:34 UTC 2022
Please look into it and let us know if you require more info/logs.