Mongo replication lag graph showing incorrect info

Description

User Impact: Replication lag for the different instances in a MongoDB cluster is presented with incorrect values (the same value is shown for every node), which confuses users.

Actual Behavior:

Wrong values are shown for max replication lag and avg replication lag, and the same replication lag value is shown for all nodes.

Expected Behavior:

A different value for each node.

Details:

The likely reason for this behavior is that mongodb_exporter generates multiple mongodb_mongod_replset_member_replication_lag series, one for each node in the replica set, as in the example below (the query sketch after it shows how this skews the max/avg panels). We can take one of the following approaches:

  1. Create a new metric that shows the replication lag only for the current instance.

  2. Update the current metric to return a value only for the current instance.

  3. Update the current metric to add an extra label that identifies the current instance.

# HELP mongodb_mongod_replset_member_replication_lag The replication lag that this member has with the primary.
# TYPE mongodb_mongod_replset_member_replication_lag gauge
mongodb_mongod_replset_member_replication_lag{name="172.22.0.2:27017",set="rs1",state="ARBITER"} 1.708949012e+09
mongodb_mongod_replset_member_replication_lag{name="172.22.0.3:27017",set="rs1",state="SECONDARY"} 0
mongodb_mongod_replset_member_replication_lag{name="172.22.0.5:27017",set="rs1",state="SECONDARY"} 0
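
Why the max/avg panels go wrong, as a hedged sketch (illustrative panel-style queries, not the exact dashboard expressions): set-wide aggregations mix each exporter's duplicated per-member series and pick up the arbiter's epoch-sized value.

# The ARBITER series above carries ~1.7e9 seconds (roughly 54 years, an
# epoch-timestamp artifact), and every exporter duplicates the per-member
# series, so both aggregates come out wrong:
max(mongodb_mongod_replset_member_replication_lag)
avg(mongodb_mongod_replset_member_replication_lag)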


Original report:

Hi PMM team,

We have pmm-agent running, but the Replication Lag graph is showing incorrect info. For example, from the Mongo shell we can see replication lag on only some nodes, but in PMM lag is shown for all nodes, and it is almost the same for all of them.

Replication lag as seen from the database:

MOJ-Replica:SECONDARY> new Date()
ISODate("2022-01-03T08:31:01.271Z")
MOJ-Replica:SECONDARY> rs.printSecondaryReplicationInfo()
source: vm-prod-mongodb-ruh-01.red.sa:27017
    syncedTo: Mon Jan 03 2022 08:31:02 GMT+0000 (UTC)
    1 secs (0 hrs) behind the primary
source: vm-prod-mongodb-ruh-02.red.sa:27017
    syncedTo: Mon Jan 03 2022 08:31:04 GMT+0000 (UTC)
    -1 secs (0 hrs) behind the primary
source: vm-prod-mongodb-ruh-04.red.sa:27017
    syncedTo: Mon Jan 03 2022 08:27:55 GMT+0000 (UTC)
    188 secs (0.05 hrs) behind the primary
source: vm-prod-mongodb-ruh-03.red.sa:27017
    syncedTo: Mon Jan 03 2022 08:31:02 GMT+0000 (UTC)
    1 secs (0 hrs) behind the primary
source: vm-prod-mongodb-jed-02.red.sa:27017
    syncedTo: Mon Jan 03 2022 08:31:03 GMT+0000 (UTC)
    0 secs (0 hrs) behind the primary
source: vm-prod-mongodb-jed-01.red.sa:27017
    syncedTo: Mon Jan 03 2022 08:31:03 GMT+0000 (UTC)
    0 secs (0 hrs) behind the primary
source: vm-prod-mongodb-percona-bkup:27017
    syncedTo: Mon Jan 03 2022 08:31:04 GMT+0000 (UTC)
    -1 secs (0 hrs) behind the primary
MOJ-Replica:SECONDARY> new Date()
ISODate("2022-01-03T08:31:06.837Z")
MOJ-Replica:SECONDARY>


From PMM, lag is shown for all nodes, while it is actually present only on the vm-prod-mongodb-ruh-04 node (for the same time period mentioned above).

We also restarted the PMM agent, but it still showed the same lag info.

We couldn't get much log info from the pmm-server container; however, we are sharing its logs below.

pmm-server container logs:

podman logs --syslog pmm-server
. . .
Params defined as: WARN[2022-01-03T08:08:07.897+00:00] Configuration warning: unknown environment variable "container=podman".
2022-01-03 08:08:08,445 INFO Included extra file "/etc/supervisord.d/alertmanager.ini" during parsing
2022-01-03 08:08:08,445 INFO Included extra file "/etc/supervisord.d/dbaas-controller.ini" during parsing
2022-01-03 08:08:08,445 INFO Included extra file "/etc/supervisord.d/pmm.ini" during parsing
2022-01-03 08:08:08,445 INFO Included extra file "/etc/supervisord.d/prometheus.ini" during parsing
2022-01-03 08:08:08,445 INFO Included extra file "/etc/supervisord.d/qan-api2.ini" during parsing
2022-01-03 08:08:08,445 INFO Included extra file "/etc/supervisord.d/victoriametrics.ini" during parsing
2022-01-03 08:08:08,446 INFO Included extra file "/etc/supervisord.d/vmalert.ini" during parsing
2022-01-03 08:08:08,446 INFO Set uid to user 0 succeeded
2022-01-03 08:08:08,475 INFO RPC interface 'supervisor' initialized
2022-01-03 08:08:08,476 INFO supervisord started with pid 1
2022-01-03 08:08:09,481 INFO spawned: 'postgresql' with pid 15
2022-01-03 08:08:09,483 INFO spawned: 'clickhouse' with pid 16
2022-01-03 08:08:09,486 INFO spawned: 'grafana' with pid 17
2022-01-03 08:08:09,494 INFO spawned: 'nginx' with pid 18
2022-01-03 08:08:09,497 INFO spawned: 'cron' with pid 19
2022-01-03 08:08:09,502 INFO spawned: 'victoriametrics' with pid 20
2022-01-03 08:08:09,513 INFO spawned: 'vmalert' with pid 21
2022-01-03 08:08:09,519 INFO spawned: 'alertmanager' with pid 22
2022-01-03 08:08:09,530 INFO spawned: 'dashboard-upgrade' with pid 23
2022-01-03 08:08:09,538 INFO spawned: 'qan-api2' with pid 24
2022-01-03 08:08:09,551 INFO spawned: 'pmm-managed' with pid 25
2022-01-03 08:08:09,593 INFO spawned: 'pmm-agent' with pid 26
2022-01-03 08:08:10,613 INFO success: postgresql entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2022-01-03 08:08:10,682 INFO success: clickhouse entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2022-01-03 08:08:10,687 INFO success: grafana entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2022-01-03 08:08:10,715 INFO success: nginx entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2022-01-03 08:08:10,763 INFO success: cron entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2022-01-03 08:08:10,767 INFO success: victoriametrics entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2022-01-03 08:08:10,769 INFO success: vmalert entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2022-01-03 08:08:10,772 INFO success: alertmanager entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2022-01-03 08:08:10,825 INFO success: dashboard-upgrade entered RUNNING state, process has stayed up for > than 0 seconds (startsecs)
2022-01-03 08:08:10,827 INFO success: qan-api2 entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2022-01-03 08:08:10,828 INFO success: pmm-managed entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2022-01-03 08:08:10,855 INFO success: pmm-agent entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2022-01-03 08:08:12,873 INFO exited: qan-api2 (exit status 1; not expected)
2022-01-03 08:08:13,096 INFO spawned: 'qan-api2' with pid 84
2022-01-03 08:08:13,424 INFO exited: qan-api2 (exit status 1; not expected)
2022-01-03 08:08:14,039 INFO exited: pmm-managed (exit status 1; not expected)
2022-01-03 08:08:14,046 INFO spawned: 'pmm-managed' with pid 136
2022-01-03 08:08:14,453 INFO spawned: 'qan-api2' with pid 171
2022-01-03 08:08:14,732 INFO exited: dashboard-upgrade (exit status 0; expected)
2022-01-03 08:08:15,048 INFO success: pmm-managed entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2022-01-03 08:08:15,684 INFO success: qan-api2 entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
[AESSCO] percona@monitor: ~ $ date
Mon Jan 3 08:37:34 UTC 2022

Please look into it and let us know if you require more info/logs.

How to test

  1. Set up a MongoDB cluster.

  2. Set up monitoring for all instances.

  3. Wait until some data is collected.

  4. Check that none of the nodes shows ~54 years as the max replication lag value (see the query sketch after this list).

  5. Stop one of the secondary instances.

  6. Check that the replication lag is shown correctly.

  7. Wait again.

  8. Start the secondary instances.

  9. Check that the replication lag is shown correctly.
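
A quick sanity check for steps 4, 6, and 9, as a hedged sketch (it assumes the metric keeps its current name after the fix):

# Max lag per replica set; this should stay within a few seconds and never
# approach ~1.7e9 (the epoch-timestamp artifact behind the "54 years" reading):
max by (set) (mongodb_mongod_replset_member_replication_lag)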

How to document

None

Attachments (3)

  • 26 Feb 2024, 11:28 AM
  • 26 Feb 2024, 10:18 AM
  • 03 Jan 2022, 08:35 AM

Activity

Nurlan Moldomurov May 28, 2024 at 10:55 PM

Yes, and the second PR is in grafana-dashboards.

saikumar.vs May 21, 2024 at 6:46 AM
Edited

Verified. Please refer to the test case section for detailed info.

Roma Novikov February 27, 2024 at 2:56 PM

The expressions look too complex to me. I would vote for clear metrics instead of complex formulas in several places.

Nurlan Moldomurov February 27, 2024 at 12:41 PM

BTW, I’ve noticed that this complex solution with the mongodb_mongod_replset_my_name metric is already used in a few panels, like this one: https://github.com/percona/grafana-dashboards/blob/7cb2fe75752e5ee58428d9696e4e0a9d9c978001/dashboards/MongoDB/MongoDB_ReplSet_Summary.json#L1192
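
For reference, the pattern being described is roughly the following join (a sketch only, not the exact dashboard expression; label names are assumed from default mongodb_exporter output, and PMM's relabelling may differ):

# mongodb_mongod_replset_my_name is assumed to export the value 1 with the
# exporter's own node in the "name" label; multiplying by it keeps only the
# lag series that each exporter reports about itself:
mongodb_mongod_replset_member_replication_lag
  * on (instance, set, name) group_left()
    mongodb_mongod_replset_my_name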

Ivan Groenewold February 27, 2024 at 11:55 AM
Edited

I suggest Nurlan’s option 3: add a new label to differentiate the source of the lag as seen by each member. Also, the manual says:

The lag reported by secondaries may not be representative of cluster health. Negative values do not indicate that the secondary is ahead of the primary.

To obtain the most current status for your replica set, run db.printSecondaryReplicationInfo() on the primary.

Done

Details

Needs QA

Yes

Needs Doc

No

Created January 3, 2022 at 8:42 AM
Updated August 8, 2024 at 5:02 AM
Resolved June 12, 2024 at 7:57 AM