Status
Done
Details
Assignee
Nurlan Moldomurov
Reporter
Mukesh Kumar
Priority
Medium
Components
Labels
Needs QA
Yes
Needs Doc
No
Planned Version/s
Fix versions
Story Points
3
Affects versions
Smart Checklist Progress
1/1 - Done
Created January 3, 2022 at 8:42 AM
Updated August 8, 2024 at 5:02 AM
Resolved June 12, 2024 at 7:57 AM
User Impact: Replication lag for the instances in the MongoDB cluster is presented with incorrect values, which confuses users.
Actual Behaviour:
Wrong values for max replication lag and avg replication lag. Same replication lag value for all nodes.
Expected Behavior:
A different value per node
Details:
The likely reason for this behaviour is that mongodb_exporter generates multiple
mongodb_mongod_replset_member_replication_lag
metrics, one per node in the replica set, as in the example below. So we can take one of these approaches:
Create another metric that will show replication lag only for the current instance
Update current metric to return metric only for the current instance
Update current metric to add extra label to differentiate current instance.
# HELP mongodb_mongod_replset_member_replication_lag The replication lag that this member has with the primary.
# TYPE mongodb_mongod_replset_member_replication_lag gauge
mongodb_mongod_replset_member_replication_lag{name="172.22.0.2:27017",set="rs1",state="ARBITER"} 1.708949012e+09
mongodb_mongod_replset_member_replication_lag{name="172.22.0.3:27017",set="rs1",state="SECONDARY"} 0
mongodb_mongod_replset_member_replication_lag{name="172.22.0.5:27017",set="rs1",state="SECONDARY"} 0
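The third option above could be sketched roughly as follows. This is a hypothetical illustration in Python (not the exporter's actual Go code): it renders the metric in Prometheus text format with an extra `self` label, so a dashboard can select only the scraped instance's own lag with `{self="1"}`. The member list, lag values, and the label name `self` are all assumptions for the sketch.

```python
# Hypothetical sketch of "add an extra label to differentiate the current
# instance". Member data mimics what rs.status() reports; lag values are
# illustrative only.

def render_lag_metrics(members, current_name):
    """Render Prometheus text-format lines, one per replica-set member,
    tagging the member this exporter runs on with self="1"."""
    lines = [
        "# HELP mongodb_mongod_replset_member_replication_lag "
        "The replication lag that this member has with the primary.",
        "# TYPE mongodb_mongod_replset_member_replication_lag gauge",
    ]
    for m in members:
        is_self = "1" if m["name"] == current_name else "0"
        lines.append(
            'mongodb_mongod_replset_member_replication_lag'
            '{{name="{name}",set="{set}",state="{state}",self="{self_}"}} {lag}'
            .format(name=m["name"], set=m["set"], state=m["state"],
                    self_=is_self, lag=m["lag"])
        )
    return "\n".join(lines)

members = [
    {"name": "172.22.0.3:27017", "set": "rs1", "state": "SECONDARY", "lag": 0},
    {"name": "172.22.0.5:27017", "set": "rs1", "state": "SECONDARY", "lag": 188},
]
print(render_lag_metrics(members, current_name="172.22.0.5:27017"))
```

With this shape, existing queries keep working, while per-node panels can filter on `self="1"` instead of getting every member's series from every instance.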
Original report:
Hi PMM team,
We have pmm-agent running but upon checking the graph of Replication lag, it is showing incorrect info. For example, we could see there is a replication lag in some nodes from Mongo shell but in PMM, lag is showing for all nodes. In PMM, lag is showing almost the same for all nodes.
From DB replication lag:
MOJ-Replica:SECONDARY> new Date()
ISODate("2022-01-03T08:31:01.271Z")
MOJ-Replica:SECONDARY> rs.printSecondaryReplicationInfo()
source: vm-prod-mongodb-ruh-01.red.sa:27017
	syncedTo: Mon Jan 03 2022 08:31:02 GMT+0000 (UTC)
	1 secs (0 hrs) behind the primary
source: vm-prod-mongodb-ruh-02.red.sa:27017
	syncedTo: Mon Jan 03 2022 08:31:04 GMT+0000 (UTC)
	-1 secs (0 hrs) behind the primary
source: vm-prod-mongodb-ruh-04.red.sa:27017
	syncedTo: Mon Jan 03 2022 08:27:55 GMT+0000 (UTC)
	188 secs (0.05 hrs) behind the primary
source: vm-prod-mongodb-ruh-03.red.sa:27017
	syncedTo: Mon Jan 03 2022 08:31:02 GMT+0000 (UTC)
	1 secs (0 hrs) behind the primary
source: vm-prod-mongodb-jed-02.red.sa:27017
	syncedTo: Mon Jan 03 2022 08:31:03 GMT+0000 (UTC)
	0 secs (0 hrs) behind the primary
source: vm-prod-mongodb-jed-01.red.sa:27017
	syncedTo: Mon Jan 03 2022 08:31:03 GMT+0000 (UTC)
	0 secs (0 hrs) behind the primary
source: vm-prod-mongodb-percona-bkup:27017
	syncedTo: Mon Jan 03 2022 08:31:04 GMT+0000 (UTC)
	-1 secs (0 hrs) behind the primary
MOJ-Replica:SECONDARY> new Date()
ISODate("2022-01-03T08:31:06.837Z")
MOJ-Replica:SECONDARY>
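The "secs behind the primary" figures in the shell output are just the member's syncedTo optime subtracted from the most recent optime in the set; a small negative value simply means the member's recorded optime is a moment ahead of the reference point sampled. A minimal sketch of that arithmetic (a hypothetical helper, not PMM or MongoDB code), using the vm-prod-mongodb-ruh-04 timestamps from above:

```python
from datetime import datetime, timezone

def seconds_behind(primary_synced_to, member_synced_to):
    """Replication lag in seconds: reference (primary) optime minus the
    member's optime, as rs.printSecondaryReplicationInfo() reports it."""
    return int((primary_synced_to - member_synced_to).total_seconds())

# Timestamps taken from the shell output above.
primary = datetime(2022, 1, 3, 8, 31, 3, tzinfo=timezone.utc)
ruh_04 = datetime(2022, 1, 3, 8, 27, 55, tzinfo=timezone.utc)
print(seconds_behind(primary, ruh_04))  # 188, matching "188 secs behind"
```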
In PMM, however, lag is shown for all nodes, while it actually exists on the vm-prod-mongodb-ruh-04 node only (for the same time period mentioned above).
We also restarted the PMM agent, but it still showed the same lag info.
We couldn't get much useful log info from the PMM-server container; however, we are sharing its logs below.
pmm-server container logs:
podman logs --syslog pmm-server
. . .
Params defined as:
WARN[2022-01-03T08:08:07.897+00:00] Configuration warning: unknown environment variable "container=podman".
2022-01-03 08:08:08,445 INFO Included extra file "/etc/supervisord.d/alertmanager.ini" during parsing
2022-01-03 08:08:08,445 INFO Included extra file "/etc/supervisord.d/dbaas-controller.ini" during parsing
2022-01-03 08:08:08,445 INFO Included extra file "/etc/supervisord.d/pmm.ini" during parsing
2022-01-03 08:08:08,445 INFO Included extra file "/etc/supervisord.d/prometheus.ini" during parsing
2022-01-03 08:08:08,445 INFO Included extra file "/etc/supervisord.d/qan-api2.ini" during parsing
2022-01-03 08:08:08,445 INFO Included extra file "/etc/supervisord.d/victoriametrics.ini" during parsing
2022-01-03 08:08:08,446 INFO Included extra file "/etc/supervisord.d/vmalert.ini" during parsing
2022-01-03 08:08:08,446 INFO Set uid to user 0 succeeded
2022-01-03 08:08:08,475 INFO RPC interface 'supervisor' initialized
2022-01-03 08:08:08,476 INFO supervisord started with pid 1
2022-01-03 08:08:09,481 INFO spawned: 'postgresql' with pid 15
2022-01-03 08:08:09,483 INFO spawned: 'clickhouse' with pid 16
2022-01-03 08:08:09,486 INFO spawned: 'grafana' with pid 17
2022-01-03 08:08:09,494 INFO spawned: 'nginx' with pid 18
2022-01-03 08:08:09,497 INFO spawned: 'cron' with pid 19
2022-01-03 08:08:09,502 INFO spawned: 'victoriametrics' with pid 20
2022-01-03 08:08:09,513 INFO spawned: 'vmalert' with pid 21
2022-01-03 08:08:09,519 INFO spawned: 'alertmanager' with pid 22
2022-01-03 08:08:09,530 INFO spawned: 'dashboard-upgrade' with pid 23
2022-01-03 08:08:09,538 INFO spawned: 'qan-api2' with pid 24
2022-01-03 08:08:09,551 INFO spawned: 'pmm-managed' with pid 25
2022-01-03 08:08:09,593 INFO spawned: 'pmm-agent' with pid 26
2022-01-03 08:08:10,613 INFO success: postgresql entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2022-01-03 08:08:10,682 INFO success: clickhouse entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2022-01-03 08:08:10,687 INFO success: grafana entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2022-01-03 08:08:10,715 INFO success: nginx entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2022-01-03 08:08:10,763 INFO success: cron entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2022-01-03 08:08:10,767 INFO success: victoriametrics entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2022-01-03 08:08:10,769 INFO success: vmalert entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2022-01-03 08:08:10,772 INFO success: alertmanager entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2022-01-03 08:08:10,825 INFO success: dashboard-upgrade entered RUNNING state, process has stayed up for > than 0 seconds (startsecs)
2022-01-03 08:08:10,827 INFO success: qan-api2 entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2022-01-03 08:08:10,828 INFO success: pmm-managed entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2022-01-03 08:08:10,855 INFO success: pmm-agent entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2022-01-03 08:08:12,873 INFO exited: qan-api2 (exit status 1; not expected)
2022-01-03 08:08:13,096 INFO spawned: 'qan-api2' with pid 84
2022-01-03 08:08:13,424 INFO exited: qan-api2 (exit status 1; not expected)
2022-01-03 08:08:14,039 INFO exited: pmm-managed (exit status 1; not expected)
2022-01-03 08:08:14,046 INFO spawned: 'pmm-managed' with pid 136
2022-01-03 08:08:14,453 INFO spawned: 'qan-api2' with pid 171
2022-01-03 08:08:14,732 INFO exited: dashboard-upgrade (exit status 0; expected)
2022-01-03 08:08:15,048 INFO success: pmm-managed entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2022-01-03 08:08:15,684 INFO success: qan-api2 entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
[AESSCO] percona@monitor: ~ $ date
Mon Jan 3 08:37:34 UTC 2022
Please look into it and let us know if you require more info/logs.