Improvement to the MongoDB replication lag alert
Description
How to test
Install PMM
Add a replicaset or sharded cluster
You have to simulate a lag on one of your replicas. For instance, you can use cfg.members[1].secondaryDelaySecs = 20, where [1] is the index, in the members array, of the node whose lag you want to increase. This index can be found using rs.conf(). You can verify that the lag increased with rs.printSecondaryReplicationInfo().
Full set of commands to apply the lag (assuming the node is at index 1):
cfg = rs.conf()
cfg.members[1].priority = 0
cfg.members[1].hidden = true
cfg.members[1].secondaryDelaySecs = 20
rs.reconfig(cfg)
Command to verify:
rs.printSecondaryReplicationInfo()
Connect PMM to the portal so that you will get additional alert rules
Identify the “MongoDB Replication Lag is high” template in the “Alert rule templates” tab and click “+” to add this alert to your PMM. Make sure you set a suitable threshold. For instance, use 10 seconds, since the delay set in the step above is 20 seconds and 20 > 10.
Problem 1: Bring this replica down. Verify that the alert is resolved.
Problem 2: Verify that you get an alert only for this replica and no other alerts (e.g. for the primary).
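The reconfiguration steps above can be wrapped in two small helpers, which also make it easy to restore the member once testing is done. This is only a sketch: the names withDelay and withoutDelay are hypothetical, and restoring to priority 1 and hidden = false assumes the member had those default values before the test. Both functions return a modified copy of the object produced by rs.conf() and do not contact the replica set themselves.

```javascript
// Hypothetical helpers operating on the object returned by rs.conf() in mongosh.
// They return a new config object and leave the input untouched.

// Turn members[index] into a hidden, non-electable secondary delayed by delaySecs.
function withDelay(cfg, index, delaySecs) {
  return {
    ...cfg,
    members: cfg.members.map((m, i) =>
      i === index
        ? { ...m, priority: 0, hidden: true, secondaryDelaySecs: delaySecs }
        : m
    ),
  };
}

// Undo withDelay. ASSUMPTION: the member originally had priority 1 and was not
// hidden; pass a different priority if that was not the case.
function withoutDelay(cfg, index, priority = 1) {
  return {
    ...cfg,
    members: cfg.members.map((m, i) =>
      i === index
        ? { ...m, priority: priority, hidden: false, secondaryDelaySecs: 0 }
        : m
    ),
  };
}

// Usage in mongosh, on the primary (node at index 1, 20 seconds of lag):
//   rs.reconfig(withDelay(rs.conf(), 1, 20))
//   rs.printSecondaryReplicationInfo()
//   rs.reconfig(withoutDelay(rs.conf(), 1))   // after the test
```

Only the targeted member is copied, so the other fields of the configuration are passed back to rs.reconfig() unchanged.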
How to document
No new doc needed - this is a fix. However, in general, we have this:

Roma Novikov August 6, 2024 at 8:01 AM
Please note: The fix for the Alert Template has been merged and released. It is a part of the SAAS component and has begun to be delivered to customers who are connected to the portal. The usual delivery time is 24 hours. During this period, it should be delivered to all customers who are connected and have an internet connection to portal.percona.com. After the template is updated, users should recreate their alert rules to use this new template.

Nurlan Moldomurov June 25, 2024 at 12:12 PM
Once you have finished, could you please fill in the “How to test” field so our QA team can test it?

Roma Novikov February 27, 2024 at 8:41 AM
It looks like the data is not very easy to use and it’s better to wait for

Aaditya Dubey February 25, 2024 at 2:52 PM
Hi
Thank you for the report and feedback
Details
Assignee: Santo Leto
Reporter: Santo Leto
Priority: Medium
Components: (not set)
Needs QA: Yes
Needs Doc: No
Planned Version/s: (not set)
Story Points: 1
Affects versions: (not set)
Smart Checklist Progress: 0/1

This is a ticket to improve the “MongoDB replication lag” alert.
Problem 1:
As it currently stands, the alert can create unnecessary noise: if a node goes down, e.g. for planned OS maintenance, the alert is triggered. If the node is down, no alert should be sent.
One way to improve this is the following:
Add a condition so that the lag is calculated only if the node is up.
Example:
{state!="ARBITER",state!="(not reachable/healthy)"}
Problem 2:
Currently, the alert is also sent for the PRIMARY. This does not make sense, as the PRIMARY has no lag relative to itself.
A way to fix both this and the problem above is:
Only alert when state="SECONDARY", since primary, arbiter and non-healthy members do not give a useful value for lag.
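The intended behaviour can be illustrated with a small sketch that works on an object shaped like the output of rs.status(). It is not how PMM evaluates the alert (the template uses exporter metrics with a state label). The function names secondaryLagSeconds and membersToAlert are hypothetical and only show the effect of restricting the calculation to members in the SECONDARY state.

```javascript
// Illustration only. `status` is assumed to have the shape of rs.status():
// members[].name, members[].stateStr and members[].optimeDate (a Date).

// Lag in seconds of each SECONDARY relative to the PRIMARY.
// PRIMARY, ARBITER and "(not reachable/healthy)" members are skipped.
function secondaryLagSeconds(status) {
  const primary = status.members.find((m) => m.stateStr === "PRIMARY");
  if (!primary) return {}; // without a primary there is nothing to compare to
  const lags = {};
  for (const m of status.members) {
    if (m.stateStr !== "SECONDARY") continue;
    // Subtracting two Dates yields milliseconds.
    lags[m.name] = Math.max(0, (primary.optimeDate - m.optimeDate) / 1000);
  }
  return lags;
}

// Names of the members whose lag exceeds the threshold (in seconds).
function membersToAlert(status, thresholdSecs) {
  return Object.entries(secondaryLagSeconds(status))
    .filter(([, lag]) => lag > thresholdSecs)
    .map(([name]) => name);
}

// Usage in mongosh: membersToAlert(rs.status(), 10)
```

With this filter, a node that is down or is the primary never produces a lag value, so it cannot trigger the alert.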
Thank you!