Improvement to the MongoDB replication lag alert

Description

This ticket proposes improvements to the “MongoDB replication lag alert”.

Problem 1:

As currently defined, the alert fires when a node goes down, for example during planned OS maintenance, which causes unnecessary noise. If the node is down, no alert should be sent.

One way to improve is the following:

  • Add a condition so that the lag is calculated only if the node is up.

Example:

{state!="ARBITER",state!="(not reachable/healthy)"}

Problem 2:

Currently, the alert is also sent for the PRIMARY. This does not make sense, because the PRIMARY cannot lag behind itself.

A way to fix both this and the problem above is:

  • Only alert when state="SECONDARY", since primary, arbiter, and unhealthy nodes do not give a useful value for lag.
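The effect of the proposed condition can be illustrated with a small sketch. It is not the alert template itself: the sample shape, the field names (name, state, lagSecs) and the function membersToAlert are hypothetical, and only the state strings are taken from this ticket.

```javascript
// Hypothetical sample shape: one entry per replica set member.
// The state strings mirror the label values quoted in this ticket.
function membersToAlert(samples, thresholdSecs) {
  return samples
    // Keep only SECONDARY members: this drops the PRIMARY, arbiters,
    // and nodes that are down or unhealthy in a single condition.
    .filter((m) => m.state === "SECONDARY")
    // Among those, keep the ones whose lag exceeds the threshold.
    .filter((m) => m.lagSecs > thresholdSecs)
    .map((m) => m.name);
}

const samples = [
  { name: "node0", state: "PRIMARY", lagSecs: 0 },
  { name: "node1", state: "SECONDARY", lagSecs: 20 },
  { name: "node2", state: "SECONDARY", lagSecs: 1 },
  { name: "node3", state: "ARBITER", lagSecs: 0 },
  { name: "node4", state: "(not reachable/healthy)", lagSecs: 3600 },
];

console.log(membersToAlert(samples, 10)); // prints [ 'node1' ]
```

Note that node4 reports a large stale lag value but is not selected, which is the behaviour asked for in Problem 1.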

Thank you!

How to test

  • Install PMM

  • Add a replica set or sharded cluster

  • Simulate a lag on one of your replicas. For instance, you can set cfg.members[1].secondaryDelaySecs = 20, where [1] is the index of the node in the cfg.members array (not necessarily the member's _id). The index can be found by inspecting the output of rs.conf(). You can verify that the lag increased with rs.printSecondaryReplicationInfo()

    • Full set of commands to apply the lag (assuming the node is at index 1 in cfg.members):
      cfg = rs.conf()
      cfg.members[1].priority = 0
      cfg.members[1].hidden = true
      cfg.members[1].secondaryDelaySecs = 20
      rs.reconfig(cfg)

    • Command to verify:
      rs.printSecondaryReplicationInfo()

  • Connect PMM to the portal so that you will get additional alert rules

  • Find the “MongoDB Replication Lag is high” template in the “Alert rule templates” tab and click “+” to add this alert to your PMM. Make sure you set a suitable threshold, for instance 10 seconds, which is below the 20-second delay configured in the step above

  • Problem 1: Bring this replica down. Verify that the alert is resolved.

  • Problem 2: Verify that you get an alert only for this replica and no other alerts (e.g. for the primary)
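The reconfiguration commands from the steps above can also be wrapped in a small helper, which makes it harder to modify the wrong member by accident. This is only a sketch: withDelayedMember is a hypothetical name, and the cfg object below is a made-up stand-in for what rs.conf() returns.

```javascript
// Hypothetical helper: returns a copy of a replica set config in which
// one member is hidden, non-electable, and delayed by delaySecs.
// memberIndex is the position in cfg.members, not the member's _id.
function withDelayedMember(cfg, memberIndex, delaySecs) {
  if (memberIndex < 0 || memberIndex >= cfg.members.length) {
    throw new RangeError("no member at index " + memberIndex);
  }
  const members = cfg.members.map((m, i) =>
    i === memberIndex
      ? { ...m, priority: 0, hidden: true, secondaryDelaySecs: delaySecs }
      : { ...m }
  );
  return { ...cfg, members };
}

// Stand-in for the object returned by rs.conf(); the hosts are made up.
const cfg = {
  _id: "rs0",
  version: 1,
  members: [
    { _id: 0, host: "node0:27017" },
    { _id: 1, host: "node1:27017" },
    { _id: 2, host: "node2:27017" },
  ],
};

const delayed = withDelayedMember(cfg, 1, 20);
console.log(delayed.members[1].secondaryDelaySecs); // prints 20
```

In the mongo shell, the intended usage would be rs.reconfig(withDelayedMember(rs.conf(), 1, 20)), assuming the helper has been defined in the session first.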

How to document

No new doc needed - this is a fix. However, in general, we have this:

AFFECTED CS IDs

CS0044176

Activity

Roma Novikov August 6, 2024 at 8:01 AM

Please note: The fix for the Alert Template has been merged and released. It is a part of the SAAS component and has begun to be delivered to customers who are connected to the portal. The usual delivery time is 24 hours. During this period, it should be delivered to all customers who are connected and have an internet connection to portal.percona.com. After the template is updated, users should recreate their alert rules to use this new template.

Nurlan Moldomurov June 25, 2024 at 12:12 PM

Once you have finished, could you please fill in the “How to test” field so our QAs can test it?

Roma Novikov February 27, 2024 at 8:41 AM

It looks like the data is not very easy to use and it’s better to wait for

Aaditya Dubey February 25, 2024 at 2:52 PM

Hi

Thank you for the report and feedback

Done

Details

Needs QA

Yes

Needs Doc

No

Created February 25, 2024 at 12:15 PM
Updated January 15, 2025 at 4:07 PM
Resolved August 19, 2024 at 2:50 PM