MongoDB Replication Lag Alert - Issues

Description

Problem 1 (blocker):

Right after importing the rule, trying to open it fails with an error - image attached:

 

Problem 2:

The value written in the alert description is not human-readable. Example below (I masked the real data with ***):

 

Possible Fix to problem 1:

Remove the extra closing ) in

(or add a matching opening ( at the beginning of the statement?)
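
A quick way to confirm an unbalanced parenthesis before and after the fix is to count them. The sketch below is only an illustration: the helper name parenBalance and the sample expression are hypothetical, not the real expression from the template, and parentheses inside quoted strings are not treated specially.

```javascript
// Returns the net parenthesis count of an expression:
// 0 = balanced count, negative = extra ")", positive = extra "(".
// This is a simple counter, not a parser: it ignores ordering and quoting.
function parenBalance(expr) {
  let depth = 0;
  for (const ch of expr) {
    if (ch === "(") depth += 1;
    else if (ch === ")") depth -= 1;
  }
  return depth;
}

// Hypothetical expression with one extra ")"; NOT the template's real query.
const broken = "max by (name) (some_lag_metric)) > 10";
const fixed = "max by (name) (some_lag_metric) > 10";

console.log(parenBalance(broken)); // -1
console.log(parenBalance(fixed)); // 0
```

A negative result matches the "remove one )" option, while adding an opening ( at the start of the statement would also bring the count back to 0.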

 

Possible Fix to problem 2:

Use

instead of

 

Thanks

How to test

Problem 1

  • Install PMM v2.42.0

  • Connect it to the portal so that you get the additional alert rules

  • Find the “MongoDB Replication Lag is high” template in the “Alert rule templates” tab and click “+” to add this alert to your PMM

  • From the “Alert rules” tab, open the alert by clicking the “View” command for this alert

  • As soon as it opens, you will see the error shown in the attached image (in the fixed version, no error appears)

 

Problem 2

  • Start once the alert has been added (see Problem 1)

  • Simulate a lag on one of your replicas. For instance, load the configuration with cfg = rs.conf(), set cfg.members[1].secondaryDelaySecs = 20, and apply it with rs.reconfig(cfg). Here [1] is the position of the node in the members array, which you can find using rs.conf(). You can verify that the lag increased with rs.printSecondaryReplicationInfo()

  • Once the lag has increased you should get the alert (make sure the lag is higher than your configured threshold)

  • Verify that instead of getting “[ var='A' labels={cluster=***, name=***, set=***} value=*** ]” you get something like

    • MongoDB Replication Lag on your_node_here (replicaSet your_replica_set_here) is above the defined threshold of your_threshold_here secs. Current value is 20secs.
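
The lag simulation step above can be sketched as a small mongosh helper. This is a minimal sketch, not a definitive procedure: the function name withSecondaryDelay is hypothetical, it assumes MongoDB 5.0 or later (where the field is called secondaryDelaySecs), and it sets the member priority to 0 because a delayed member must be a priority 0 member.

```javascript
// Sketch: mark one replica set member as delayed so that replication lag grows.
// Hypothetical helper; mutates and returns the config object from rs.conf().
function withSecondaryDelay(cfg, memberIndex, delaySecs) {
  // memberIndex is the position in cfg.members, not the member's _id.
  const member = cfg.members[memberIndex];
  if (!member) {
    throw new Error("no member at index " + memberIndex);
  }
  member.secondaryDelaySecs = delaySecs;
  // Delayed members must have priority 0, so they can never become primary.
  member.priority = 0;
  return cfg;
}

// Usage in mongosh, connected to the primary:
//   rs.reconfig(withSecondaryDelay(rs.conf(), 1, 20))
//   rs.printSecondaryReplicationInfo()
// To undo, set secondaryDelaySecs back to 0 and restore the original priority.
```

Changing the priority alters election behaviour, so this is best done on a test replica set rather than a production one.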

How to document

No new doc needed - this is a fix. However, in general, we have this: and it is missing this alert - it should be added

AFFECTED CS IDs

CS0044176

Attachments

1

Activity

Nailya Kutlubaeva 
August 13, 2024 at 10:47 AM

Verified:
MongoDB Replication Lag on localhost:27019 (replicaSet rs1) is above the defined threshold of 10secs. Current value is 19secs.

Aaditya Dubey 
July 23, 2024 at 7:12 AM

Hi

Thank you for the report.
Please share the exact reproducible steps to debug the issue further.

Done

Created July 22, 2024 at 2:47 PM
Updated 4 days ago
Resolved August 13, 2024 at 8:12 PM