pmm_managed_inventory_agents is missing important information

Description

User story:
As a DBA/SRE on pager who needs to respond to a flurry of PMM Agent Down alerts, I am forced to spend time cross-referencing IDs before I can take further action.

UI/UX:
TBD

Acceptance criteria

Out of scope:
TBD

Suggested implementation:

  • Provide the node_name label in the metric

  • Use the node_name label in the templated messaging
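
As a rough sketch of the second suggestion, the "PMM agent down" template could reference node_name in its annotations as follows. This assumes the metric already exposes a node_name label (the first suggestion). The annotation wording is illustrative only and is not the text that ships with PMM.

```yaml
---
templates:
  - name: pmm_agent_down
    version: 1
    summary: PMM agent down
    # Same expression as the current template: the comparison yields 1
    # when the pmm-agent series reports 0.
    expr: 'pmm_managed_inventory_agents{agent_type="pmm-agent"} == bool 0'
    for: 1m
    severity: critical
    annotations:
      # Assumption: node_name is available as a label on the metric.
      # node_id is kept alongside it for cross-referencing with the inventory.
      description: |-
        PMM agent on node {{ $labels.node_name }} (node ID {{ $labels.node_id }}) cannot be reached. Host may be down.
      summary: PMM agent is down ({{ $labels.node_name }})
```

Keeping node_id next to node_name is a design choice for this sketch: the name is what a person on pager recognizes, while the ID remains unambiguous if two nodes share a similar name.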

How to test:
TBD

Details:
The pmm_managed_inventory_agents metric contains IDs, but no human-friendly representation of the instance that has lost connectivity or been stopped. In addition, the alert template uses the unfriendly node_id. The problem is compounded by the fact that the instance value is pmm-server, so it does not identify the affected host either.

Templates for services seem to use service_name in the alert messaging, and templates for nodes use node_name.

Here are example annotations from the current implementation:

How to test

1/ deploy a fresh PMM instance

2/ add a MySQL (can also be PostgreSQL or MongoDB) service to monitoring

3/ create an alert rule from the template called "PMM agent down"; feel free to change the value of the field "Duration" from 60s to a lower value, say 10s

4/ stop the pmm-agent that monitors the service added at step 2 (usually by running `systemctl stop pmm-agent`)

5/ wait for 60 seconds (or whatever Duration value you have defined at step 3)

6/ go to the Alerts page in the PMM UI (https://<pmm-server>/graph/alerting/alerts)

7/ you should see an active (firing) Alert notification "PMM agent down"

8/ check and confirm that you can see the node name both in the alert description and the summary

How to document

We have added the `node_name` property to the "PMM agent down" alert template that ships with PMM. This makes it easier for users to identify the node where the failure occurred.

Attachments

7

Activity

Naresh December 11, 2023 at 3:02 AM

Sure, Thanks

Alex Demidoff December 7, 2023 at 11:50 AM

FYI: I believe we are now providing the node name in the labels object, so the following syntax could be even more useful:

Naresh December 6, 2023 at 1:38 PM

Hi  

Thanks for the details, I will try to create the templates and update you.

Roma Novikov December 6, 2023 at 10:58 AM

Hi!

What you are asking for is out of scope for this task.
If you want more alert templates covering all agent problems, you can use the metric `pmm_managed_inventory_agents` and, based on the label `agent_type`, create more alert templates [https://docs.percona.com/percona-monitoring-and-management/get-started/alerting.html#template-example].
The current template is:

---
templates:
  - name: pmm_agent_down
    version: 1
    summary: PMM agent down
    expr: 'pmm_managed_inventory_agents{agent_type="pmm-agent"} == bool 0 '
    for: 1m
    severity: critical
    annotations:
      description: |-
        PMM agent on {{ $labels.node_id }} cannot be reached. Host may be down.
      summary: PMM agent is down ({{ $labels.node_id }})

You can use it as a basis for creating more.
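
A minimal sketch of such a derived template is shown here, selecting a different `agent_type`. The template name, the label value `mysqld_exporter`, and the assumption that a metric value of 0 means "not running" for that agent type are all hypothetical and unverified; inspect the `pmm_managed_inventory_agents` series in your own PMM instance before relying on any of them.

```yaml
---
templates:
  # Hypothetical name; choose one that does not clash with built-in templates.
  - name: custom_mysqld_exporter_down
    version: 1
    summary: MySQL exporter down
    # Assumption: "mysqld_exporter" is the agent_type value for this agent,
    # and 0 carries the same "down" meaning as it does for pmm-agent.
    expr: 'pmm_managed_inventory_agents{agent_type="mysqld_exporter"} == bool 0'
    for: 1m
    severity: warning
    annotations:
      description: |-
        MySQL exporter on {{ $labels.node_id }} is not running.
      summary: MySQL exporter is down ({{ $labels.node_id }})
```

Only the selector, severity, and messages differ from the template above, which keeps the derived rule easy to compare against the original.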

Naresh December 5, 2023 at 3:31 PM

Can you please provide an update on the above?

Done

Details

Needs QA: Yes

Needs Doc: No

Created August 30, 2023 at 4:29 PM
Updated March 5, 2024 at 10:11 PM
Resolved September 18, 2023 at 11:41 AM