The PMM Server API (via /v1/readyz) now also returns Grafana status information in addition to that for Prometheus.

Description

Currently, the pmm-managed readiness API (/v1/readyz) checks only Prometheus status (and, indirectly, returns nothing if nginx, pmm-managed, or PostgreSQL is down). Managed services require a check for Grafana too.

DoD

  • /v1/readyz returns an error if Grafana is not ready (down, starting up, or shutting down).

Implementation

  • Check what the Grafana Health API returns when Grafana is starting up or shutting down.

  • Add a method to our Grafana client to access that API. We might need to inspect the response body for that, not only the status code.

  • Use that method in the readiness API.

Discussion

  • We are not checking `supervisorctl status` output (as used by the update mechanism) because it is too brittle and a constant source of tricky update bugs.

How to test

None

How to document

None

Activity

Alexey Palazhchenko 
April 20, 2020 at 9:41 AM

Merged into .0 branch.

C W 
January 8, 2020 at 12:00 PM

that's fine, so long as all services that should be in the RUNNING state under normal operations are checked

Alexey Palazhchenko 
January 8, 2020 at 11:16 AM

Please rely only on:

  • response code is 200 = container is ready;

  • any other response code or no code at all = container is not ready.

Do not rely on other response codes, any response body (including empty JSON), etc. Supporting many failure modes requires effort disproportionate to the benefits.

C W 
January 8, 2020 at 11:01 AM

we require v1/readyz to confirm that everything is ready, not just Prometheus. In particular, Grafana needs to be monitored as there is no clean way to check that we can interact with the API.

Also, you currently get an HTML 500 response by stopping pmm-managed, so adding that will presumably require NGINX adjustments to return a JSON 500 when requesting with Content-type: application/json

Alexey Palazhchenko 
July 30, 2019 at 10:57 AM

For to plan / prioritize future work.

Done

Created January 23, 2018 at 2:34 PM
Updated April 22, 2025 at 7:12 AM
Resolved April 22, 2020 at 9:34 AM