Provide metrics on HTTP responses and API endpoint responses in the debug/metrics endpoint

Description

It would be convenient for the debug/metrics API endpoint to expose counters by HTTP response code, as well as success/failure counters for each endpoint location.

This allows orchestrator to provide better SLI metrics on its own behaviour and makes it easier to determine whether orchestrator is healthy for each of the endpoints it serves.

This is a suggestion for a nice-to-have improvement to orchestrator so we can have more confidence in how well it is behaving.

Environment

None

Activity

Simon Mudd December 11, 2024 at 12:36 PM

In terms of endpoint location I was thinking of just extending: https://orchestrator.example.com/debug/metrics. An explicit API endpoint also works.

Simon Mudd December 11, 2024 at 12:34 PM

I tend to think in MySQL terms, so items 2a and 2b look good.

I think it would also be useful to record the latency of each call and sum it over time, similar to MySQL’s P_S.<some_table>.SUM_TIMER_WAIT.

These sums can then be plotted: by collecting the counters periodically and computing the delta between samples, you can see whether call latency changes over time. Right now we have no insight into orchestrator load. We call the API quite frequently because orchestrator is integrated into our tooling, so being able to see metrics on the different API endpoints and their behaviour over time would be useful.

I’d see this as: endpoint / response code / { count, latency, success/failure indicator }

Kamil Holubicki October 30, 2024 at 8:37 AM

Hi,

If I understand correctly, what is requested is:

  1. add a new endpoint that provides metrics in the form of JSON

  2. the metrics needed are:

    1. array of HTTP response codes

      1. HTTP response code: count

    2. array of all endpoints

      1. endpointN: success count

      2. endpointN: failure count
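
As an illustration of the list above, a possible JSON payload could be produced from Go types like the following. This is only a sketch: the type names, the field names `response_codes` and `endpoints`, and the choice of JSON objects keyed by code and by endpoint (rather than literal arrays) are all assumptions, not an agreed format.

```go
package main

import "encoding/json"

// EndpointCounts holds the success and failure counters for one endpoint.
type EndpointCounts struct {
	Success  uint64 `json:"success"`
	Failures uint64 `json:"failures"`
}

// MetricsPayload is a hypothetical response body for the metrics endpoint.
type MetricsPayload struct {
	// ResponseCodes maps an HTTP status code to the number of responses
	// sent with that code. encoding/json writes integer keys as strings.
	ResponseCodes map[int]uint64 `json:"response_codes"`
	// Endpoints maps an endpoint path to its success/failure counters.
	Endpoints map[string]EndpointCounts `json:"endpoints"`
}

// Render serialises the payload to a JSON string.
func Render(p MetricsPayload) (string, error) {
	b, err := json.Marshal(p)
	if err != nil {
		return "", err
	}
	return string(b), nil
}
```

With counts of 5 for code 200 and 1 for code 404, and a placeholder endpoint `/api/example` with 4 successes and 1 failure, `Render` returns `{"response_codes":{"200":5,"404":1},"endpoints":{"/api/example":{"success":4,"failures":1}}}`.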

I’m not sure about the total latency gauge and the max latency value. Do you mean two global metrics, or two per endpoint?

Additionally, how should the total latency gauge be calculated? As an average of all previous requests, or an average of the N previous requests?

Aaditya Dubey September 27, 2024 at 11:28 AM

Hi

Thank you for the report and feedback.

Simon Mudd September 26, 2024 at 1:47 PM

I guess endpoint code is at:

  • web:

  • http:

Ideally this can be handled by wrapping the endpoint registration in something that can collect these metrics.

Adding a total latency gauge would also be useful, as would a max_latency value.

Details

Assignee

Reporter

Needs QA

Yes

Components

Priority

Created September 26, 2024 at 1:40 PM
Updated December 11, 2024 at 12:36 PM