Provide metrics on HTTP responses and API endpoint responses in the debug/metrics endpoint
Description
Environment
Activity
In terms of endpoint location I was thinking of just extending: https://orchestrator.example.com/debug/metrics. An explicit API endpoint also works.
I tend to think in MySQL terms. So the 2 a / b look good.
I think it’s also useful to record the latency of each call and sum it over time, similar to MySQL’s P_S.<some_table>.SUM_TIMER_WAIT.
This can then be plotted, and you can generate the delta over time. Since the call counters can be collected over time in the same way, this gives an idea of whether call latency changes over time. Right now we have no insight into orchestrator load. We call the API quite frequently, as orchestrator is integrated into our tooling, so being able to see metrics on the different API endpoints and their behaviour over time would be useful.
I’d see this as: endpoint / response code / { count, latency, success/failure indicator }
Hi @Simon Mudd,
If I understand correctly, what is requested is:
add a new endpoint that provides metrics in the form of JSON
needed metrics are:
array of HTTP response codes
HTTP response code: count
array of all endpoints
endpointN: success count
endpointN: failure count
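To make the requested JSON concrete, here is a small Go sketch of one way the payload could be laid out. The field names (`response_codes`, `endpoints`, `success`, `failures`) and the sample values are assumptions for illustration only; nothing here reflects an existing orchestrator response.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// EndpointCounts holds the success and failure counters for one endpoint.
type EndpointCounts struct {
	Success  uint64 `json:"success"`
	Failures uint64 `json:"failures"`
}

// MetricsReport is a hypothetical top-level payload for the new endpoint.
type MetricsReport struct {
	// ResponseCodes maps an HTTP response code to the number of times it was returned.
	ResponseCodes map[int]uint64 `json:"response_codes"`
	// Endpoints maps an endpoint path to its success and failure counters.
	Endpoints map[string]EndpointCounts `json:"endpoints"`
}

// Render serialises the report; encoding/json emits map keys in sorted order.
func Render(r MetricsReport) (string, error) {
	b, err := json.Marshal(r)
	if err != nil {
		return "", err
	}
	return string(b), nil
}

func main() {
	// Example values only.
	report := MetricsReport{
		ResponseCodes: map[int]uint64{200: 42, 500: 1},
		Endpoints: map[string]EndpointCounts{
			"/api/clusters": {Success: 40, Failures: 2},
		},
	}
	out, err := Render(report)
	if err != nil {
		panic(err)
	}
	fmt.Println(out)
}
```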
I’m not sure about the total latency gauge and the max latency value. Do you mean two global metrics or two per endpoint?
Additionally, how should the total latency gauge be calculated: as an average of all previous requests, or an average of the N previous requests?
Hi @Simon Mudd
Thank you for the report and feedback.
I guess the endpoint code is at:
web: https://github.com/percona/orchestrator/blob/master/go/http/web.go#L401-L441
http: https://github.com/percona/orchestrator/blob/master/go/http/api.go#L3737-L4002
Ideally this can be handled by wrapping the endpoint registration in something that can collect these metrics.
Adding a total latency gauge would also be useful, as would a max_latency value.
It would be convenient to have counters on the debug/metrics API endpoint broken down by HTTP response code, as well as counters of success/failure for each endpoint location.
This would allow orchestrator to provide better SLI metrics on its behaviour, and make it easier to determine whether orchestrator is healthy for each of the endpoints it serves.
This is a suggestion for a nice-to-have improvement to orchestrator so we can have more confidence in how well it is behaving.