[PoC] Intuitive cluster navigation for MongoDB

Description

(This concept may have value to the other sharded DBMSes. I'll stick to the MongoDB case here though.)

Grafana's idiom is to show a metric for every server that reports it and to leave it to the user to filter out the servers they don't want. This is slower and less intuitive for people who have used dedicated admin software for a distributed service (databases, search engines, clustered this and that).

I think first-time users of a Grafana dashboard made for MongoDB metrics may not even find the cluster name select control before they try more eye-catching GUI elements, such as the edit action on a graph (which takes them off-page).

Grafana's softly-configurable, WYSIWYG-created dashboards:

  • Do not give you a way to change the visual appearance so that the page readily confirms to an unconfident user that they are in the right place, i.e. that they are looking only at the one logical database they intended.

  • Do not give easy visual confirmation that the user didn't over-filter (e.g. accidentally filtered to the servers of just one shard instead of all shards).

  • Do not give easy visual confirmation that all the nodes you would expect are there. (E.g. if one shard of N has one fewer replicaset node than the others because its mongodb_exporter service is off, that should be an obvious visual hole.)

  • Do not show db segments side by side with their peer segments, i.e. shard-level metrics side by side only with other shards (of the same cluster), and replicaset nodes side by side only with their own replicaset peers.
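
The peer grouping and the "obvious hole" check described in the list above can be sketched in a few lines of Python. This is only an illustration, not Grafana or mongodb_exporter code: the `Node` record and its `cluster`, `replicaset` and `host` fields are hypothetical stand-ins for whatever labels the exporter really attaches, and comparing each shard against its largest peer is an assumed heuristic.

```python
from collections import defaultdict
from typing import NamedTuple


class Node(NamedTuple):
    # Hypothetical label set; real exporter label names may differ.
    cluster: str
    replicaset: str
    host: str


def group_peers(nodes):
    """Group hosts by cluster, then by replicaset (shard)."""
    grouped = defaultdict(lambda: defaultdict(list))
    for node in nodes:
        grouped[node.cluster][node.replicaset].append(node.host)
    # Convert to plain dicts with sorted host lists for stable display.
    return {
        cluster: {rs: sorted(hosts) for rs, hosts in shards.items()}
        for cluster, shards in grouped.items()
    }


def find_holes(shards):
    """Return {replicaset: missing_count} for shards smaller than their largest peer."""
    if not shards:
        return {}
    largest = max(len(hosts) for hosts in shards.values())
    return {
        rs: largest - len(hosts)
        for rs, hosts in shards.items()
        if len(hosts) < largest
    }
```

Calling `find_holes(group_peers(nodes)["some-cluster"])` would list the shards that report fewer nodes than their peers. A deployment whose shards legitimately differ in size would need an explicit expected count instead of this heuristic.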

I propose that we have:

  • A "MongoDB Deployments" dashboard

    • It would have no metrics. It is a list of cluster names plus some short text facts: "Cluster" vs. "Replicaset" (for a non-sharded replicaset); count of shards in the cluster, or of nodes in the non-sharded replicaset; min and max detected version(s).

    • The names are links that open the "MongoDB Cluster" dashboard for a cluster, or the "MongoDB Replicaset" dashboard for a non-sharded replicaset.

  • A "MongoDB Cluster" dashboard

    • It has a compulsory parameter: the cluster name/id. (If it is absent, show a link back to "MongoDB Deployments".)

    • It has the cluster name, version(s) and db size as a big fat header at the top. If this header and those fields can't fit alongside the standard top-right Grafana controls (interval, auto-refresh), I would say the Grafana controls should be moved down.

    • A single, billboard-sized metric graph control under that. Let's say ~80% of the width of the page, ~40% of the height. The metric shown could be the op counts by default.

    • The shards will be listed below - with each being a link to the "MongoDB Replicaset" dashboard

    • The metrics will be a time series for each shard. This will show you if the load is even between the shards, or when it is not, show you which shard is the most loaded.

    • The metric in the big graph should be selectable in place, e.g. change from op counts to latency to replication lag to data size etc., per shard. It's tempting to put them in multiple vertically aligned graphs, but it's more important to keep the links to each shard's replicaset visible on the screen without scrolling.

  • "MongoDB Replicaset"

    • Should be a table of N x X metric graphs. The N nodes of the replicaset are the columns; the X metrics are the rows. The column headers will be each node's host:port address and replicaset status (i.e. show which one is primary, which are secondary, and which are unhealthy).

    • As with the cluster page it should be an error if it wasn't given a filter (tuple of a cluster name/id + replicaset name) that ensures it isn't selecting and comparing nodes in different replicasets (or clusters).

    • If a replicaset has too many nodes to fit on the screen? (Todo - unsolved problem)

  • "MongoDB Node"

    • Like the current "MongoDB Overview", but limited to a single mongod node. It shows all the metrics that dashboard has, plus the storage engine metrics (see next top-level point) worth showing.

  • Auto exclusion of non-applicable metrics by node type

    • For any page, the metrics shown should automatically exclude those that don't apply to the nodes discovered. E.g. a cluster that uses WiredTiger will have metrics such as "WT Cache Activity" and "WT concurrency tickets" visible, and the graph controls (and any wrapping DOM elements) for MMAP metrics will be hidden automatically. Along with this, the storage-engine dashboards should be deprecated.

  • Some organizations (my gut feeling is < half) have a desire to see all clusters compared on a few metrics (total ops/sec, latency, data size, network traffic in bytes). The new "Xxxx Services Overview" dashboard covers this need i.m.o.

  • Most organizations want a dashboard that shows if any nodes are down. The new "Xxxx Services Overview" dashboard covers this need i.m.o.
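
To make the "MongoDB Deployments" list proposed above more concrete, here is a minimal Python sketch of how its rows could be derived from a node inventory. Everything in it is an assumption for illustration: the `Member` record, its field names, and the idea that a deployment counts as a cluster when at least one member carries a shard name. It is not how Grafana or the exporter works today.

```python
import re
from typing import NamedTuple, Optional


class Member(NamedTuple):
    # Hypothetical inventory record; a real one would come from exporter labels.
    deployment: str        # name shown in the deployments list
    shard: Optional[str]   # shard (replicaset) name, or None when not sharded
    host: str
    version: str           # e.g. "4.4.10"


def version_key(version):
    """Compare versions numerically, so "4.4.10" sorts after "4.4.6"."""
    return tuple(int(part) for part in re.findall(r"\d+", version))


def summarize(members):
    """Build one row per deployment: kind, count, version range, link target."""
    by_deployment = {}
    for member in members:
        by_deployment.setdefault(member.deployment, []).append(member)

    rows = []
    for name in sorted(by_deployment):
        group = by_deployment[name]
        shards = {m.shard for m in group if m.shard is not None}
        versions = sorted({m.version for m in group}, key=version_key)
        sharded = bool(shards)
        rows.append({
            "name": name,
            "kind": "Cluster" if sharded else "Replicaset",
            # Shards for a cluster, nodes for a non-sharded replicaset.
            "count": len(shards) if sharded else len(group),
            "min_version": versions[0],
            "max_version": versions[-1],
            # Dashboard the name should link to.
            "dashboard": "MongoDB Cluster" if sharded else "MongoDB Replicaset",
        })
    return rows
```

The version range is sorted numerically rather than as text, because a plain string sort would place "4.4.10" before "4.4.6" and report the wrong min and max.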

At the same time as making these changes I think we can deprecate the existing views except "MongoDB Services Overview".
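
The automatic exclusion of non-applicable metrics proposed above could work roughly like the following Python sketch. The panel catalogue is hypothetical (the "MMAP Page Faults" title in particular is made up for illustration), and it assumes the storage engine of each discovered node is already known, for example as the `wiredTiger` or `mmapv1` engine name that MongoDB reports.

```python
# Hypothetical panel catalogue: panel title -> storage engine it applies to,
# or None when the panel applies to every node.
PANELS = {
    "Operation Counts": None,
    "Replication Lag": None,
    "WT Cache Activity": "wiredTiger",
    "WT Concurrency Tickets": "wiredTiger",
    "MMAP Page Faults": "mmapv1",  # made-up title, for illustration only
}


def visible_panels(panels, engines_in_use):
    """Keep panels that are engine-agnostic or match an engine actually discovered."""
    return [
        title
        for title, engine in panels.items()
        if engine is None or engine in engines_in_use
    ]
```

For a cluster where only WiredTiger nodes were discovered, `visible_panels(PANELS, {"wiredTiger"})` drops the MMAP panel and keeps the rest, so one dashboard can serve both engines instead of separate storage-engine dashboards.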

How to test

None

How to document

None

Attachments

3

Jira Bot November 17, 2021 at 9:57 PM

Hello,
I'm jira-bot, Percona's automated helper script. Your bug report is important
to us but we've been unable to reproduce it, and asked you for more
information. If we haven't heard from you on this in 3 more weeks, the issue
will be automatically closed.

Details

Needs QA

Yes

Needs Doc

Yes

Created December 28, 2020 at 11:17 AM
Updated June 27, 2024 at 12:19 PM