Zabbix dashboards

Hey folks. Now that we have Zabbix in full use, I wonder if we shouldn’t look at making some dashboards to show some things that might be of interest.

Right now there’s just the global one (oddly with two pages in a slideshow?) and the scrapers one.

Some things I’d like to see (based on things I check and/or that people ask about all the time):

  • a signing queue one… perhaps?
  • something about backlog for s390x builders (you can see current builds with koji list-tasks --arch s390x | grep buildArch | grep -v FREE, but I’m not sure how to get that into Zabbix)
  • the scraper load one might need to also include kojipkgs01/02 load and koji01/02 load? or could be another one
  • Perhaps some ‘all proxies’ ones? load on all proxies, interface on all proxies… in order to see where a heavy load / request storm is coming from.
  • Perhaps a dashboard showing all the openshift apps requests/bw used. That would allow seeing easily when there’s a bunch of requests for one service.

Any other ones people can think of? @gwmngilfen thoughts?

So most of these seem sensible. I’d love input from folks on what plots would work for them in their areas of expertise.

On the s390x one, where does that run? On koji, I assume…? We could add an item to those hosts to report queue data back to Zabbix, and then add it to a dashboard…

Yeah, that s390x one-liner was a koji client command run locally, so that will definitely need to be translated/figured out for Zabbix.

koji has some reports, but none of them are particularly useful for us, except perhaps for the cluster health one.

It may be we need to make a custom script that polls koji and gathers stats on how busy s390x is and graphs it.

Ah, if it’s on the builder, that’s trickier, as we don’t run the agent on the builders. Maybe we should revisit that policy? It could give us some useful data, I guess… but we probably want fewer alerts for them :slight_smile:

If we stick to getting it from koji, then we have a bunch of options in front of us around getting the relevant data into Zabbix.
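One of those options, sketched very roughly: push a number into a Zabbix trapper item with the zabbix_sender command-line tool. This is only an illustration, not something we have set up. The server name, host name, and item key below are made-up placeholders, and it assumes a trapper item with that key already exists on the Zabbix side.

```python
import subprocess


def sender_command(server, host, key, value):
    """Build the zabbix_sender argument list for a single value."""
    return [
        "zabbix_sender",
        "-z", server,      # Zabbix server or proxy to send to
        "-s", host,        # host name as configured in Zabbix
        "-k", key,         # key of the trapper item
        "-o", str(value),  # the value itself
    ]


def send(server, host, key, value):
    """Run zabbix_sender; raises CalledProcessError if it exits non-zero."""
    return subprocess.run(sender_command(server, host, key, value), check=True)


# Hypothetical names, for illustration only.
cmd = sender_command("zabbix.example.org", "koji01", "koji.s390x.waiting", 7)
print(" ".join(cmd))
```

A script like this could run from cron or a timer on any host that can reach the Zabbix server. The other obvious routes (an agent item on a host that already runs the agent, or an external check on the server) would trade that push model for a pull one.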

Well, the data we want is about the builders, but doesn’t need anything running on the builders.

The data we need is on the hubs (koji01/02): how many builds are waiting for s390x builders to become available. Even there, we don’t actually need to run anything on those hubs; we just need to make an API call to koji to get that data from the hubs.
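As a rough sketch of what that API call could look like, here is a small Python example using the koji client library's listTasks hub call. Treat it as a starting point under assumptions rather than a tested script: the hub URL is whatever you pass in, the "waiting"/"running" split is my own naming (FREE tasks count as waiting, OPEN and ASSIGNED as running, which mirrors the grep -v FREE in the one-liner), and the sample data at the bottom is invented so the counting part can be tried without talking to a hub.

```python
from collections import Counter

# Task state numbers, matching koji.TASK_STATES.
FREE, OPEN, ASSIGNED = 0, 1, 4


def summarize(tasks):
    """Count waiting vs. running tasks from listTasks-style dicts."""
    states = Counter(task["state"] for task in tasks)
    return {
        "waiting": states[FREE],                     # not yet picked up by a builder
        "running": states[OPEN] + states[ASSIGNED],  # everything that is not FREE
    }


def fetch_s390x_tasks(hub_url):
    """Ask the hub for active s390x buildArch tasks (needs the koji package)."""
    import koji  # imported lazily so summarize() works without it installed

    session = koji.ClientSession(hub_url)
    opts = {
        "arch": ["s390x"],
        "method": "buildArch",
        "state": [FREE, OPEN, ASSIGNED],
    }
    return session.listTasks(opts)


# Made-up sample data standing in for a real hub response.
sample = [
    {"id": 1, "state": FREE},
    {"id": 2, "state": OPEN},
    {"id": 3, "state": FREE},
    {"id": 4, "state": ASSIGNED},
]
print(summarize(sample))
```

In real use, summarize(fetch_s390x_tasks(hub_url)) would give the two numbers to graph. This is an anonymous read-only query, so it should not need any credentials and can run from anywhere that can reach the hub.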

As to monitoring builders, we stopped doing it a long time ago because so much of the normal monitoring is not something we care about on them. I.e., we don’t want to know about ‘high cpu’ or ‘large disk i/o’ or the like… because they are building things and that’s normal for building things. We could look at doing a subset of things, but I also worry about the overhead of the agent there: memory/cpu taken up by the agent means less for builds, but perhaps that is too small to really matter.