Zabbix noise levels in chat - where do we want to get to?

Howdy Infra-folk,

I need to go do a proper Zabbix update on the other thread, but I wanted to start a sidebar on exactly how much we’d like to see from Zabbix in chat.

Where we’re at

Today, we have 2 problems: first, Nagios is already fairly chatty, and second, Zabbix is even worse right now. Some of this can be improved with clever use of dependent items in Zabbix, but since I intend to tackle that kind of cleverness after we get it live, we need to figure out something simpler for now.

My goal is to make “things that appear in Matrix” == “things we should jump on”. Too often, we look at an alert with fatigue, saying “oh, we can ignore that”, and those items should not (IMO) be in chat. The lower level items should be logged, and should be reviewed, but chat should get our attention when needed, and we’re too jaded for that today.

Where Zabbix helps

Nagios config is limited - it handles a single test, with only a hard limit, with no history, and only at two levels (WARN and CRIT). By comparison, Zabbix has:

  1. 5 levels (Info, Warn, Avg, High, Disaster)
  2. Multiple triggers per item (or combination of items)
  3. Ability to mix hard limits, time averages, trend analysis, etc.

There are many things that are useful to know longer term, without needing to jump on it right away, and I’d rather find a way to lower the fatigue than outright delete useful triggers just because they’re noisy. We should make use of the above things to cut the “cry-wolf syndrome”.
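As a rough illustration of why mixing hard limits with time averages cuts noise, here is a minimal Python sketch. It is not Zabbix trigger syntax; the function names and the CPU readings are made up for the example. A single spike trips the hard limit but not the averaged check.

```python
from statistics import mean


def hard_limit_fires(samples, limit):
    """Hard-limit check: fire as soon as the latest sample exceeds the limit."""
    return samples[-1] > limit


def average_fires(samples, limit, window):
    """Averaged check: fire only if the mean of the last `window` samples exceeds the limit."""
    recent = samples[-window:]
    return mean(recent) > limit


# Hypothetical CPU usage readings (percent), one per minute, ending in a single spike.
cpu = [35, 40, 38, 42, 97]

print(hard_limit_fires(cpu, limit=90))          # True: the spike alone trips it
print(average_fires(cpu, limit=90, window=5))   # False: the 5-sample mean is 50.4
```

The averaged check only fires once the value stays high, which is the kind of alert that is worth someone's attention in chat.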

A proposal in 3 parts

I think we’ve got three things we can do to help cut the noise (which is essential before we switch to Zabbix as our main monitoring):

  1. Restrict Matrix to higher-level triggers (e.g. only Average+ or High+ go to Matrix, with the rest staying in the Zabbix UI)
  2. Make sure the trigger levels we have so far are right (based on the threshold above)
  3. Regularly review top triggers (via the Top 100 report in the UI) and either fix them or delete them from the templates (looking at you, swap usage and disk speeds)

I’m especially curious to know thoughts on the first item - to me, having higher-level alerts in Matrix and keeping the UI tab open in my browser for occasional checks in the day seems a good way to be. Items 2 & 3 are probably an ongoing review - we’ll find issues that should have been higher or lower over time, and put them in the right places.
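The first item boils down to a severity threshold. In Zabbix itself this would normally be a trigger-severity condition on the action that sends to Matrix rather than any code, so treat the following as a sketch of the logic only. The numeric values follow Zabbix's severity numbering; `MATRIX_THRESHOLD`, `goes_to_matrix` and the "Host unreachable" alert are hypothetical names for illustration.

```python
from enum import IntEnum


class Severity(IntEnum):
    """Severity levels, numbered the way Zabbix numbers them."""
    INFO = 1
    WARNING = 2
    AVERAGE = 3
    HIGH = 4
    DISASTER = 5


# Assumption: High+ goes to Matrix; switch to Severity.AVERAGE for Average+.
MATRIX_THRESHOLD = Severity.HIGH


def goes_to_matrix(severity, threshold=MATRIX_THRESHOLD):
    """Return True if an alert of this severity should be posted to chat."""
    return severity >= threshold


# Hypothetical alerts; anything below the threshold stays in the Zabbix UI only.
alerts = [
    ("High swap space usage", Severity.WARNING),
    ("Zombie Processes", Severity.INFO),
    ("Host unreachable", Severity.DISASTER),
]

for name, severity in alerts:
    destination = "Matrix + UI" if goes_to_matrix(severity) else "UI only"
    print(f"{name}: {severity.name} -> {destination}")
```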

Alternatives

We could split the alerts, keeping the dedicated Zabbix room on Matrix, but also putting High+ into the NOC room. I’m not keen on that, since we’d have to split our attention even more on chat. Having the lower-priority items in the UI seems better?

@kevin @james @zlopez interested in your thoughts here? How would you bring down the Zabbix noise?

I think having Matrix show only Zabbix triggers of High+ should be enough.

We can do the reviews at the weekly infra meeting as part of the monitoring topic.

Yeah, that all sounds completely reasonable to me and seems like a good approach.

We can of course always adjust things.

Only slightly related, I wonder if we could set the matrix gateway/hook to send say X per Y and if it’s more than that, just send a ‘LOTS OF ALERTS, GO TO WEB INTERFACE’. That might save us from floods when some important thing (or everything) goes down.
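A minimal sketch of that "X per Y" idea, assuming the hook is something we can put a small filter in front of. `FloodGuard` and its parameters are hypothetical names, not part of Zabbix or any Matrix gateway. It forwards up to `max_alerts` messages per `per_seconds` window, sends one flood notice when the limit is hit, and then stays quiet until the window clears.

```python
import time
from collections import deque

FLOOD_NOTICE = "LOTS OF ALERTS, GO TO WEB INTERFACE"


class FloodGuard:
    """Forward at most `max_alerts` messages per `per_seconds` sliding window."""

    def __init__(self, max_alerts, per_seconds, clock=time.monotonic):
        self.max_alerts = max_alerts
        self.per_seconds = per_seconds
        self.clock = clock            # injectable so the logic can be tested
        self.sent = deque()           # timestamps of forwarded alerts
        self.flood_notified = False   # True once the flood notice has gone out

    def filter(self, message):
        """Return the text to post to chat, or None to suppress it."""
        now = self.clock()
        # Drop timestamps that have fallen out of the window.
        while self.sent and now - self.sent[0] >= self.per_seconds:
            self.sent.popleft()
        if len(self.sent) < self.max_alerts:
            self.sent.append(now)
            self.flood_notified = False
            return message
        if not self.flood_notified:
            self.flood_notified = True
            return FLOOD_NOTICE
        return None


# Example: allow 3 alerts per 60 seconds, then one notice, then silence.
guard = FloodGuard(max_alerts=3, per_seconds=60)
for i in range(6):
    print(guard.filter(f"alert {i}"))
```

Suppressed alerts are not lost in this scheme; they would still be visible in the Zabbix UI, which is where the flood notice points people.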

Thanks both. Regarding the checks that we might want to delete, here are some ideas from the Top 100 Triggers list this week:

  • High swap space usage
  • sdX: Disk read/write request responses are too high
  • Rsyslogd: too many processes
  • Zombie Processes

Do these still provide useful info? Should we delete them, or leave them at the UI-notification level?

I think all of those except the rsyslog one are fine to stop monitoring.

The rsyslogd one might be… but I’d like to actually know why the number of processes sometimes spikes. Is it that it’s doing a lot of work and fires off threads? Or is something crashing and causing it to restart? It’s pretty hard to catch tho, so not sure how we can fully track it down. ;(
