Howdy Infra-folk,
I need to go do a proper Zabbix update on the other thread, but I wanted to start a sidebar on exactly how much we’d like to see from Zabbix in chat.
Where we’re at
Today, we have 2 problems - first, Nagios is already fairly chatty, and second, Zabbix is even worse right now. Some of this can be improved with clever use of dependent items in Zabbix, but since I intend to tackle that kind of cleverness after we get it live, we need to figure out something simpler for now.
My goal is to make “things that appear in Matrix” == “things we should jump on”. Too often, we look at an alert with fatigue, saying “oh, we can ignore that”, and those items should not (IMO) be in chat. The lower level items should be logged, and should be reviewed, but chat should get our attention when needed, and we’re too jaded for that today.
Where Zabbix helps
Nagios config is limited - it handles a single test, with only a hard limit, with no history, and only at two levels (WARN and CRIT). By comparison, Zabbix has:
- 5 severity levels (Info, Warn, Avg, High, Disaster)
- Multiple triggers per item (or combination of items)
- Ability to mix hard limits, time averages, trend analysis, etc.
There are many things that are useful to know longer term, without needing to jump on it right away, and I’d rather find a way to lower the fatigue than outright delete useful triggers just because they’re noisy. We should make use of the above things to cut the “cry-wolf syndrome”.
A proposal in 3 parts
I think we’ve got three things we can do to help cut the noise (which is essential before we switch to Zabbix as our main monitoring):
- Send only higher-level triggers to Matrix (e.g. only Average+ or High+ go to Matrix, with the rest staying in the Zabbix UI)
- Make sure the trigger levels we have so far are right (based on the threshold above)
- Regularly review top triggers (via the Top 100 report in the UI) and either fix them or delete them from the templates (looking at you, swap usage and disk speeds)
I’m especially curious to know thoughts on the first item - to me, having higher-level alerts in Matrix and keeping the UI tab open in my browser for occasional checks during the day seems a good way to be. Items 2 & 3 are probably an ongoing review - we’ll find issues that should have been higher or lower over time, and put them in the right places.
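As a rough illustration of the first item, here is a minimal Python sketch of the "only Average+ goes to Matrix" rule. The numeric values follow Zabbix's own severity numbering, but the function name, the threshold constant, and the sample alerts are all made up for the example. In Zabbix itself this would most likely be a condition on trigger severity in the action that posts to Matrix, not code we write.

```python
from enum import IntEnum


class Severity(IntEnum):
    # Values match Zabbix's severity numbering (0 = Not classified).
    NOT_CLASSIFIED = 0
    INFORMATION = 1
    WARNING = 2
    AVERAGE = 3
    HIGH = 4
    DISASTER = 5


# Assumption: the cut-off is Average+; switch to Severity.HIGH for High+.
MATRIX_THRESHOLD = Severity.AVERAGE


def goes_to_matrix(severity, threshold=MATRIX_THRESHOLD):
    """Return True if an alert of this severity should be posted to chat."""
    return severity >= threshold


# Hypothetical sample alerts, purely for illustration.
alerts = [
    ("swap usage", Severity.WARNING),
    ("disk nearly full", Severity.HIGH),
    ("package updates available", Severity.INFORMATION),
    ("service down", Severity.DISASTER),
]

to_matrix = [name for name, sev in alerts if goes_to_matrix(sev)]
ui_only = [name for name, sev in alerts if not goes_to_matrix(sev)]

print("Matrix:", to_matrix)
print("UI only:", ui_only)
```

The point of making the threshold a single setting is that we can start at Average+ and tighten to High+ later without touching the triggers themselves.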
Alternatives
We could split the alerts, keeping the dedicated Zabbix room on Matrix, but also putting High+ into the NOC room. I’m not keen on that, since we’d have to split our attention even more on chat. Having the lower-priority items in the UI seems better?
@kevin @james @zlopez interested in your thoughts here? How would you bring down the Zabbix noise?