As part of my gradual induction into the Infra team, I’ve learned that some time ago there was a plan to migrate Fedora from Nagios to Zabbix - that’s Pagure 11393 for the history. This seemed like an ideal thing to work on - it’s a good idea on its own, but it also gets me hands-on with most (if not all) of the infra.
As I was looking into how to go about doing this, we also had interest from some other members of the wider infra team, notably @dkirwan (who was involved in the last attempt) and @markrosenbaum. So, it made sense to put our heads together and start figuring out some next steps.
Get Zabbix updated to 7.0 in STG (already done, at least for the server)
Check that the worst of the custom Nagios checks can be ported to Zabbix, so we have confidence that all the checks can be ported in time
Implement as much in STG as we can to test
Roll out to Prod once we’re confident it’ll work
More details can be found in the doc, and once we’re rolling we will create any needed tickets or other supporting things. We’re happy to have input (or more volunteers!) on the plans.
That is a good plan. One thing to take into consideration is that we also want to use AWX with the Fedora Ansible repository, which will need a change in the structure of the repository. That could affect the Zabbix work as well.
@zlopez so I’ve tried to keep that in mind - unfortunately that’s a massive project in its own right, and the only way I think we could really keep that clean here is to start a new Ansible repo just for monitoring. That feels risky to me - we’d be creating a split brain problem and dealing with two problems at once (Zabbix, and Ansible). The split brain is inevitable when we tackle this, but at least it can be limited to just Ansible…
However, this is why I’m thinking about how we can move any needed agent monitoring out into tasks/monitoring.yml so that we’re not adding further trouble for ourselves down the road.
I’m doing a spike on how the checks are currently being deployed right now, for both Nagios and Zabbix, and comparing that to other Zabbix deployments I’m aware of, to see what I can learn & form some opinions. I’ll have some results in a few days, I hope!
I’m not sure ‘moving’ things from nagios to zabbix is going to work well. Or I guess I might not be clear on what you are proposing there.
The reason I say this is twofold. First, removing just specific things from nagios is likely to be difficult, since it generates everything via ansible templates from inventory. We did have some way to say ‘no monitoring on this machine’, but it’s clearly not working: we tried it with ‘logdetective02.fedoraproject.org’ and it didn’t work. Secondly, if we remove a check from nagios and add it to zabbix, we now need to make sure we pay attention to both of them, and it might not be clear what’s “only in zabbix” now.
So, I’d suggest just perhaps keeping track somewhere of what is considered ‘working in zabbix’ as checks are moved, and then shutting off nagios (or perhaps just not redeploying it in the new dc?)
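To make the "keeping track somewhere" idea a bit more concrete, here is a minimal sketch of what such a tracker could look like. Everything in it is an assumption for illustration: the host names, the check names, and the idea of representing a check as a (host, check) pair are all hypothetical, and it does not read from the real Ansible inventory, Nagios or Zabbix.

```python
"""Sketch: track which Nagios checks are considered 'working in Zabbix'."""

# Hypothetical data: each check is a (host, check name) pair. In practice
# these lists would have to be produced from the Ansible inventory/templates
# and from Zabbix; this sketch does not talk to either system.
NAGIOS_CHECKS = {
    ("host01.example.org", "disk"),
    ("host01.example.org", "http"),
    ("host02.example.org", "disk"),
}
WORKING_IN_ZABBIX = {
    ("host01.example.org", "disk"),
    ("host02.example.org", "disk"),
    ("host02.example.org", "agent-ping"),
}


def parity_report(nagios_checks, working_in_zabbix):
    """Compare the two sets and group checks by where they are covered."""
    nagios_checks = set(nagios_checks)
    working_in_zabbix = set(working_in_zabbix)
    return {
        # In Nagios and confirmed working in Zabbix.
        "covered": sorted(nagios_checks & working_in_zabbix),
        # Still only in Nagios: these block shutting Nagios off.
        "missing": sorted(nagios_checks - working_in_zabbix),
        # New in Zabbix with no Nagios equivalent.
        "zabbix_only": sorted(working_in_zabbix - nagios_checks),
    }


def safe_to_shut_off_nagios(report):
    """True only when every Nagios check has a working Zabbix counterpart."""
    return not report["missing"]


if __name__ == "__main__":
    report = parity_report(NAGIOS_CHECKS, WORKING_IN_ZABBIX)
    for group, checks in report.items():
        print(f"{group}: {len(checks)}")
        for host, check in checks:
            print(f"  {host}: {check}")
    print("safe to shut off nagios:", safe_to_shut_off_nagios(report))
```

The point of keeping it as plain set arithmetic is that nothing gets removed from Nagios along the way: the "missing" group shrinks as checks are confirmed in Zabbix, and Nagios only goes away once it is empty.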
OK, so “port” might have been a bad choice of words. What I mean is “copy”, I think, but I avoided that because I don’t necessarily mean “direct copy”, but rather getting staging Zabbix monitoring the same things as Nagios - just not perhaps in the same way, if need be. So, copying the end state, I think?
That means we can (at least operationally) ignore Zabbix until we hit the point of it being “done”, because we’re not removing things from Nagios - and the Nagios freeze is only for Prod, by which time most of the Ansible code should have been tested in Staging, and the bulk of the Prod migration can be pretty quick. It does mean we have to choose one of those two scenarios (either finding a way to disable hosts in Nagios, or watching both systems) but hopefully for a short window.
Ultimately it’s about confidence, because we have to trust our monitoring, so before we go to Prod, we have to be happy that Staging is working.
I should add that while I see your point on removing hosts from Nagios because of the Ansible inventory, I hope we can move checks inside Ansible, since that code is defined in one place. But as I said above, I’m still diving into the exact deployment code this week, so I’ll put out more detail on that once I’ve absorbed some things.
OK, taking some of this on board, and also some discussions I had with @kevin yesterday, I’ve updated the doc with a few extra points. You can search for “19/3” since I tagged my changes with the date, but briefly:
A note about Koji builders needing much lighter checks
Comment about templates we already have Ansible code for
Thoughts about cleanup of templates/triggers on staging Zabbix
A note that we’ll want to think about OpenShift monitoring
I’m planning some initial explorations with a few of the staging hosts in the next few days.