Migration of Nagios -> Zabbix project, new plan

gwmngilfen · March 14, 2025, 4:24pm

Howdy folks,

As part of my gradual induction into the Infra team, I’ve learned that some while ago there was a plan to migrate Fedora from Nagios to Zabbix - that’s Pagure 11393 for the history. This seemed like an ideal thing to work on - it’s a good idea on its own, but it also gets me hands-on with most (if not all) of the infra.

As I was looking into how to go about doing this, we also had interest from some other members of the wider infra team, notably @dkirwan (who was involved in the last attempt) and @markrosenbaum. So, it made sense to put our heads together and start figuring out some next steps.

The result is listed at Fedora Zabbix Planning doc & notes - HackMD where we’re going into a fair amount of detail about what we want to achieve, but broadly it’s:

Get Zabbix updated to 7.0 in STG (already done, at least for the server)
Check that the worst of the custom Nagios checks can be ported to Zabbix, so we have confidence that all the checks can be ported in time
Implement as much in STG as we can to test
Roll out to Prod once we’re confident it’ll work

More details can be found in the doc, and once we’re rolling we will create any needed tickets or other supporting things. We’re happy to have input (or more volunteers!) on the plans.

smilner · March 14, 2025, 4:42pm

Wonderful news and great write-up! The plan, risks, mistakes, challenges, and the recommendation to freeze Nagios in Phase 3 all make a lot of sense!

markrosenbaum · March 14, 2025, 5:02pm

Zabbix server is now at 7.0 LTS in both STG and Prod!

zlopez · March 17, 2025, 2:53pm

That is a good plan. One thing to take into consideration is that we also want to use AWX with the Fedora Ansible repository, which will need a change in the structure of the repository. That could affect the Zabbix work as well.

gwmngilfen · March 17, 2025, 5:21pm

@zlopez so I’ve tried to keep that in mind - unfortunately that’s a massive project in its own right, and the only way I think we could really keep that clean here is to start a new Ansible repo just for monitoring. That feels risky to me - we’d be creating a split brain problem and dealing with two problems at once (Zabbix, and Ansible). The split brain is inevitable when we tackle this, but at least it can be limited to just Ansible…

However, this is why I’m thinking about how we can move any needed agent monitoring out into tasks/monitoring.yml so that we’re not adding further trouble for ourselves down the road.

I’m doing a spike on how the checks are currently being deployed right now, for both Nagios and Zabbix, and comparing that to other Zabbix deployments I’m aware of, to see what I can learn & form some opinions. I’ll have some results in a few days, I hope!

kevin · March 17, 2025, 5:46pm

Hey Greg. Thanks for taking this on.

A few comments:

I’m not sure ‘moving’ things from nagios to zabbix is going to work
well. Or I guess I might not be clear on what you are proposing there.

The reason I say this is twofold: First, removing just specific things
from nagios is likely to be difficult since it generates everything via
ansible templates from inventory. We did have some way to say ‘no
monitoring on this machine’ but it’s clearly not working, because we
tried it with ‘logdetective02.fedoraproject.org’ without success.
Secondly, if we remove a check from nagios and add it to zabbix, we now
need to make sure we pay attention to both of them and it might not be
clear what’s “only in zabbix” now.

So, I’d suggest just perhaps keeping track somewhere of what is
considered ‘working in zabbix’ as they are moved and then shutting off
nagios (or perhaps just not redeploying it in the new dc?)
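
One way to keep that kind of list is a small script that compares check names from both systems. This is only a sketch: the host and check names below are made up, and how the two lists would be exported from Nagios and Zabbix is left open.

```python
def coverage_report(nagios_checks, zabbix_checks):
    """Group (host, check) pairs by which system currently covers them."""
    nagios = set(nagios_checks)
    zabbix = set(zabbix_checks)
    return {
        "working_in_zabbix": sorted(nagios & zabbix),
        "nagios_only": sorted(nagios - zabbix),
        "zabbix_only": sorted(zabbix - nagios),
    }


def safe_to_retire_nagios(report):
    """Nagios can only go away once nothing is covered by Nagios alone."""
    return not report["nagios_only"]


if __name__ == "__main__":
    # Hypothetical data, for illustration only.
    nagios = [("host01.example.org", "ping"), ("host01.example.org", "disk")]
    zabbix = [("host01.example.org", "ping")]
    report = coverage_report(nagios, zabbix)
    for section, entries in report.items():
        print(section, entries)
    print("safe to retire nagios:", safe_to_retire_nagios(report))
```

The "nagios_only" section is the to-do list, and "working_in_zabbix" is the list of things considered moved.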

Otherwise looks great thanks for working on this!

gwmngilfen · March 18, 2025, 9:51am

OK, so “port” might have been a bad choice of words. What I mean is “copy”, I think, but I avoided that because I don’t necessarily mean “direct copy”, but rather getting staging Zabbix monitoring the same things as Nagios - just not perhaps in the same way, if need be. So, copying the end state, I think?

That means we can (at least operationally) ignore Zabbix until we hit the point of it being “done”, because we’re not removing things from Nagios - and the Nagios freeze is only for Prod, by which time most of the Ansible code should have been tested in Staging, and the bulk of the Prod migration can be pretty quick. It does mean we have to choose one of those two scenarios (either finding a way to disable hosts in Nagios, or watching both systems) but hopefully for a short window.

Ultimately it’s about confidence, because we have to trust our monitoring, so before we go to Prod, we have to be happy that Staging is working.

gwmngilfen · March 18, 2025, 9:55am

I should add that while I see your point on removing hosts from Nagios because of the Ansible inventory, I hope we can move checks inside Ansible, since that code is defined in one place. But as I said above, I’m still diving into the exact deployment code this week, so I’ll put out more detail on that once I’ve absorbed some things.

gwmngilfen · March 19, 2025, 4:51pm

OK, taking some of this on board, and also some discussions I had with @kevin yesterday, I’ve updated the doc with a few extra points. You can search for “19/3” since I tagged my changes with the date, but briefly:

A note about Koji builders needing much lighter checks
Comment about templates we already have Ansible code for
Thoughts about cleanup of templates/triggers on staging Zabbix
A note that we’ll want to think about OpenShift monitoring

I’m looking at some initial explorations with a few of the staging hosts in the next few days.

kevin · March 20, 2025, 6:35pm

Thanks, that all looks good to me.

Note that it’s not just koji builders, but also the vmhosts that host
those builders (but that’s pretty obvious).

Thanks!

gwmngilfen · April 4, 2025, 4:22pm

Time for an update. As a reminder, the ongoing planning & notes are in the HackMD.

So, right now we are mostly done with phase 0. Staging Zabbix has these changes:

Updated to 7.0.11
Had its templates cleaned up so we can see what we’re doing
Switched to native Zabbix Matrix notifications
Set up a new Zabbix bot to route those notifications to #fedora-zodbot:fedora.im
New templates created
- the first is a base template with minimal triggers, suitable for all hosts, even Koji builders
- the second adds the remaining checks from the upstream base as a dependent
Autoregistration is pointed at the first template so all hosts get something
All existing hosts have been moved to the first / second template as appropriate
- This should cut down on the spam from the bvm* hosts since they are in the basic “autoregister” template

This has all been added to Ansible so it can be kept correct.
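
As a rough illustration of the first / second template split described above (this is not the actual Ansible code, and the template names and the pattern list are placeholders):

```python
from fnmatch import fnmatch

# Placeholder names; the real template names are not given in this thread.
BASE_TEMPLATE = "base-minimal"   # minimal triggers, safe for every host
FULL_TEMPLATE = "base-full"      # builds on the minimal template, adds the rest

# Hosts that should only get the minimal template. "bvm*" comes from the
# list above; any further patterns would be an assumption.
LIGHT_HOST_PATTERNS = ["bvm*"]


def template_for(hostname):
    """Pick the template for a host based on its short name."""
    short_name = hostname.split(".", 1)[0]
    if any(fnmatch(short_name, pattern) for pattern in LIGHT_HOST_PATTERNS):
        return BASE_TEMPLATE
    return FULL_TEMPLATE


if __name__ == "__main__":
    # Hypothetical host names, for illustration only.
    for host in ["bvm-example-01.example.org", "web01.example.org"]:
        print(host, "->", template_for(host))
```

Autoregistration corresponds to the minimal branch: a host nobody has classified yet still gets something.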

Next steps are to start the work of porting service monitoring to Zabbix. There are already some examples of this in the codebase, so I’ll be starting there.

gwmngilfen · July 18, 2025, 9:34am

Hello folks, it’s been a while. This obviously got paused during the DC move, as that had priority. As such I’ve not touched it in a few months.

However, with the DC move done, I’m picking it back up. As of this week, we have:

Base Zabbix updated to 7.0
SAML integration deployed (you can log in with a FAS account now, and JIT provisioning appears to work)
Core templates deployed and auto-registration enabled
Ability to easily override template thresholds from Ansible added to inventory vars
Matrix notifications set up

That’s a good base, and I’ve rolled that out to prod as well, so you can log in to either zabbix.fedoraproject.org or zabbix.stg.fedoraproject.org with your FAS account.
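
To give a feel for the threshold overrides mentioned in the list above, here is a minimal sketch. It assumes the overrides end up as Zabbix user macros, where a host-level value wins over the template default. The macro names and values are illustrative and are not taken from the Fedora inventory.

```python
# Illustrative template defaults, expressed as Zabbix-style user macros.
TEMPLATE_DEFAULTS = {
    "{$CPU.UTIL.CRIT}": 90,
    "{$MEMORY.UTIL.MAX}": 90,
}


def effective_thresholds(defaults, host_overrides):
    """Merge per-host overrides over template defaults.

    Unknown macro names are rejected so a typo in the inventory
    does not silently leave the default in place.
    """
    unknown = set(host_overrides) - set(defaults)
    if unknown:
        raise KeyError(f"unknown threshold macros: {sorted(unknown)}")
    return {**defaults, **host_overrides}


if __name__ == "__main__":
    # Hypothetical per-host override, as it might appear in inventory vars.
    builder_overrides = {"{$CPU.UTIL.CRIT}": 98}
    print(effective_thresholds(TEMPLATE_DEFAULTS, builder_overrides))
```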

Right now, any FAS account gets read-only view, sysadmin-noc gets Admin rights (so can ack things or edit thresholds) and sysadmin-main gets Superadmin rights (everything). That likely needs some tweaking, but it’s broadly where we want it to be, I think. Likewise I foresee changes to minor things like which Matrix room notifications go to, and suchlike. Largely though, I think the core setup is solid.
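
The mapping above can be summed up in a few lines. This is only a model of the intended outcome, not how the SAML group mapping is configured in Zabbix, and it assumes that the most privileged matching group wins when someone is in several groups.

```python
# Roles from least to most privileged, using the labels from this post.
ROLES_BY_PRIVILEGE = ["read-only", "Admin", "Superadmin"]

# FAS group -> role, as described above.
GROUP_TO_ROLE = {
    "sysadmin-noc": "Admin",
    "sysadmin-main": "Superadmin",
}


def role_for(fas_groups):
    """Return the role for an account, given its FAS group names."""
    rank = 0  # any FAS account starts with the read-only view
    for group in fas_groups:
        role = GROUP_TO_ROLE.get(group)
        if role is not None:
            rank = max(rank, ROLES_BY_PRIVILEGE.index(role))
    return ROLES_BY_PRIVILEGE[rank]


if __name__ == "__main__":
    # "packager" is just an example of a group with no special mapping.
    for groups in ([], ["packager"], ["sysadmin-noc"], ["sysadmin-noc", "sysadmin-main"]):
        print(groups, "->", role_for(groups))
```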

So, in the next weeks I’ll be moving on to the “pick an application and monitor it” phase where we copy stuff from Nagios to Zabbix, at least in terms of what we monitor (the how might be different ofc). I’m also looking at how we make it easy to deploy monitoring from a given application role, so it’s kept in one place (right now I think the easiest way is a set of templates deployed in the server role, and then an application just needs to add itself to the relevant group, but I’m still experimenting).

Onwards
