Good afternoon infra-folk, time for a Zabbix update 
As we go into F43 beta freeze, the current state is that pretty much all the hosts we monitor in Nagios are now present in Zabbix. Last week I added the zabbix_agent role to all the remaining playbooks and ran them, so any host which ran the nagios_client role now has Zabbix too.
There are still a bunch of ping-only hosts that do not run an agent (builders, mgmt interfaces, external services) and are monitored directly from the server - these are ~50% done but not complex and will get finished off.
That means we’re ready to tackle services. I looked at the ~1100 services we currently monitor, and then removed all the ones we already get for “free” from the base template (such as disk, swap, cpu, etc). I then did my best to group/count them, and here’s what I got:
service list
131 Rsyslogd Process
129 mail_queue
63 http-*
60 Check queue *
31 Zombie Processes
31 SSH-virtservers
31 Cron Daemon
23 IPA Free IDs
21 Check datanommer *
20 ICMP-Ping4-vm-builders
11 Check_Raid
8 vpnclients
8 Check MirrorList * Cache
7 Check bus *
6 Check bus server *
5 Check Fedora countme *
5 Check CentOS countme *
4 Varnish Process
4 proxy* mirrorlist docker container
4 Check TicketKey age
4 Check proxies for oversubscription
4 Check ostree summary age
3 IPA Replication Status
3 Check FAS DB
2 SSH-bastion
2 openvpn CRL expiry
2 DNS: fp.o
2 Check * memcached daemon
1 Sigul bridge Process
1 Service
1 Redis/celery queue
1 mail_queue_redhat
1 http(s)-*
1 Check read-only filesystem
1 Check NFS File Locks
1 Check Nagios
1 Check MySQL Backup
1 Check Merged Log
1 check mailman api
1 Check Koji*
1 certgetter-http
I may have grouped wrongly in some cases, but still, that's ~40 types of thing to monitor, which is not so bad.
As a start, I looked at the mail_queue item, which runs mailq and alerts if mails are stuck. It's fairly trivial to set up in Zabbix, and looks like this:
- add a one-line file to the drop-in conf dir, eg /etc/zabbix/zabbix_agentd.d/mailqueue.conf
- content looks like:
UserParameter=item.key,thing-to-run | processing-stuff
- this makes the item available to monitor via the agent
- add item.key to a Zabbix template, along with appropriate triggers
- add the template to the appropriate hosts
- the agent on those hosts will then start running the command and reporting the value
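To make the pattern above concrete, here's a sketch of what the drop-in could look like for the mail queue check - the item key (mail.queue.length) and the mailq parsing are my assumptions for illustration, not a final design:

```
# /etc/zabbix/zabbix_agentd.d/mailqueue.conf (hypothetical sketch)
# Count queued messages: mailq prints one queue-ID line per message,
# each starting with an uppercase hex queue ID.
UserParameter=mail.queue.length,mailq | grep -cE '^[0-9A-F]'
```

A trigger on the template side could then fire on something like `last(/Template/mail.queue.length)>10` (template name and threshold also placeholders).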
Now, this throws up 2 questions. Firstly, and probably easier to answer - the agent will hit SELinux restrictions on running the commands that the NRPE agent runs, because we built a policy for NRPE. I'm researching how other Zabbix users handle SELinux, but I'm leaning towards building a new policy to distribute with the agent. Opinions welcome on this - and I'm also going to take a look at the Ansible community and check whether the way we deploy SELinux policies is still solid.
Second, and more philosophical, is where to put the Ansible code for each check. Today, we have everything in the Nagios server role - but I'm already seeing instances of hosts/services missing in places you'd expect them (certainly one or two of the ping-only mgmt interfaces in rdu3 are). My instinct here is to lean towards the Zabbix monitoring code being added to the role that deploys the service (so, postfix for the check above) - this means that adding the role automatically deploys the necessary monitoring too.
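As a rough sketch of what that could look like in the postfix role (task name, file paths, and handler name are all assumptions, just to show the shape):

```yaml
# roles/postfix/tasks/main.yml (illustrative only)
- name: install zabbix mail queue check
  copy:
    src: zabbix/mailqueue.conf
    dest: /etc/zabbix/zabbix_agentd.d/mailqueue.conf
    owner: root
    group: root
    mode: "0644"
  notify: restart zabbix-agent
  tags:
  - postfix
  - zabbix
```

The nice property is that the check lives and dies with the role: remove postfix from a host and the monitoring expectation goes with it.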
Does this approach make sense / concern anyone? (especially @kevin @james @phsmoura :P) There will always be some things that have to be run against the Zabbix server directly (such as the OCP workers, which we don't run Ansible on today), but I'd like to keep that to a minimum, and have hosts declare their own monitoring needs. I'm trialling some of these ideas out on STG during the freeze, so I can link to PRs with example implementations if that helps.
The best news is that once we've agreed the pattern, it should be fairly straightforward to crank through the list above and implement the service templates - and then we're really close to retiring Nagios. Oh, and if any of those services aren't needed anymore, let me know, I'll be glad to ignore them!