Nagios

We currently use Nagios as a service monitor. It periodically checks nearly every service in UGCS and makes sure that each one is at least responsive. It currently runs on Hestia.

Configuration
Nagios is now configured through cfengine. All of the relevant files are in /afs/ugcs/ugcs-admin/cfengine/hosts/nagios/. The nagios3/ folder contains the actual runtime definitions (commands, hosts, services, etc.), which are copied to /etc/nagios3 on the monitoring server. The plugins/ folder contains plugins that are copied to /usr/local/lib/nagios/plugins. By default, the Nagios server gets a copy of all plugins, while every other machine gets only certain scripts from that folder. Certain files are automatically generated by configurator; see the Makefile in the conf.d/ directory for more information.

NRPE
Many services (like Linux mdadm RAID or NFS mounts) can only be checked locally. We therefore use NRPE to run individual plugins on the remote machines. The NRPE files are also copied via cfengine. The universal file, /etc/nagios/nrpe.cfg (the same on all machines), is generated from /afs/ugcs/ugcs-admin/cfengine/global/nrpe.cfg. The shellservers have an auxiliary file, /etc/nagios/nrpe.d/nrpe-shellserver.cfg, which is copied from /afs/ugcs/ugcs-admin/cfengine/global/shellserver/nrpe-shellserver.cfg.
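As a rough illustration of what a remote check looks like from the monitoring server's side, the sketch below builds the command line for the standard check_nrpe plugin (its -H, -c, and -t options). The plugin path, the host name, and the check_mdadm command name are assumptions made for the example; the real command names are whatever nrpe.cfg defines on the target machine.

```python
# Sketch only: the plugin path below is an assumed install location.
CHECK_NRPE = "/usr/lib/nagios/plugins/check_nrpe"


def nrpe_argv(host, command, timeout=None):
    """Return the argv list asking the NRPE daemon on `host` to run `command`."""
    argv = [CHECK_NRPE, "-H", host, "-c", command]
    if timeout is not None:
        # -t sets the connection timeout in seconds
        argv += ["-t", str(timeout)]
    return argv


# Hypothetical host and command name, for illustration only.
example = nrpe_argv("examplehost.ugcs.caltech.edu", "check_mdadm", timeout=20)
```

The resulting list can be handed to subprocess.run to try a check by hand before wiring it into a service definition.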

statd
We use statd to report various load and usage figures on local machines. Nagios retrieves these figures from statd for performance processing.

Website
You can view the current status of Nagios at https://nagios.ugcs.caltech.edu. It requires a valid UGCS login to access, since the Nagios web interface is notorious for vulnerabilities. If you are a sysadmin and your name is in /etc/nagios3/cgi.conf at the appropriate places, you will be able to run commands from the website (things like ignoring problems or re-scheduling the next check of a host). Look at where the current sysadmins are listed to see where to place your name (as username@UGCS.CALTECH.EDU).

check_ldap
check_ldap won't work unless you modify its plugin command. You need to edit /etc/nagios-plugins/config/ldap.conf so that the command passes $HOSTNAME$.ugcs.caltech.edu instead of $HOSTADDRESS$.
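The edit amounts to a single macro substitution in the command line. The sketch below shows it on an example command line; the exact text in ldap.conf may differ, and both the plugin path and the -b argument here are assumptions.

```python
def use_fqdn(command_line, domain="ugcs.caltech.edu"):
    """Replace the $HOSTADDRESS$ macro with $HOSTNAME$.<domain>."""
    return command_line.replace("$HOSTADDRESS$", "$HOSTNAME$." + domain)


# Assumed example of a check_ldap command line, not copied from ldap.conf.
before = "/usr/lib/nagios/plugins/check_ldap -H '$HOSTADDRESS$' -b '$ARG1$'"
after = use_fqdn(before)
print(after)
```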

Automated fixes
Nagios tries to use Remote System Management to complete some simple fixes. It can currently restart apache2 and mount /mnt/shared and /ug/nfs/cfengine. In the future it may try to fix more problems on its own, like rebooting frozen machines.

Nagios configurator
There is a set of scripts that automatically generates Nagios servicegroups and hostgroups. They are in the configurator directory. The base Python file is NagiosInfo.py. It contains two classes, Hostgroup and Servicegroup, which hold information on the appropriate services.

A hostgroup is a bunch of hosts that are similar. A hostgroup can have classes (from configurator) added to it with hgroup.add_class(classname). You can also include/exclude specific machines by adding their names to the lists hgroup.includes or hgroup.excludes.
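To make the include/exclude behaviour concrete, here is a minimal stand-in for Hostgroup. It is not the code in NagiosInfo.py: only add_class, includes, and excludes come from the description above, while the members method, the class_hosts mapping, and all host and class names are hypothetical. It also assumes that excludes take priority over includes, which the real class may or may not do.

```python
class Hostgroup:
    """Hypothetical stand-in for the Hostgroup class in NagiosInfo.py."""

    def __init__(self, name):
        self.name = name
        self.classes = []   # configurator class names added via add_class
        self.includes = []  # extra hosts to add by name
        self.excludes = []  # hosts to drop by name

    def add_class(self, classname):
        self.classes.append(classname)

    def members(self, class_hosts):
        """Resolve to a sorted host list; class_hosts maps class name -> hosts."""
        hosts = set(self.includes)
        for classname in self.classes:
            hosts.update(class_hosts.get(classname, []))
        # Assumption: an excluded host is dropped even if it was also included.
        return sorted(hosts - set(self.excludes))


# Made-up class and host names, for illustration only.
class_hosts = {"shellserver": ["alpha", "beta", "gamma"]}
hgroup = Hostgroup("shellservers")
hgroup.add_class("shellserver")
hgroup.includes.append("delta")
hgroup.excludes.append("beta")
print(hgroup.members(class_hosts))  # ['alpha', 'delta', 'gamma']
```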

New Nagios configuration
We will be overhauling the Nagios configuration setup to make it easier to manage and add servers/services. In particular, we will (at first) only be using configurator to generate the basic hostgroups coreservers, shellservers, and mortals. We will do this instead of generating all obvious hostgroups from configurator, as we don't add servers often enough to require quick reconfiguration.
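Generating only the three basic hostgroups could look something like the sketch below, which renders standard Nagios "define hostgroup" blocks as text. The render_hostgroup helper and the host names are hypothetical; in practice the member lists would come from configurator.

```python
def render_hostgroup(name, members, alias=None):
    """Return a Nagios 'define hostgroup' object definition as text."""
    lines = [
        "define hostgroup{",
        "    hostgroup_name  " + name,
        "    alias           " + (alias or name),
        "    members         " + ",".join(members),
        "}",
    ]
    return "\n".join(lines) + "\n"


# Made-up member lists; the real ones would be produced by configurator.
groups = {
    "coreservers": ["core1", "core2"],
    "shellservers": ["shell1", "shell2"],
    "mortals": ["mortal1", "mortal2", "mortal3"],
}
config_text = "\n".join(render_hostgroup(name, hosts) for name, hosts in groups.items())
print(config_text)
```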

Each group of hosts and its associated services (such as the standard load checks, or all NFS checks) will be placed in its own file. Then, we will have a servicegroups file to hold all services not covered by the other files.