Nagios Improvements

We should update Nagios to better reflect what the servers are doing:

Services to watch

 * apollo - AFS, rsync
 * athena - AFS
 * demeter - DHCP, TFTP
 * hera - Kerberos (probably not needed)
   * I would definitely monitor Kerberos.
 * hermes - AFS
   * You should also monitor "internal" mail services like amavis, spamassassin, etc. I know we have an end-to-end email heartbeat, but checking individual services would be good too.
   * amavis
   * Mailman
   * spamassassin
   * You could also check for a "correct" number of imap-login processes.
 * hestia - add NFS
 * dionysus
   * sks (PGP key server)
   * nagios (yes, you need to make sure Nagios itself is running; otherwise all of this may be for naught)
 * persephone - bacula
 * zeus - Kerberos (probably not needed)
   * Sometimes LDAP replication breaks and we don't notice. I think there are ways to check the status of the link, but I haven't been able to find them recently.
 * poseidon
   * postgres
   * mysql
 * kabta
   * postfix (secondary MX record)

Shellservers

 * distcc

All Machines

 * Every machine runs its own postfix, which is used to send messages via sendmail. If it goes down, email on that host can pile up and never get delivered.
 * (Almost) all machines have RAID, which we should check with Nagios in addition to mdadm.
 * All non-shellservers run apcupsd, which should be checked.
 * rwhod
 * ntpd (clock drift can cause Kerberos to break)
 * bacula-fd
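
As an illustration of the per-host postfix item above, here is a minimal sketch of a Nagios-style check that warns when the local mail queue starts to pile up. It assumes `postqueue -p` is available on the host and ends its output with either "Mail queue is empty" or a summary line such as "-- 12 Kbytes in 3 Requests."; the function names and the thresholds (50 and 200 messages) are placeholders chosen for the example, not values we have agreed on.

```python
import re
import subprocess

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def parse_queue_size(postqueue_output):
    """Return the number of queued messages, or None if the output is unrecognized."""
    lines = postqueue_output.strip().splitlines()
    if not lines:
        return None
    last = lines[-1]
    if "Mail queue is empty" in last:
        return 0
    # Summary line looks like: "-- 12 Kbytes in 3 Requests."
    match = re.search(r"in (\d+) Requests?", last)
    return int(match.group(1)) if match else None


def classify_queue(size, warn=50, crit=200):
    """Map a queue size to a Nagios status (thresholds are illustrative)."""
    if size is None:
        return UNKNOWN
    if size >= crit:
        return CRITICAL
    if size >= warn:
        return WARNING
    return OK


def check_mail_queue(warn=50, crit=200):
    """Run postqueue and return (status, one-line message for Nagios)."""
    try:
        out = subprocess.run(["postqueue", "-p"], capture_output=True,
                             text=True, check=True).stdout
    except (OSError, subprocess.CalledProcessError):
        return UNKNOWN, "UNKNOWN - could not run postqueue"
    size = parse_queue_size(out)
    status = classify_queue(size, warn, crit)
    label = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}[status]
    return status, f"{label} - {size} message(s) in the mail queue"
```

A wrapper script would call `check_mail_queue()`, print the message, and exit with the returned status. A queue-size check like this catches mail piling up, but a plain process check on postfix is probably still worth having next to it.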

Types of monitoring

There are a few ways to monitor services:
 1) Make sure the process exists. This is the simplest way and can catch many problems. For some services it may be the only check that isn't horribly complex.
   * The checks for multiplecron and svn work this way.
   * You can also look for a "correct" number of processes. If we have fewer than 2 or 3 apache processes on poseidon, for example, something is probably wrong. This method can catch subtler bugs, but has a higher false-positive rate.
 2) Make sure you can reach it over the network. This usually involves making a test TCP connection and making sure it doesn't get refused.
   * ssh, http, https, etc. get tested this way.
 3) Functionality test. These are the hardest to write but the most useful, as they can catch almost any problem. We have a few of these set up:
   * Mail. There is a heartbeat script that sends an email every 5 minutes through a path that includes our incoming SMTP server as well as the various delivery mechanisms (both Mailman and local delivery).
   * Web. Nagios checks both the status of the http port (is it open?) and several test pages (like http://jdtest.caltech.edu/test.php) that test the scripts that let users run scripts as their own user (and database connectivity).
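
The first two types of check above can be sketched as small Nagios-style plugin functions. This is only an illustration, not how our existing checks are written: the function names are made up for the example, and `count_processes` assumes a Linux `ps` that supports `-e -o comm=`. The stock `check_procs` and `check_tcp` plugins already cover the same ground and are likely the better choice in practice.

```python
import socket
import subprocess

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def count_processes(name):
    """Count running processes whose command name is exactly `name`.

    `ps -e -o comm=` prints one command name per line with no header.
    On Linux the name is truncated to 15 characters.
    """
    out = subprocess.run(["ps", "-e", "-o", "comm="],
                         capture_output=True, text=True, check=True).stdout
    return sum(1 for line in out.splitlines() if line.strip() == name)


def classify_count(count, minimum):
    """Process check: CRITICAL if nothing is running, WARNING if below `minimum`."""
    if count == 0:
        return CRITICAL
    if count < minimum:
        return WARNING
    return OK


def check_tcp(host, port, timeout=5.0):
    """Network check: OK if a TCP connection is accepted, CRITICAL otherwise."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return OK
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return CRITICAL
```

For example, `classify_count(count_processes("imap-login"), 2)` would warn when fewer than two imap-login processes are running; the minimum of 2 is a guess and would need tuning per service to keep false positives down.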