Alerting

UGCS has several automated alerting systems that let us know when things are breaking or already broken.

Notes on alerts
Some alerts are critical, so it is useful to send them to a cell phone ("paging device"). Most carriers provide an email address that delivers a message to a phone as an SMS. As of 2011, Verizon is tendigitnumber@vtext.com (like 6505555555@vtext.com). AT&T is tendigitnumber@txt.att.net.
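As a rough illustration of the address format described above, the following sketch builds an email-to-SMS address from a phone number and a carrier gateway domain. The function name `sms_address` is hypothetical and not part of any UGCS tooling; the gateway domain is passed in by the caller rather than assumed.

```python
def sms_address(number: str, gateway_domain: str) -> str:
    """Return the email-to-SMS address for a ten-digit phone number.

    `gateway_domain` is the carrier's gateway, e.g. "txt.att.net".
    Punctuation and spaces in `number` are ignored.
    """
    digits = "".join(ch for ch in number if ch.isdigit())
    if len(digits) != 10:
        raise ValueError(f"expected ten digits, got {len(digits)}")
    return f"{digits}@{gateway_domain}"
```

For example, `sms_address("(650) 555-5555", "txt.att.net")` yields `6505555555@txt.att.net`.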

Nagios alerts can sometimes get very spammy, and you may receive hundreds of texts (the Nagios alert rate limiting still needs some tuning). Make sure this will not bankrupt you before you put your phone number in these config files.
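To make the idea of rate limiting concrete, here is a minimal, generic sliding-window throttle. It is only a sketch of the concept and is not how Nagios implements its own notification settings; the class name `PageThrottle` and its parameters are hypothetical.

```python
from collections import deque


class PageThrottle:
    """Allow at most `max_pages` pages within any `window_s` seconds."""

    def __init__(self, max_pages: int, window_s: float):
        self.max_pages = max_pages
        self.window_s = window_s
        self._sent = deque()  # timestamps of pages already allowed

    def allow(self, now: float) -> bool:
        # Forget pages that have aged out of the window.
        while self._sent and now - self._sent[0] >= self.window_s:
            self._sent.popleft()
        if len(self._sent) >= self.max_pages:
            return False  # over budget: drop this page
        self._sent.append(now)
        return True
```

A wrapper in front of a paging address could call `allow(time.time())` before forwarding each message, so a burst of alerts produces only a bounded number of texts.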

We've tried to set up most of our other alerts so they go to sysadmins-alerts@ugcs.caltech.edu, instead of sysadmins@, so that people can be on sysadmins and not get all of the alert spam.

=Nagios Alerting=
Most of our alerts come from Nagios. These include things like host down, service not running, or other problems. To add yourself to the list, edit (in cfengine) nagios3/conf.d/contacts.cfg and nagios3/conf.d/critical_notices.cfg. There is also an 'sms-all' list that contains mail aliases for pagers.

Some services in Nagios have a separate critical alert, such as "Critical load". These are the alerts that get sent to paging devices, so they typically have higher thresholds or longer hold times before they fire. By default, they go through IMSS's mail servers instead of ours, so we can still get notified if our mail system is down.

=Splunk Alerts=
Splunk does regular scans of all of our logs and can alert based on log messages it sees. See Splunk Alerts for more information.

=Kabta ping test=
There is a script running on Kabta that pings UGCS and complains if it gets no reply. This should definitely go to a paging device if possible. You will have to ssh to Kabta to edit the script (it is called from a root crontab or an /etc/cron.d entry).
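For readers unfamiliar with this kind of check, the sketch below shows the general shape of a ping-and-alert script. It is not the actual script on Kabta: every function name, the subject line, and the default SMTP host are assumptions, and the `ping` flags shown are those of Linux iputils.

```python
import smtplib
import subprocess
from email.message import EmailMessage


def host_is_up(host: str, count: int = 3, timeout_s: int = 5) -> bool:
    """Return True if `host` answers ping (Linux iputils flags assumed)."""
    result = subprocess.run(
        ["ping", "-c", str(count), "-W", str(timeout_s), host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def build_alert(host: str, sender: str, recipients: list) -> EmailMessage:
    """Build the alert email; the wording here is illustrative only."""
    msg = EmailMessage()
    msg["Subject"] = f"PING FAILED: {host}"
    msg["From"] = sender
    msg["To"] = ", ".join(recipients)
    msg.set_content(f"{host} did not answer ping.")
    return msg


def send_via_smtp(msg: EmailMessage, smtp_host: str = "localhost") -> None:
    """Hand the message to an SMTP server (host name is an assumption)."""
    with smtplib.SMTP(smtp_host) as smtp:
        smtp.send_message(msg)


def run_check(host, sender, recipients, is_up=host_is_up, send=send_via_smtp):
    """Ping `host`; send an alert and return True only if it is down."""
    if is_up(host):
        return False
    send(build_alert(host, sender, recipients))
    return True
```

The `is_up` and `send` parameters exist so the logic can be exercised without touching the network. A cron entry would call `run_check` with the real host and a paging address among the recipients.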

=Email Heartbeat=
We have an end-to-end email testing system that sends a message through UGCS once every 5 minutes and complains if the message arrives too late. To add yourself, edit the config file in hermes:/etc/email_heartbeat.

By default these go through IMSS's mail server (this is pretty clear from the config file). You should probably send them to a paging device and a non-UGCS email.
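The core of a heartbeat check like this is a lateness test. The sketch below shows one way to express it; it is not the code running on hermes. The 5-minute interval comes from the description above, while the function name and the 2-minute grace period are assumptions chosen for illustration.

```python
from datetime import datetime, timedelta


def heartbeat_is_late(
    last_seen: datetime,
    now: datetime,
    interval: timedelta = timedelta(minutes=5),
    grace: timedelta = timedelta(minutes=2),  # assumed slack for mail delays
) -> bool:
    """Return True if no heartbeat has arrived within interval + grace."""
    return now - last_seen > interval + grace
```

With these defaults, a heartbeat last seen at 12:00 is still considered on time at 12:07 and counts as late at any point after that.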

=Cron Jobs=
While not technically alerts, cron job failures go to root@, which goes to sysadmins@. It would be nice if they were redirected to sysadmin-cron or something similar, because they can be very spammy.