Feb 14, 2012 AFS problems

On the morning of Feb 14, 2012, UGCS had a large AFS outage.

Timeline
All times in PST.


 * 7:45: alexr notices AFS problems, reboots apollo and athena.
 * 8:20: jdhutchin starts salvage on apollo and athena.
 * ~9:00: Salvage on athena finishes.
 * 9:20: jdhutchin restarts postfix on hermes; the mail queues flush quickly.

Root cause

 * Athena hung and was force-rebooted with IPMI. The forced reboot left all of its volumes requiring a salvage.
 * Apollo hung and was rebooted by hand with reboot.
 * While athena was down, and until its salvage finished, the RW copy of root.afs was offline.

Prevention

 * Upgrading to OpenAFS 1.5 and enabling demand attach should reduce the time spent salvaging: volumes can be salvaged as needed, instead of every volume having to be salvaged before the fileserver can start up.
 * Moving more paths to /afs/ugcs instead of /afs/.ugcs.
 * Possibly moving mail delivery/retrieval to /afs/ugcs instead of /afs/.ugcs, and ensuring that dovecot can work without a home directory. It is unclear whether this is possible: the config file has some nasty warnings against doing this, so it needs investigation.
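
As a rough illustration of the path moves suggested above, the following Python sketch rewrites a path under /afs/.ugcs to the matching path under /afs/ugcs. It assumes the usual AFS convention that the dotted path is the read-write mount and the undotted path is the replicated read-only one; the function name and the example paths are hypothetical, not taken from the UGCS configuration.

```python
# Sketch only: assumes /afs/.ugcs is the read-write mount and
# /afs/ugcs is the read-only replicated mount (usual AFS convention).
RW_PREFIX = "/afs/.ugcs"
RO_PREFIX = "/afs/ugcs"


def to_readonly_path(path: str) -> str:
    """Return the /afs/ugcs equivalent of a /afs/.ugcs path.

    Paths outside /afs/.ugcs are returned unchanged.
    """
    # Match the prefix only on a whole path component, so that
    # something like /afs/.ugcs-old is not rewritten by accident.
    if path == RW_PREFIX or path.startswith(RW_PREFIX + "/"):
        return RO_PREFIX + path[len(RW_PREFIX):]
    return path


if __name__ == "__main__":
    # Hypothetical example paths.
    for p in ["/afs/.ugcs/user/example/.forward", "/afs/ugcs/common", "/etc/passwd"]:
        print(p, "->", to_readonly_path(p))
```

Only paths that never need to be written through should be moved this way, since the read-only path cannot accept writes and only reflects changes once the volume has been released.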