Failover

We currently have redundancy only for Kerberos and LDAP. We really should make our other critical services (mail and web) redundant as well. We should also set up a failover NFS server for the shell servers.

For general concepts, see the Linux-HA project at http://www.linux-ha.org/. There are many different approaches to redundancy and failover, each with its own advantages and disadvantages.

DRBD is Distributed Replicated Block Device. Think of it as RAID 1 between two computers. In the usual single-primary setup, the device can be mounted on only one of the two nodes at a time, and you use Heartbeat (from Linux-HA) to decide which node is primary. See http://www.drbd.org/
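
The "who is primary" decision described above can be sketched roughly as follows. This is only an illustration of the idea, not how Heartbeat is actually implemented: the `Node` class, the `decide_role` function, the node names, and the 30-second timeout are all assumptions made up for the example.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """One member of a two-node failover pair (hypothetical model)."""
    name: str
    role: str = "secondary"      # "primary" or "secondary"
    last_heartbeat: float = 0.0  # time at which the peer last heard from this node


def peer_is_dead(peer: Node, now: float, deadtime: float) -> bool:
    """Treat the peer as dead once it has been silent for more than `deadtime` seconds."""
    return (now - peer.last_heartbeat) > deadtime


def decide_role(me: Node, peer: Node, now: float, deadtime: float = 30.0) -> str:
    """Return the role `me` should hold right now.

    A primary stays primary. A secondary promotes itself only when the
    current primary has stopped sending heartbeats.
    """
    if me.role == "primary":
        return "primary"
    if peer.role == "primary" and peer_is_dead(peer, now, deadtime):
        return "primary"
    return "secondary"


# Hypothetical pair: node-a is primary and was last heard from at t=100.
node_a = Node("node-a", role="primary", last_heartbeat=100.0)
node_b = Node("node-b", role="secondary", last_heartbeat=100.0)
```

With these values, `decide_role(node_b, node_a, now=110.0)` keeps node-b as secondary, while `decide_role(node_b, node_a, now=140.0)` promotes it. The sketch ignores split-brain: if only the link between the nodes fails, both could end up believing they are primary, which is something a real setup has to guard against.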

NFS
Priority: Medium-low
 * See http://www.linux-ha.org/HaNFS
 * All we need right now is another server with enough disk space
 * Done: hestia/athena

Web

 * We could do either failover (with dionysus) or load balancing. We might need load balancing at some point in the future, but it makes the setup a lot more complicated.

Load-balancing

 * See http://howtoforge.com/haproxy_loadbalancer_debian_etch
 * Pros: Takes load off poseidon, and when we get new hardware, makes it easier to add machines to the web server farm.
 * Downsides:
   * Database access may get complicated. We would have to set up the defaults correctly so that it "just works". We would also have to add a failover database (with DRBD, that is not too hard).
   * Postgres and MySQL both support master-slave replication. How hard it is to implement really depends on how much we care about minimizing the SQL operations lost just before a failover. I suspect that the average user website using these database servers isn't oriented towards transactions, so we can't throw nice things like MySQL Cluster at the problem and expect it to solve it for us. It would be useful to know whether anyone has hardcoded an IP address or hostname that we might want to change in order to handle (a) failover events and (b) multiple servers for load distribution.
   * I'm not sure how intensive anyone's PHP scripting is, but if anyone is using sessions, it can be a real pain to serve pages from different servers and still keep session state consistent (i.e., PHP's standard flat-file sessions aren't going to cut it anymore).
   * AFS could probably cut it for handling session information. Most sessions will be handled by poseidon anyway, so any latency before the failover server sees the updates is inconsequential. It's nominally more load on AFS, but AFS is already taking a hit to serve the webpages in the first place.
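
One way to find out whether anyone has hardcoded an IP address or hostname is to scan the user web directories for them. The following is a minimal sketch under several assumptions: the function names, the file suffixes, and the list of hostnames to look for are all made up for illustration, and the IP pattern is deliberately loose (it will also match things like version strings).

```python
import re
from pathlib import Path

# Matches dotted-quad IPv4 literals such as 10.0.0.5 (no range validation).
IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def find_hardcoded_hosts(text, hostnames):
    """Return (line_number, match) pairs for IPv4 literals and known hostnames."""
    host_re = None
    if hostnames:
        host_re = re.compile(
            r"\b(?:" + "|".join(re.escape(h) for h in hostnames) + r")\b",
            re.IGNORECASE,
        )
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for m in IPV4_RE.finditer(line):
            hits.append((lineno, m.group(0)))
        if host_re is not None:
            for m in host_re.finditer(line):
                hits.append((lineno, m.group(0)))
    return hits


def scan_tree(root, hostnames, suffixes=(".php", ".inc")):
    """Walk `root` and collect hits per file, skipping files that cannot be read."""
    report = {}
    for path in Path(root).rglob("*"):
        if path.is_file() and path.suffix in suffixes:
            try:
                text = path.read_text(errors="replace")
            except OSError:
                continue
            hits = find_hardcoded_hosts(text, hostnames)
            if hits:
                report[str(path)] = hits
    return report
```

For example, `find_hardcoded_hosts("$db = mysql_connect('192.168.1.10');", ["poseidon"])` reports the IP address on line 1. Running `scan_tree` over the real web root would give a rough list of sites to check by hand before changing any addresses.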

Failover

 * We would also have to set up the database to fail over
 * Pros: Easier than load-balancing
 * Cons: Doesn't do load-balancing

Priority: Medium-low

Mail

 * Failover and load balancing are easier to do for mail because you just have to add MX records.
 * Because of Mailman, load balancing is harder. We could do failover by putting most of /var/lib/mailman on a DRBD device. In this case, dionysus would probably be the secondary MX and would not accept mail unless it was primary (promotion triggered by Heartbeat when hermes goes down).
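
The reason MX records give failover almost for free is that sending servers try the hosts in order of preference, lowest number first, and fall back to the next one when a host does not accept the connection. A small sketch of that selection logic follows. The preference values 10 and 20, the example.org domain, and the function names are assumptions for illustration only.

```python
def delivery_order(mx_records):
    """Sort (preference, host) pairs so the lowest preference is tried first.

    Real mail servers randomize among hosts with equal preference; this
    sketch simply falls back to alphabetical order in that case.
    """
    return [host for _, host in sorted(mx_records)]


def pick_mail_host(mx_records, is_accepting):
    """Return the first host in preference order that accepts mail, else None."""
    for host in delivery_order(mx_records):
        if is_accepting(host):
            return host
    return None


# Hypothetical records: hermes preferred, dionysus as the backup MX.
records = [(20, "dionysus.example.org"), (10, "hermes.example.org")]
```

Here `pick_mail_host(records, lambda h: True)` returns the hermes entry, and if hermes is down (`lambda h: h != "hermes.example.org"`) it returns the dionysus entry. The `is_accepting` callback stands in for the rule above that dionysus only accepts mail once it has been made primary.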

Priority: Medium-high