Wish List

This page aims to list current improvements we would like to make to the cluster. Ask jdhutchin if you have any questions about them.

Fix splunk

 * The upgrade to 4.x broke it.
 * Requires setting up access to charon (useful for other stuff too)
 * Fixing it lets us set up our log alerting, etc. again.

Squeeze Upgrade
The following computers need to be upgraded to squeeze:
 * Hermes (complicated, postfix needs to be rebuilt with a small patch)
 * Hera (not too bad, DNS CNAMEs need to be changed ahead of time)
 * Charon, enlil, kabta

Fix backups

 * The drives on persephone are too small, so we run out of space
 * This is urgent as we can't currently run a full backup cycle
 * We might want to run bacula "Base" backups to save space and time (see the bacula documentation)
 * Also, tape backups need to be run more often than never

Migrate to postgres 8.4

 * Not too bad
 * User notification required
 * Test mediawiki with it.

Upgrade ugcs_libs

 * The package is mostly built, just needs some testing
 * Needs to be deployed to get rid of deprecation warnings

Audit mailing lists

 * People have signed up random accounts on them and are spying on our mail

Write auto-scanner for malware

 * We need to look at what our web servers are serving and automatically detect when user sites are serving spam or malware.
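
One way such a scanner could start is by pattern-matching the files under each user's web directory. The sketch below is only an illustration: the pattern list, the file extensions, and the function names `find_spam_patterns` and `scan_tree` are all assumptions, and a real scanner would need a maintained signature set and probably a look at what is actually served over HTTP as well.

```python
import os
import re

# Hypothetical patterns; a real deployment would need a tuned, maintained set.
SPAM_PATTERNS = [
    re.compile(r"viagra|cialis", re.IGNORECASE),
    re.compile(r"online\s+casino", re.IGNORECASE),
    # Hidden iframes are a common sign of injected content.
    re.compile(r"<iframe[^>]+style=[\"'][^\"']*display:\s*none", re.IGNORECASE),
]

# Assumed set of file types worth scanning.
SCANNED_EXTENSIONS = (".html", ".htm", ".php", ".js")


def find_spam_patterns(text):
    """Return the pattern strings that match the given text."""
    return [p.pattern for p in SPAM_PATTERNS if p.search(text)]


def scan_tree(root):
    """Walk root and return {path: [matched patterns]} for suspicious files."""
    hits = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if not name.lower().endswith(SCANNED_EXTENSIONS):
                continue
            path = os.path.join(dirpath, name)
            try:
                with open(path, "r", errors="replace") as handle:
                    matched = find_spam_patterns(handle.read())
            except OSError:
                continue  # unreadable file; skip it rather than abort the scan
            if matched:
                hits[path] = matched
    return hits
```

Run from cron against the web root, the resulting dictionary could be mailed to the sysadmins whenever it is non-empty.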

Mediawiki upgrader

 * The upgrade to 1.16 may break for users, especially if they tried to install more than one mediawiki on their account

More website auto-setup

 * Write more utilities like setup-mediawiki for Django, etc., so people can easily set up their own web stuff under UGCS.

Add news / tip of the day
Even better, write your own nice little utilities and then let people know about them.

Upgrade the juniper switch
There is a new version of JOS out that we should upgrade to.

Move Kabta
Kabta sees a lot of intermittent packet loss in its current location.

Autofixers
Set up Nagios so that it more aggressively auto-restarts services when they go down.
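
Nagios supports this through event handlers, which run a command whenever a service changes state. The following is a minimal sketch of such a handler, not a tested configuration: the function names, the soft-attempt threshold, and the `/etc/init.d/` restart path are assumptions that would need to match our actual setup.

```python
import subprocess
import sys


def should_restart(state, state_type, attempt, soft_attempt=3):
    """Decide whether a restart is warranted for this state change.

    Restart when the service is CRITICAL and either the check has reached
    the chosen SOFT attempt or the state has gone HARD. Lowering
    soft_attempt makes the handler more aggressive.
    """
    if state != "CRITICAL":
        return False
    if state_type == "HARD":
        return True
    return state_type == "SOFT" and attempt == soft_attempt


def main(argv):
    """Expected arguments: state, state type, attempt number, service name."""
    if len(argv) != 5:
        print("usage: handler STATE STATETYPE ATTEMPT SERVICE", file=sys.stderr)
        return 2
    state, state_type, attempt, service = argv[1], argv[2], int(argv[3]), argv[4]
    if should_restart(state, state_type, attempt):
        # Assumption: the service has an init script of the same name.
        subprocess.call(["/etc/init.d/" + service, "restart"])
    return 0
```

Nagios would invoke this through a service's `event_handler` directive, with the command line passing the `$SERVICESTATE$`, `$SERVICESTATETYPE$`, and `$SERVICEATTEMPT$` macros followed by the service name. Calling `sys.exit(main(sys.argv))` at the bottom of the file would turn it into a standalone script.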

Maintenance
These are things that we have to do even if there aren't full-time student sysadmins.


 * Account requests and password resets: SLA: 1 day
 * How do we know: We get emails


 * Fix it when it breaks: Server down
 * SLA: 1 hour
 * How do we know: Email alerts for most things, SMS to jdhutchin's phone for really urgent things.
 * Owner: jdhutchin


 * Fix minor support requests for things that are broken: SLA: 5 days
 * Sooner would be better


 * Answer user questions: SLA: Best-effort
 * It would be nice if we could do this but it isn't a top priority

Software
Fix mex (matlab compiler)

Add support for distributed Mathematica on mortals

Small fixes
Small things that need to be fixed across various services/machines:
 * Email heartbeat
 * Hestia SSL cert
 * Change kabta back to ssh keys after Alex/Raymond add theirs
 * Find the sysadmins' PGP key
 * Fix the backup schedules to something sensible

Mail System
See Mail Improvements

Automatic group creation/management
See ugcs groups

Large file hosting
Almost done! See NFS servers. The server is running and exporting things correctly. All we need now is disk quotas.

Account creator / password reset

 * Re-work as necessary to ensure robustness
 * Add exception reporting system (email to sysadmins)
 * Write full test suite to ensure quality
 * Fix the bug where mis-entering your Kerberos password half-creates the account anyway, which is a pain to straighten out.
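
For the exception reporting item, a rough sketch of what the email side might look like is below. The addresses, the subject prefix, and the function names are placeholders invented for illustration; nothing here reflects how the account creator is actually structured.

```python
import smtplib
import traceback
from email.message import EmailMessage

# Placeholder addresses; substitute the real sysadmin list and sender.
SYSADMIN_ADDRESS = "sysadmins@example.org"
SENDER_ADDRESS = "account-creator@example.org"


def build_exception_report(exc, context):
    """Build an email describing an exception and what was being attempted."""
    # Collapse whitespace so a multi-line exception message cannot break
    # the Subject header, and keep the subject to a reasonable length.
    summary = " ".join(str(exc).split())[:100]
    msg = EmailMessage()
    msg["Subject"] = "[account-creator] %s: %s" % (type(exc).__name__, summary)
    msg["From"] = SENDER_ADDRESS
    msg["To"] = SYSADMIN_ADDRESS
    trace = "".join(
        traceback.format_exception(type(exc), exc, exc.__traceback__)
    )
    msg.set_content("Context: %s\n\n%s" % (context, trace))
    return msg


def report_exception(exc, context, host="localhost"):
    """Send the report through an SMTP server (assumed to be local)."""
    with smtplib.SMTP(host) as server:
        server.send_message(build_exception_report(exc, context))
```

The account creator would call `report_exception` from a top-level `except` block. The context string should name the step and the username being processed, and should never include the password.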

Network

 * Write a system that shows us mac/ip/port number
 * Add port mirroring to charon for desirable traffic
 * Improve firewalls
 * Enable switch port security
 * Fix switch names
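
The mac/ip/port item amounts to joining two tables: IP to MAC (from a host's ARP cache) and MAC to switch port (from the switch's forwarding table). The sketch below assumes the ARP data is in Linux `/proc/net/arp` format and that the switch table has already been fetched into a dictionary by some other means; the function names and the port names in the usage note are made up.

```python
def parse_arp_table(text):
    """Parse /proc/net/arp-style text into a list of (ip, mac) pairs."""
    pairs = []
    for line in text.splitlines()[1:]:  # first line is the column header
        fields = line.split()
        if len(fields) < 6:
            continue
        ip, mac = fields[0], fields[3].lower()
        if mac == "00:00:00:00:00:00":
            continue  # incomplete entry, no MAC learned yet
        pairs.append((ip, mac))
    return pairs


def build_port_map(arp_text, mac_to_port):
    """Join ARP entries with a {mac: port} table into (mac, ip, port) rows."""
    ports = {mac.lower(): port for mac, port in mac_to_port.items()}
    rows = [
        (mac, ip, ports.get(mac, "unknown"))
        for ip, mac in parse_arp_table(arp_text)
    ]
    return sorted(rows)
```

For example, `build_port_map(open("/proc/net/arp").read(), {"00:11:22:33:44:55": "ge-0/0/1"})` would list every host in the local ARP cache, with `unknown` for any MAC the switch table does not contain.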

Hardware

 * Set up hestia to take over for dionysus - in progress
 * Network card flip: put one of the single gigabit cards into charon, and move its two-port gigabit cards into poseidon and persephone

Web hosting

 * Add a failover web server

Global login records
We need to implement something with LDAP so that we have global login records.

Documentation

 * We need a printed-out copy of critical wiki stuff
 * We need to make more documentation about our services for disaster recovery.
 * We need to update all of the core server pages with correct disk setups and currently running services.