Jan25 Kerberos Incident

On January 25, 2010, many of the shellservers were unable to complete any Kerberos operations. The cause was an upgraded Kerberos library from Debian testing that was incompatible with our existing Kerberos libraries. The problem was fixed about an hour after both users and UGCS admins first noticed it.

Symptoms
Kerberos operations failed, as did getting AFS tokens. A common error message was:

    tobin@melpomene:~$ kinit
    kinit: relocation error: /usr/lib/libdes425.so.3: symbol des_IP_table, version k5crypto_3_MIT not defined in file libk5crypto.so.3 with link time reference

Users were generally able to log in but could not get AFS tokens, and therefore couldn't use their home directories.

Cause
The cause of the problem was that Debian testing upgraded its Kerberos libraries to krb1.8-alpha while the rest of the cluster, including userspace programs, was still using Kerberos 1.6.

Solution
We downgraded the appropriate packages (libkrb5support, libk5crypto3, libkrb5-3, libgssapi-krb5) to 1.7+dfsg4 using deb archives in /var/cache/apt/archives.
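
A downgrade of this kind can be sketched roughly as follows. This is a minimal sketch, not necessarily the exact commands that were run: the use of dpkg -i, the file-name pattern, and the downgrade_cmd helper are assumptions; only the package names, the version, and the archive directory come from the description above. The script only prints the command so that it can be reviewed before being run as root.

```shell
#!/bin/sh
# Sketch: print a dpkg command that reinstalls the cached 1.7+dfsg4 packages.
# Package names, version and archive directory are taken from the report;
# the file-name pattern <pkg>*_<version>*.deb is an assumption (the wildcards
# allow for name suffixes and Debian revision/architecture parts).

ARCHIVE_DIR=/var/cache/apt/archives
VERSION='1.7+dfsg4'
PACKAGES='libkrb5support libk5crypto3 libkrb5-3 libgssapi-krb5'

# downgrade_cmd is a hypothetical helper: it echoes the command instead of
# executing it, so nothing on the system is changed by running this script.
downgrade_cmd() {
    cmd='dpkg -i'
    for pkg in $PACKAGES; do
        cmd="$cmd $ARCHIVE_DIR/${pkg}*_${VERSION}*.deb"
    done
    printf '%s\n' "$cmd"
}

downgrade_cmd
```

The printed line can be pasted into a root shell once the matching files have been confirmed to exist in /var/cache/apt/archives.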

Prevention
To prevent this problem from happening again, we added pin lines in /etc/apt/preferences to pin those packages to 1.7+dfsg4. We verified with an aptitude safe-upgrade that the pins prevent newer versions from being installed. There weren't any obvious log messages that we could alert on.
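
For illustration, pin stanzas of this kind could be generated as below. This is a sketch under stated assumptions, not a copy of our /etc/apt/preferences: the priority value 1001 and the pin_stanzas helper are assumptions, while the package names and version come from the Solution section. A priority above 1000 tells APT to keep the pinned version even if that means a downgrade.

```shell
#!/bin/sh
# Sketch: emit one APT pin stanza per affected package.
# PIN_PRIORITY=1001 is an assumed value; package names and version are
# taken from the report.

PIN_VERSION='1.7+dfsg4'
PIN_PRIORITY=1001
PACKAGES='libkrb5support libk5crypto3 libkrb5-3 libgssapi-krb5'

# pin_stanzas is a hypothetical helper: it writes the stanzas to stdout
# and does not modify /etc/apt/preferences itself.
pin_stanzas() {
    for pkg in $PACKAGES; do
        printf 'Package: %s\nPin: version %s\nPin-Priority: %s\n\n' \
            "$pkg" "$PIN_VERSION" "$PIN_PRIORITY"
    done
}

pin_stanzas
```

The output can be reviewed and then appended to /etc/apt/preferences by hand; afterwards, apt-cache policy libkrb5-3 shows which version APT would select.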