On New Year’s Eve I had to troubleshoot a nasty issue with a mail server that had apparently stopped accepting mail for delivery. Looking into the matter, I determined it was due to a named process (running on the same box) failing to resolve hostnames. When I tried to restart the process, it turned out I couldn’t shut it down properly, and restarting it took ages for no apparent reason.
I wanted to check whether other services were behaving the same way, and they were. I was unable to shut any service down properly (killall -9 came to the “rescue”), and starting any service took a couple of minutes instead of seconds. I checked all the standard things: disk usage, memory usage, CPU load, dmesg for segfaults or disk errors, mounts, and all kinds of network diagnostics. Everything was peachy, and if I hadn’t seen how the system acted with my own eyes, I would never have believed anything was wrong with it based on the data I had collected. As a last resort, I attached strace to a hanging process and found a clue: the last entry in the trace was related to logging a message to the system log. So I restarted the syslog daemon and voilà, everything went back to normal. For some reason the syslog process had hung and was effectively delaying all major services (more specifically, all services logging to syslog) running on that box.
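
For next time, here is a rough sketch of a quick check that might have saved me the strace detour. It is not what I ran that night. It assumes the syslog daemon listens on a Unix datagram socket at /dev/log, which is common on Linux but not universal, and the function name probe_syslog is something I made up for illustration. The idea is to send one message with a timeout, so that a stuck daemon shows up as a timeout instead of a hang.

```python
import socket


def probe_syslog(path="/dev/log", timeout=2.0):
    """Try to send one message to the log socket at `path`.

    Returns "ok" if the send completed, "blocked" if it timed out,
    and "unavailable" if the socket could not be reached at all.
    The default path and the datagram socket type are assumptions.
    """
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM)
    sock.settimeout(timeout)
    try:
        sock.connect(path)
        # <14> is the syslog priority: facility user (1) * 8 + severity info (6).
        sock.send(b"<14>syslog-probe: test message")
        return "ok"
    except socket.timeout:
        # The send did not complete in time, e.g. nobody is draining the socket.
        return "blocked"
    except OSError:
        # Missing socket file, wrong socket type, permission problem, and so on.
        return "unavailable"
    finally:
        sock.close()
```

Calling probe_syslog() on the affected box would be the whole test. Keep in mind that a single send can still succeed while the socket buffer has room left, so "ok" is not proof that the daemon is healthy, while "blocked" is a strong hint that it is not.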