Don’t Be Afraid to Reboot Unix Servers
This morning I ran into an infuriating piece of bad advice: avoid rebooting Unix servers at all costs. The advice comes from the article “When in doubt, reboot? Not Unix boxes” in InfoWorld by Paul Venezia. The reasoning behind Venezia’s advice boils down to one giant misconception: that rebooting a Unix server as a means of troubleshooting is the mark of an inexperienced sysadmin. I will be the first to acknowledge that a reboot should never be the first step in fixing server problems. If a problem occurs and you still have access to the server, you need to investigate the issue. Quickly.
Even if you cannot find the problem, or the problem lies outside your area of expertise (say, a user application crashed and the vendor tech support tells you to reboot), you need to be reasonably sure that the reboot will not destroy the evidence – log files, core dumps, etc. – and that the server will actually reboot and not, say, hang on a stale NFS mountpoint during shutdown or run into a disk problem during bootup. From Venezia’s article:
“If you shrug and reboot the box after looking around for a few minutes, you may have missed the fact that a junior admin inadvertently deleted /boot and some portions of /etc and /usr/lib64 due to a runaway script they were writing.”
I don’t care how green Venezia’s junior admin is, this is just sad. Still, if the senior admin knows his stuff, he has a secondary boot disk, so even this massive screwup would be no more than a nuisance. If any of my servers go down, the clock starts ticking. Unlike Mr. Venezia the-veteran-Unix-admin, I don’t have the luxury of time. Yes, I would love to spend a few hours dicking around with log files, kernel modules and user applications to find the root cause. Unfortunately, my servers actually do something important, so when one of them goes down – brass upstairs gets intensely curious.
I will spend a reasonable amount of time analyzing the situation before taking any drastic action like a reboot. In most cases a reboot will not be necessary. But avoiding a reboot just because the server runs Unix is silly. More often than not, my priority is to restore services; root cause analysis is a distant second. If I decide that a reboot may resolve the issue – even if only temporarily – I will copy log files to a network share; preserve the list of running processes, open files, etc.; generate a core dump and move it off the server; and then I will reboot the box.
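That pre-reboot evidence grab can be as simple as a short script. Here is a minimal sketch, assuming a POSIX shell; the destination directory and the log file paths are my placeholders, not a standard, so adapt them to your environment:

```shell
#!/bin/sh
# Minimal pre-reboot evidence capture (sketch). EVIDENCE_DIR and the log
# paths are assumptions -- point them at your network share and your logs.
EVIDENCE_DIR="${EVIDENCE_DIR:-./pre_reboot_evidence}"
mkdir -p "$EVIDENCE_DIR"

ps -ef  > "$EVIDENCE_DIR/ps.txt"       # running processes
df -k   > "$EVIDENCE_DIR/df.txt"       # filesystem usage
uptime  > "$EVIDENCE_DIR/uptime.txt"   # load and uptime at the moment of capture

# Open files, only if lsof is installed on this box
command -v lsof >/dev/null 2>&1 && lsof > "$EVIDENCE_DIR/lsof.txt" 2>/dev/null

# Copy whatever logs are readable; these two paths are just examples
for log in /var/log/messages /var/log/syslog; do
    [ -r "$log" ] && cp "$log" "$EVIDENCE_DIR/"
done
echo "Evidence saved to $EVIDENCE_DIR"
```

From there you would tar up the directory and push it to another machine before touching the power button.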
A couple of weeks ago I was upgrading an engineering application. The process was not complicated: run a vendor-provided script to shut down the application and related services and then run the installation script. Upon completion, the installation script restarted the application and the users confirmed that everything was working properly. The upgrade instructions specifically said that a reboot was not needed. Usually, I would reboot servers after installing new applications or performing major software upgrades, even if a reboot is not formally required. But I already had a reboot scheduled for this server for the upcoming weekend to enable some filesystem changes made earlier in the week. I figured that should be enough.
On Saturday I got a call from ops telling me that the application I upgraded was not working. My first guess was that something had gone wrong with the filesystem changes after the reboot. I checked it out and everything looked peachy. It turned out that the application upgrade had replaced the startup script in /etc/init.d with a new version that had a simple bug: it did not give the license server enough time to start before launching the application. As a result, the application failed to connect to the license server and terminated the startup process.
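A defensive startup script would poll for the license server instead of assuming it is already up. Here is a hedged sketch of such a retry loop; the probe command and the lmgrd process name are assumptions on my part, since the actual license manager is vendor-specific:

```shell
#!/bin/sh
# wait_for <probe-command> <max-tries>: retry a probe once per second until
# it succeeds or we give up. Returns 0 on success, 1 on timeout.
wait_for() {
    probe="$1"
    max="${2:-30}"
    n=0
    while [ "$n" -lt "$max" ]; do
        if eval "$probe"; then
            return 0
        fi
        n=$((n + 1))
        sleep 1
    done
    return 1
}

# Hypothetical use in the init script: block until the license daemon
# (lmgrd here, purely as an example) is running before launching the app.
# wait_for "pgrep -x lmgrd >/dev/null" 30 || {
#     echo "license server never came up" >&2
#     exit 1
# }
```

A loop like this turns the silent race the new init script introduced into either a clean start or an explicit, logged failure.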
This problem did not appear during the initial upgrade because the license server was already running. Had I rebooted the server after upgrading the application, I would have spotted the issue and would not have wasted my Saturday night on a long drive from a bar to the customer site. The longer your server stays up, the more such issues it will accumulate. Interactions between various OS components and user applications are complex and difficult to predict. If your server configuration is fairly static – say, a file server or a proxy server – then you can probably get away with rebooting it every couple of years, just to make sure the power button has not rusted through. Application servers are a different story.
As Mr. Venezia correctly notes in one of his other articles, Unix sysadmins are inherently lazy. Very true. We write scripts to automate tasks that are tedious and repetitive, and not only those. I have a pre-reboot script that saves all the data necessary for troubleshooting and transfers it to another server. The script also performs a variety of OS and hardware checks to make me comfortable that the server will actually come back up. Having a verified secondary boot disk is key here. It’s a model against which I can check my boot partition, the fstab file, and any other key configuration files and startup scripts.
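That config check against the alternate boot disk can itself be scripted. A minimal sketch, assuming the alternate disk is mounted somewhere like /altboot; the mount point and the file list are my placeholders, not a convention:

```shell
#!/bin/sh
# check_against_alt <alt-root> <file>...: compare each live file against its
# copy under the alternate boot disk's mount point. Prints every mismatch
# and returns nonzero if any file differs.
check_against_alt() {
    alt_root="$1"
    shift
    bad=0
    for f in "$@"; do
        if ! cmp -s "$f" "$alt_root$f"; then
            echo "MISMATCH: $f differs from $alt_root$f"
            bad=1
        fi
    done
    return "$bad"
}

# Example (paths are illustrative):
# check_against_alt /altboot /etc/fstab /etc/inittab || echo "do not reboot yet"
```

Any mismatch is a prompt to figure out which copy is right before the reboot, not after.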
I am not advocating rebooting Unix servers as the first line of defense, but it is an important tool. Unix sysadmins, myself included, on occasion like to rub the Windows admins’ noses in our uptime stats. It’s an endless source of much-needed entertainment. But it’s equally important not to get lost in your own myths. We all – and I feel confident speaking on behalf of all Unix sysadmins here – would like nothing more than to plop down in a big comfy chair with a cup of coffee stolen from the DBA break room and pore over thousands of lines of log files, using our superior deductive reasoning to find the cause of the problem. But this is where the Unix ethos runs into the brutal reality of business. Some of my servers support over a hundred engineers. These engineers make upwards of two hundred bucks an hour each. Downtime gets very expensive in a hurry, and playing Sherlock Holmes is not always the best way to keep your job.
Reboot your Unix servers after making any major changes to the production environment. Should an unexpected problem come up, it will be easier to deal with while everything is still fresh in your mind, and not six months down the road, when you have to reboot to replace a failed system board and suddenly discover that some application won’t load, by which time you have forgotten all about that application and have to start from the first page of the admin guide. Aside from rebooting after big changes, you also need regular maintenance reboots. All production systems must have an up-to-date alternate boot disk. All alternate boot disks must be regularly tested. This has nothing to do with the OS you are running. This is just what sysadmins do as a matter of keeping their jobs.