24×7 software is a whole ‘nother ball of wax

Today I was hacking away and getting really annoyed at how slow my laptop had become. I checked ‘top’ and saw that several copies of Google Chrome were running, one with over 92 hours of accrued CPU time. For those of you from non-UNIX land, that is time actually spent running on the CPU. When the CPU switches to another process, the clock pauses, and when it switches back, the clock resumes. So that isn’t 92 hours of wall time; the wall time behind it is much more than that.
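
If the CPU time versus wall time distinction is new to you, here is a minimal Python sketch of it. The names `measure` and `busy` are made up for illustration; `time.process_time()` and `time.perf_counter()` are the standard library clocks for CPU time and elapsed time respectively.

```python
import time

def measure(work, idle_seconds):
    """Return (cpu_seconds, wall_seconds) for running work() and then sleeping."""
    cpu_start = time.process_time()   # CPU time charged to this process
    wall_start = time.perf_counter()  # elapsed real-world time
    work()
    time.sleep(idle_seconds)          # sleeping costs wall time, almost no CPU time
    cpu = time.process_time() - cpu_start
    wall = time.perf_counter() - wall_start
    return cpu, wall

def busy():
    # A small CPU-bound loop, purely for illustration.
    total = 0
    for i in range(200_000):
        total += i * i
    return total

if __name__ == "__main__":
    cpu, wall = measure(busy, idle_seconds=0.5)
    print(f"CPU time:  {cpu:.3f} s")
    print(f"wall time: {wall:.3f} s")
```

The CPU figure should come out well under the wall figure, because the clock that ‘top’ reports only ticks while the process is actually on the CPU.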

LibreOffice seemed to be dominating the system with 35% utilization. That figure is relative to a single core, so on a 2-core unit you need to halve it to get the real picture.
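
The arithmetic is trivial, but here it is as a sketch. It assumes ‘top’ is reporting percentages relative to one core, which is what the halving implies; `share_of_machine` is a hypothetical helper name, not anything ‘top’ provides.

```python
import os

def share_of_machine(per_core_percent, cores=None):
    """Convert a per-core CPU percentage into a share of the whole machine."""
    if cores is None:
        cores = os.cpu_count() or 1  # os.cpu_count() can return None
    return per_core_percent / cores

print(share_of_machine(35.0, cores=2))  # 17.5
```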

Bottom line: stuff was slow and eating up the system. Well, what can you expect after leaving this machine on for so long? Current uptime reads 12 days. And considering that my MacBook Pro handles low power by going into hibernation (and I have driven it to that state many times), what else can you expect from beating on these apps so hard?

At my old job, we managed a 24×7 mission-critical ops center. It was in that crucible that I learned the price of not only having your software do the right thing, but also having it on ALL THE TIME. It is one order of magnitude to get it right. It is a whole ‘nother order of magnitude to have it stay up all the time and remain bulletproof.

We had a training lab that wasn’t really on our radar screen, because we rarely made adjustments to it. The only times we paid attention were when the trainer called to say that “half the machines are broken.” After responding a handful of times, we started to see a pattern. The PCs that had the most trouble appeared to have been up for more than two weeks. We told the sysadmin to reboot each PC every two weeks as standard operating procedure, and the number of incidents fell dramatically.
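
A check for that two-week rule could look something like the sketch below. This is not what we ran in the lab. The helper names are hypothetical, and the uptime reader assumes a Linux box, where the first field of `/proc/uptime` is the number of seconds since boot. On any other OS you would have to get the uptime some other way.

```python
from datetime import timedelta
from pathlib import Path

MAX_UPTIME = timedelta(days=14)  # the two-week threshold from the story

def read_uptime_linux(path="/proc/uptime"):
    """Read uptime on Linux; the first field of /proc/uptime is seconds since boot."""
    seconds = float(Path(path).read_text().split()[0])
    return timedelta(seconds=seconds)

def needs_reboot(uptime, limit=MAX_UPTIME):
    """True once a machine has been up longer than the limit."""
    return uptime > limit

if __name__ == "__main__":
    if Path("/proc/uptime").exists():
        up = read_uptime_linux()
        print(f"up {up.days} days, reboot due: {needs_reboot(up)}")
    else:
        print("no /proc/uptime here; supply the uptime some other way")
```

By this rule my laptop at 12 days would not be flagged yet, which says the threshold is a rule of thumb and not a guarantee.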

So when you go out and build a system, be sure to put that on your radar screen. Any leg of your system, whether it is the kernel, a JVM, some other VM (like Python), or whatever, may have microscopic memory leaks or garbage collection bugs that rarely get hit. But when you start to run stuff all the time, you can expect weird, why-is-that-failing-now junk to happen every few months. Sometimes you have no choice but to simply reboot the relevant machine and push on. Sometimes no amount of analysis will give you an answer. Welcome to 24×7 operations software.
