October 06, 2004

Brain Surgery

Our web server for the NASA IBEAM Project, http://brain.cs.uiuc.edu, underwent surgery yesterday for an acute respiratory problem, the complications of which included high fever and loss of consciousness. The procedure went smoothly, though more loss of blood than expected was encountered (alas, on the part of the surgeon) than expected. Brian and Brain are resting comforatble, and a complete recovery is expected for both.

The procedure, I'm pleased to report, was neither "brain surgery" nor "rocket science".

Anyway, our server was crashing due to a faulty cooling fan. I rather suspected something of the sort when it would lock up and refuse to do anything, only to return to its old self after being powered down for a few minutes. Still, there was no obvious fan problem to be seen once the covers came off.

Howerver, the BIOS Hardware status page told the take quite vividly. I could see the fan speed for the CPU0 drop from 5400 RPM down to 500, and then to zero, while the CPU temperature rose to nearly 212 F. A smoking gun; a nearly toasted Athlon.


I mention this only because it reinforced something I've been thinking while building test cases of late: why can't more software components work this way? I want code that does more of this kind runtime validation and status reporting in more of the objects I build and use.

Now, this is surely not a new idea; just one I'm personally becoming quite sold on / smitten with of late. Why can't more of these kinds of monitoring and instrumentation facitlites be built into more of our components? As with motherboards, is this a by-product of technical maturity?

Posted by foote at October 6, 2004 04:32 AM