Monday, 25 October 2010

The secret known to every IT department


4 comments:

Sergej said...

Probability that you will have to fix the same problem again: 1. Of course, I come from the other side of the story: I write software. If you are in IT, or a user, stuck treating the software as a black box, then rebooting and hoping that the next revision has a fix may be the only option.

I suppose this gets into the question of when you (as a company) want to release. The risk of too much testing is losing market share to someone who released earlier; Microsoft got big in the '70s through '90s by being there first. The risk of too little is... well, earning Microsoft's reputation for reliability. Risks can be reduced by brilliant design from the beginning, but despite the way we software types are treated these days, good engineers are not that common.

jayessell said...

Lead up:

http://www.collectedcurios.com/SA_0429_small.jpg

http://www.collectedcurios.com/SA_0430_small.jpg

http://www.collectedcurios.com/SA_0431_small.jpg

http://www.collectedcurios.com/SA_0432_small.jpg

Punchline:

http://www.collectedcurios.com/SA_0433_small.jpg

Chris Lopes said...

Turning it on and off works on the assumption that the user (or his/her next of kin) did something really stupid to the current instance of the system. Sometimes that can be a better than even bet, but not always.
In any case, the fact that the system was not able to deal with the stupidity should be considered a design flaw.

Sergej said...

Yes, Chris Lopes, but as with the US and international dislike, it isn't always because of something you did. For instance, there is a condition called a memory leak: each second, or each time a certain action is performed (say, processing a certain type of network command), a few of the bytes that were allocated are not returned to free memory. Over the course of hours or days the program's memory footprint bloats, and eventually, for a variety of reasons, this may cause it (or another program) to crash. There are tools that help find this condition, at the cost of a long, slow run, and language features designed to prevent it, at the cost of the program running more slowly. You might go through several iterations leaning on those language crutches before rewriting for speed, as if you had known what you were doing all along.
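The condition Sergej describes can be sketched in a few lines of Python. This is a deliberately simplified, hypothetical example (the names `handle_command` and `_cache` are made up); real leaks usually hide in a forgotten cache or in native allocations, and tracemalloc plays the role of the "long, slow" leak-finding tools he mentions:

```python
import tracemalloc

_cache = []  # a "cache" nobody ever empties: this is the leak

def handle_command(payload):
    # Each call allocates a buffer and keeps a reference to it forever,
    # so the garbage collector can never reclaim the memory.
    _cache.append(bytes(1024))
    return len(payload)

tracemalloc.start()
before, _ = tracemalloc.get_traced_memory()
for _ in range(10_000):
    handle_command(b"ping")
after, _ = tracemalloc.get_traced_memory()
tracemalloc.stop()

growth = after - before
print(f"footprint grew by roughly {growth // 1024} KiB over 10,000 commands")
```

Each iteration leaks about a kilobyte, which is invisible in a short test run but fatal over days of uptime, exactly the kind of bug a reboot "fixes".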

Once you start running with more than one thread or process, things become more complicated. Two processes might need the same pair of resources. Each grabs one and starts waiting for the other to become free, but that other resource has just been grabbed by the other process, so the whole thing freezes: a deadlock. The window of time that is vulnerable to this may be very small, so the program will usually run for hours before the condition is encountered. I once had to debug just such a thing, which was deadlocking after approximately two hours of run-time. Time was expensive, because there was a real, hardware CT scanner attached. It took me three days without leaving the office to find it. At the end, my boss told me: "next time, I want you to work smarter, not harder". (He totally said this! In just these words!) I was glad when I found a different employer and got away from that clown.