Fictional Plumbing Problems As A Tortured Analogy For Software Engineering

(latentcontent.net)

19 points | antubbs | 14y ago | 16 comments

16 comments

GavinB 14y ago

This may be a tortured analogy, but it boils down to a basic problem:

1. You know there's a bug

2. You can't reproduce it

Several next steps come to mind:

1. Hire an outside expert who's dealt with this sort of thing before. They may be able to theorize what's going on and come up with a solution.

2. Install measures that don't prevent the problem but do prevent the damage. For example, an emergency failsafe that shuts down the system or relieves the pressure when the incident occurs. This is why electrical systems have fuse boxes! Error management is sometimes the only option, because 100% error prevention is impossible.

3. Install monitoring that tracks a lot more detail than you are currently getting. When the next error occurs, you will know a lot more and may have the information needed.
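As a rough illustration of step 2, here is a minimal sketch of a software "fuse": a circuit breaker that stops calling a failing component after repeated errors, then lets calls through again after a cool-down. The class name, thresholds, and injectable clock are all assumptions made for this example, not anything from the article.

```python
import time


class CircuitBreaker:
    """Trips open after repeated failures so callers fail fast instead of piling on."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures  # hypothetical threshold
        self.reset_after = reset_after    # hypothetical cool-down, in seconds
        self.clock = clock                # injectable so the breaker can be tested
        self.failures = 0
        self.opened_at = None  # None means the breaker is closed (normal operation)

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: refusing call")
            # Cool-down elapsed: close the breaker and start counting afresh.
            self.opened_at = None
            self.failures = 0
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        self.failures = 0
        return result
```

Like a fuse, this does nothing to fix the underlying fault; it only limits how much damage the fault can do while you investigate.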

Edit: What's the name of the theory in networking that 100% error prevention is not possible, so error handling is the only option? There was a great article on HN about it a few years back.

drone 14y ago

My experience has almost always led to #3 being the most workable solution, but not a perfect one.* #2 should be incorporated into any project, but it presumes that you know all possible ramifications of incorrect operation. An electrical breaker works because complete non-operation is generally better than death. For many software companies, complete non-operation is a precursor to death.

#1 is almost never a good solution, partly because the time it would take an outside expert to become familiar enough with the codebase to not aggravate your existing engineers would exceed several iterations of #3, and partly because I've rarely met an outside expert whose solutions didn't involve re-writing everything to meet their expectations of "correct implementation." This could be a sample selection problem on my part, however.

* - How do you know that you are monitoring the correct component? This path usually leads to multiple monitoring development tasks as you find that where you thought the problem was sourced was in fact a symptom, and you continue adding more monitoring options as you get closer to the source. This is why I almost always add an insane level of logging to any application, and control the verbosity through runtime controls.
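A minimal sketch of that "log heavily, control verbosity at runtime" approach, using Python's standard `logging` module. The logger name, the `set_verbosity` helper, and the pressure threshold are hypothetical; in practice the helper might be wired to a signal handler or an admin endpoint, which is left out here.

```python
import logging

log = logging.getLogger("plumbing")  # hypothetical logger name


def set_verbosity(level_name):
    """Change log verbosity while the process is running."""
    level = getattr(logging, level_name.upper(), None)
    if not isinstance(level, int):
        raise ValueError(f"unknown log level: {level_name!r}")
    log.setLevel(level)


def handle_request(pressure_psi):
    # Detailed logging is always present in the code; whether it is
    # emitted depends on the level chosen at runtime.
    log.debug("raw reading: pressure_psi=%s", pressure_psi)
    if pressure_psi > 80:  # hypothetical threshold
        log.warning("pressure above threshold: %s", pressure_psi)
    return pressure_psi <= 80
```

Calling `set_verbosity("debug")` when an incident is suspected turns on the detailed messages without a redeploy, and `set_verbosity("warning")` quiets them again.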

dredmorbius 14y ago

Brief non-operation (reboot / service restart) is often better than a prolonged outage. Particularly where SLAs are set to create an expectation and acceptance of this, and where redundancy exists.

I'm thinking too that there's a feedback process at work here, and some sort of damping mechanism would help with that.
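One common way to damp a restart loop is exponential backoff: wait longer after each consecutive crash so a persistently failing service does not thrash. The sketch below assumes that technique; the function names, delays, and restart limit are illustrative only.

```python
import time


def backoff_delays(base=1.0, factor=2.0, cap=60.0):
    """Yield restart delays that grow geometrically up to a cap."""
    delay = base
    while True:
        yield min(delay, cap)
        delay *= factor


def supervise(run_service, max_restarts=5, sleep=time.sleep):
    """Restart run_service after each crash, waiting longer each time.

    Returns the number of restarts performed before a clean exit;
    re-raises the last error once max_restarts is exhausted.
    """
    delays = backoff_delays()
    restarts = 0
    while True:
        try:
            run_service()
            return restarts
        except Exception:
            if restarts >= max_restarts:
                raise
            sleep(next(delays))
            restarts += 1
```

The `sleep` parameter is injectable so the supervisor can be exercised in tests without actually waiting.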

drone 14y ago

Agreed, and many architectures are designed to have components "transparently fail" without impact to overall operation. When you have forced failures, feedback/damping is absolutely required. However, my experience is that most such failures are unplanned and unknowable at the outset, and you can only dampen conditions which are predictable.

stcredzero 14y ago

Maybe programming is tortured analogies all the way down? (Not really, but there are some over-engineered code bases that feel like it.)

dredmorbius 14y ago

No, it's turtles.

Tortured turtles.

stcredzero 14y ago

Actually, it's tortured turtle analogies all the way down.

dredmorbius 14y ago

Well played, sir.

dsr_ 14y ago

In the apartment building, we have a known problem, a high severity attached to it, an unacceptably high incident rate, and no idea of the exact conditions necessary to replicate it.

At this point I would do two things:

1. log all the things.

2. find me my top QA person, the one who can find bugs that nobody has yet reported. Put her on it.

OK, everybody knows that logging is good. And everyone knows that QA is good.

What I have found, though, is a number of companies who think that QA is best done by the developer who wrote the feature... and I think they are absolutely wrong in every sense, except possibly short-term economics. Having someone do QA who has none of their ego invested in the code is essential.

tomjen3 14y ago

The problem here is that there is very little, if anything, as complicated as software. Preventing leaks like in the example is not that difficult -- you put in pipes that can handle a lot more than the required load, because it is unacceptably expensive to have them burst and the better pipes are not that much more expensive (putting them in is).

jaylevitt 14y ago

Nope. I actually live in a luxury high-rise, and while the pipes don't leak, the pressure and temperature are about as bad as the OP describes.

No bidets, though.

duwease 14y ago

I was hoping for an analogy to explain how difficult it is to estimate long-term programming work, because unexpected "black swan" details pop up as you get into the work and add considerable effort to the project. It's a situation I find I need to explain often, and in layman's terms, so a perfect analogy would be great...

antubbs (OP) 14y ago

I think that's been beaten to death with http://www.quora.com/Engineering-Management/Why-are-software...

scotty79 14y ago

Redo the bathrooms using a different layout and components.

alainbryden 14y ago

"If at first you don't succeed, refactor."

drone 14y ago

Or, you can pivot... Turn the bathrooms into fishtanks!
