Hi,
You might read this first:
http://news.ycombinator.com/item?id=4057912
You can reply to that here as well if you want.
I think we're in very general agreement. Although you yourself did not say "it is likely to fail" or "you should expect that it will fail", this is exactly the sentiment I was replying to.
Regarding your "all hardware possibly failing" point and the starter motor example, which imply that I am trying to make a technical problem disappear with a semantic argument: I think I am being quite a bit more specific than that (especially in that cousin reply).
Basically, when it comes to safety mechanisms that exist as a layer on top of a process and aren't necessary at all, I simply shouldn't have to even think about reinventing another safety mechanism on top of the safety mechanism. Get one that isn't defective.
A hard drive isn't defective just because it fails: it's expected to. A RAID controller is also expected to fail... JUST NOT SILENTLY.
In the seatbelt example: should you even have to think about tying your seatbelt to the buckle with sturdy rope, for real safety in case the seatbelt just doesn't buckle when it seems to, or comes undone like a ripped shirt button at the slightest firm tug?
No. You should get an actual seatbelt.
Basically, the standard you hold a control layer to is different from the standard you hold an underlying process to.
It would be like the difference between your brake failing and your handbrake (there for added safety) failing, which you only engage on top of the motor's brake anyway. If the motor brake fails you would start rolling (if you're on a bit of an incline). But you shouldn't even have to think about a handbrake 'just failing' in the same condition.
Sure it can fail if you are being towed without being lifted, or whatever, in an extreme situation. But in a normal situation?
Basically, it is a difference of both category/kind AND of degree.
I am certainly not saying that a parking brake can never fail. I am not saying a RAID controller can never fail.
I am saying that both of these, when they are layers on top of a normal process, should be out of sight, below the threshold of things you have to control for. If they're not, you need to get a different one.
You don't get six insurance policies against the same earthquake possibility, hoping that they won't ALL decide to out-lawyer you or go bankrupt. You get real insurance that's properly reinsured. Check up on them. Find a real one.
RAID failure is fine. Silent RAID failure is not fine.
(Checksum failure with an exception is fine. Checksum failure with no exception, warning, or error, where a random checksum is just produced, or a check randomly passes when the checksum doesn't match the one you provided, is not okay. Fix your checksum, get a real one. Don't build another layer on top for the cases where your checksum is a randomized print statement, or your insurance policy is a monthly donation from you to a non-charitable organization that puts aside a portion to out-lawyer you with if you try to make a claim, with the rest spent on advertising or kept as profit. That's not an insurance policy, that's a scam.)
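To make the checksum case concrete, here is a minimal Python sketch of a check that fails loudly. The names `verify` and `ChecksumMismatch` are made up for illustration, and SHA-256 is just an assumed choice of checksum; the only point is that a mismatch raises instead of passing quietly.

```python
import hashlib


class ChecksumMismatch(Exception):
    """Hypothetical error type: data does not match the provided checksum."""


def verify(data: bytes, expected_hex: str) -> None:
    """Return None if the checksum matches; raise loudly if it does not."""
    actual_hex = hashlib.sha256(data).hexdigest()
    if actual_hex != expected_hex.lower():
        # The failure is visible to the caller: no silent pass, no made-up value.
        raise ChecksumMismatch(f"expected {expected_hex}, got {actual_hex}")


payload = b"block 42"
good = hashlib.sha256(payload).hexdigest()

verify(payload, good)  # matching data: returns None, nothing to handle

try:
    verify(b"block 43", good)  # corrupted data: must not pass
except ChecksumMismatch as err:
    print("loud failure:", err)
```

With a check like this, the caller handles one visible exception and never needs a second layer to find out whether the first one lied.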