We also had a few thousand physical servers with about a terabyte of RAM each.
You are right: we did see repaired errors, but we also saw (indirectly, and after testing) unrepaired ones.
But the initial discussion was that ECC RAM makes it go away, and your point was that it doesn't. The vast, vast majority of the errors, according to my understanding and to the paper you pointed to, are repairable. Roughly 1 in 400 errors is non-repairable. That's a huge improvement! If you had ECC RAM, the failures Firefox sees here would drop from 10% to 0.025%! That is highly significant!
Even better: you would now be informed of 2-bit errors! You would _know_ what is wrong.
You could have 3(!)-bit errors, and those you might not see, but they'd be several orders of magnitude rarer still.
So yes, it would not 100% go away, but roughly 99.75% of it would. That's... making it go away in my book.
And last but not least, this paper mentions uncorrectable errors. It says nothing of undetectable ECC errors! You said _undetectable_ errors. I'm sure they happen, but I would be surprised if you have any meaningful incidence of them, even at terabytes of data. It's probably on the order of 0.000625 of errors you can get (but if you want, I can do more solid math).
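For anyone who wants to check the arithmetic in this comment, here is a minimal sketch. The 10% baseline and the 1-in-400 uncorrectable fraction are the figures quoted above. The last line is only my guess at where the 0.000625 figure might come from (the 1-in-400 fraction squared, read as a percentage). The comment does not say so, and the function name is made up for illustration.

```python
def residual_failure_rate(baseline_rate: float, uncorrectable_fraction: float) -> float:
    """Failure rate left over if only uncorrectable errors still cause failures."""
    return baseline_rate * uncorrectable_fraction


baseline_rate = 0.10              # share of failures blamed on memory errors (figure from the thread)
uncorrectable_fraction = 1 / 400  # roughly 1 in 400 errors is non-repairable (figure from the thread)

residual = residual_failure_rate(baseline_rate, uncorrectable_fraction)
removed = 1 - uncorrectable_fraction

# ASSUMPTION: one possible reading of the 0.000625 figure, not stated in the comment.
squared = uncorrectable_fraction ** 2

print(f"residual failure rate: {residual:.3%}")      # 0.025%
print(f"share of failures removed: {removed:.2%}")   # 99.75%
print(f"(1/400)^2 as a percentage: {squared:.6%}")   # 0.000625%
```

Note that with these inputs the share removed comes out at 99.75%, slightly under a flat 99.9%.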
I think we diverge on ‘making it go away in my book’.
When you’re the one having to debug all these bizarre things (there were real money numbers involved, so these things mattered), over millions of jobs every day, rare events with low probability don’t disappear; they just happen and take time to diagnose and fix.
So in my book ECC improves the situation, but I still had to deal with bad DIMMs, and ECC wasn’t enough. We didn’t use to see these issues because we already had too many software bugs, but as we got increasingly reliable, hardware issues slowly became a problem, just like compiler bugs or other elements of the chain usually considered reliable.
I fully agree that there are lots of other cases where this doesn’t matter and ECC is good enough.
Thanks for taking the time to reply!
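To put a rough shape on "rare events don't disappear at scale", here is a small sketch. Both numbers are hypothetical placeholders picked for illustration, not figures from this thread: two million jobs per day and a one-in-a-million chance that any given job is hit by an uncorrected memory error. Jobs are assumed to be independent.

```python
import math


def expected_incidents(jobs_per_day: int, p_per_job: float) -> float:
    """Expected number of affected jobs per day, assuming independent jobs."""
    return jobs_per_day * p_per_job


def prob_at_least_one(jobs_per_day: int, p_per_job: float) -> float:
    """P(at least one affected job in a day) = 1 - (1 - p)^n, computed stably."""
    return -math.expm1(jobs_per_day * math.log1p(-p_per_job))


JOBS_PER_DAY = 2_000_000  # hypothetical: "millions of jobs every day"
P_PER_JOB = 1e-6          # hypothetical per-job probability of an uncorrected error

print(f"expected incidents per day: {expected_incidents(JOBS_PER_DAY, P_PER_JOB):.1f}")
print(f"chance of at least one per day: {prob_at_least_one(JOBS_PER_DAY, P_PER_JOB):.1%}")
```

Under those assumed numbers you would expect a couple of affected jobs every single day, which is why a per-job probability that sounds negligible can still turn into a steady debugging workload.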
But this is sort of the march of nines.
My knee-jerk reaction to blaming ECC is "naaah", mostly because it's such a convenient scapegoat. It happens, I'm sure, but it would not be the first explanation I reach for. I once heard someone blame a bug that happened multiple times on "cosmic rays". You can imagine how irked I was at the dang cosmic rays hitting the same data with such consistency!
Anyways, I'm sorry if my tone sounded abrasive. I, too, have appreciated the discussion.