The incident that brought this home to a lot of people was at either NCAR or UCAR, both near Boulder. Whichever it was, they were installing a new system - tens of thousands of nodes - and had not been careful about the EDAC settings. Therefore, EDAC wasn't running often enough and wasn't catching those single-bit errors. Therefore^2, uncorrectable errors were constantly bringing down nodes. According to rumor, this caused a huge delay and almost torched the entire project. It's easy to say in retrospect that they should have checked the EDAC settings first, but as it happened, they probably only got to that after multiple rounds of blaming the vendor for flaky hardware (which would generally be the more likely cause, especially when you're on the bleeding edge).