But in three decades of operating servers, it's been bad drivers, bad code and problematic hardware that have caused most of my headaches.
Have I seen ECC corrections in logs? Yeah. I don't advocate against it, but I've found that for most people, you design around multiple failure scenarios more than you design around preventing specific ones.
Take the average web app: you run it on 10 commodity systems and distribute the load. If one crashes, so what? Chances are a node is far more likely to crash for reasons other than memory issues.
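To put rough numbers on that, here's a minimal sketch. It assumes node failures are independent and that each node is up 99% of the time; both are made-up assumptions for illustration, not measurements, and the function name is hypothetical too.

```python
from math import comb


def prob_at_least_k_up(n: int, k: int, p_up: float) -> float:
    """Probability that at least k of n nodes are up.

    Assumes nodes fail independently, each being up with probability p_up
    (a simple binomial model).
    """
    return sum(
        comb(n, i) * p_up**i * (1 - p_up) ** (n - i)
        for i in range(k, n + 1)
    )


# Hypothetical figures: 99% per-node availability, and the app is assumed
# to cope fine as long as 8 of its 10 nodes are serving traffic.
single = prob_at_least_k_up(1, 1, 0.99)
cluster = prob_at_least_k_up(10, 8, 0.99)

print(f"one big box up:       {single:.4f}")
print(f"at least 8 of 10 up:  {cluster:.6f}")
```

The independence assumption is the weak spot: a bad driver or bad deploy can take out every node at once, which is exactly the kind of failure ECC doesn't help with either.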
If you have an app that requires massive amounts of RAM, or you do put all of your eggs in one basket, then ECC makes sense.
I just know I like going horizontal, and I avoid vertical monoliths.