I am curious about a couple of specific use cases that I can think of where this might affect me (and, most likely, would be common points of failure for others):
1. Data in the MySQL tables (stored in memory) is corrupted. Would MySQL crash? Would it report table corruption, so that I could just reload the table from disk? Or would it write the corrupted data to disk and permanently trash my data collection?
2. A large process that's running data analysis (say a big Python process with tons of data in RAM). Would one of my variables (say an int with value 4) turn into another number? Would it become unreadable?
I appreciate the effort to explain this. I know, in theory, why ECC RAM is useful, but I have difficulty visualizing real world scenarios.
With MySQL, any of those things can happen. If you're lucky, only the cache is corrupted and you can just reload from disk. If you're unlucky, the data gets corrupted on its way to disk and the wrong data is written. If you are astronomically unlucky, the in-memory machine code of MySQL gets changed in such a way that it starts overwriting your entire disk with garbage. You should probably be more afraid of meteorites, though, and of bugs in your own or others' code.
ECC RAM reduces the probability of such bit flips. That doesn't mean they are eliminated entirely, so you have to do these two things in any case:
1. Bit flips can cause processes to misbehave/crash. So you want to have a way to detect and restart misbehaving/crashed processes.
2. Even with ECC RAM you want to do your own error correction for critical data (say a bank transaction log).
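As a rough sketch of the second point, an application can store a checksum next to each critical record and verify it on every read. The record layout and the helper names `seal` and `unseal` are made up for illustration. CRC32 only detects corruption; actually correcting it would need a redundant copy or a real error-correcting code.

```python
import struct
import zlib

def seal(payload: bytes) -> bytes:
    """Prefix the payload with its CRC32 (4 bytes, big-endian)."""
    return struct.pack(">I", zlib.crc32(payload)) + payload

def unseal(record: bytes) -> bytes:
    """Return the payload, or raise ValueError if the checksum does not match."""
    (stored,) = struct.unpack(">I", record[:4])
    payload = record[4:]
    if zlib.crc32(payload) != stored:
        raise ValueError("checksum mismatch: record is corrupted")
    return payload

record = seal(b"transfer 100.00 from A to B")
assert unseal(record) == b"transfer 100.00 from A to B"

# Simulate a single bit flip somewhere in the payload.
damaged = bytearray(record)
damaged[10] ^= 0x04
try:
    unseal(bytes(damaged))
except ValueError as exc:
    print(exc)
```

This only helps if the checksum is computed before the corruption happens. A record that was already wrong in RAM when it was sealed will verify fine.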
Here is an interesting paper that discusses the prevalence of DRAM errors and the effectiveness of ECC RAM:
DRAM Errors in the Wild: A Large-Scale Field Study -- http://www.cs.toronto.edu/~bianca/papers/sigmetrics09.pdf
It would be interesting if somebody did an experiment where they artificially flipped bits of various software's memory to see what happens. I'd expect that in many cases it doesn't do any harm at all.
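A toy version of that experiment is easy to sketch in Python. This is not how one would instrument a real process (that takes a debugger or a fault-injection tool). It only flips one random bit in a small simulated memory region and classifies what happens when the data is read back. The sample document, the size of the unused region, and all function names are assumptions made for illustration.

```python
import json
import random

def flip_random_bit(data: bytes, rng: random.Random) -> bytes:
    """Return a copy of data with exactly one randomly chosen bit inverted."""
    buf = bytearray(data)
    pos = rng.randrange(len(buf) * 8)
    buf[pos // 8] ^= 1 << (pos % 8)
    return bytes(buf)

def classify(original: bytes, corrupted: bytes) -> str:
    """Compare how the corrupted document parses relative to the original."""
    try:
        parsed = json.loads(corrupted)
    except ValueError:  # covers JSONDecodeError and UnicodeDecodeError
        return "crash"
    return "harmless" if parsed == json.loads(original) else "silent corruption"

def run_trials(doc: bytes, unused_bytes: int, trials: int, seed: int = 0) -> dict:
    """Flip one bit per trial in doc plus a stretch of unused memory."""
    rng = random.Random(seed)
    arena = doc + bytes(unused_bytes)  # zero bytes stand in for unused RAM
    counts = {"crash": 0, "harmless": 0, "silent corruption": 0}
    for _ in range(trials):
        corrupted = flip_random_bit(arena, rng)
        counts[classify(doc, corrupted[:len(doc)])] += 1
    return counts

sample = json.dumps({"name": "Travis", "balance": 4, "tags": ["a", "b"]}).encode()
print(run_trials(sample, unused_bytes=256, trials=1000))
```

How often a flip turns out harmless in this model depends almost entirely on how much of the region is unused, which is itself a guess about real workloads.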
Let's use MySQL as an example. A bit flip in the memory which holds the code may cause it to crash. A bit flip in the 'metadata' could cause the table to become corrupted, potentially recoverably. A bit flip in the data itself could turn 'Travis' into 'Trcvis', which might go undetected depending on where it happened and which storage engine you are using.
The use of memory for OS page caching (less relevant for databases, which often use O_DIRECT, and more so for other programs) means that arbitrary corruption could hit pieces of disk data your program never even touched, if you touch data near them.
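To make the data case concrete, here is a minimal Python sketch of what one flipped bit does to a stored string and to a small integer. The helper name `flip_bit` is made up for illustration, and the bit positions are picked by hand.

```python
def flip_bit(data: bytes, byte_index: int, bit: int) -> bytes:
    """Return a copy of data with one bit inverted (bit 0 is the least significant)."""
    buf = bytearray(data)
    buf[byte_index] ^= 1 << bit
    return bytes(buf)

name = b"Travis"
# 'a' is 0x61; inverting bit 1 gives 0x63, which is 'c'.
print(flip_bit(name, 2, 1))   # b'Trcvis'
# Inverting bit 5 toggles ASCII case: 'a' (0x61) becomes 'A' (0x41).
print(flip_bit(name, 2, 5))   # b'TrAvis'

value = 4
# An int with value 4 (binary 100) after a flip of bit 0 or bit 2.
print(value ^ (1 << 0))       # 5
print(value ^ (1 << 2))       # 0
```

Both corrupted strings are still perfectly valid text, which is why this kind of error can slip through unless something checksums the data.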
I've had non-ECC RAM systems destroy database tables leading to data loss.
Systems with ECC either correct the error (and log it) or raise an alarm. Even discounting RAM bit flips, simple bad RAM can destroy your data. Having an ECC-aware system (including getting Linux to check EDAC, or having the baseboard management controller do so) has saved me many times from failing hardware.
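On Linux, the EDAC driver exposes per-memory-controller error counters under sysfs, so a monitoring script can poll them. A minimal sketch, assuming the EDAC module for your chipset is loaded and the usual `/sys/devices/system/edac/mc` layout is present; the function name and the status labels are made up for illustration.

```python
from pathlib import Path

EDAC_ROOT = Path("/sys/devices/system/edac/mc")

def read_edac_counts(root: Path = EDAC_ROOT) -> dict:
    """Return {controller: (corrected, uncorrected)} for each mc* directory."""
    counts = {}
    if not root.is_dir():
        return counts  # EDAC not available on this machine
    for mc in sorted(root.glob("mc[0-9]*")):
        try:
            ce = int((mc / "ce_count").read_text())
            ue = int((mc / "ue_count").read_text())
        except (OSError, ValueError):
            continue  # counter missing or unreadable on this system
        counts[mc.name] = (ce, ue)
    return counts

for name, (ce, ue) in read_edac_counts().items():
    status = "ALERT" if ue else ("warn" if ce else "ok")
    print(f"{name}: corrected={ce} uncorrected={ue} [{status}]")
```

A steadily climbing corrected count is usually the early warning that a DIMM is on its way out. Tools such as rasdaemon or edac-util report the same information if you would rather not script it yourself.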