I had memory stability issues which would immediately show under Prime95 (in less than a minute) but pass hours of Linpack.
I guess I'd probably be okay with that if the only thing I ever used the computer for was gaming.
Meta quickly detects silent data corruptions at scale - https://news.ycombinator.com/item?id=30905636 - April 2022 (95 comments)
Silent Data Corruptions at Scale - https://news.ycombinator.com/item?id=27484866 - June 2021 (12 comments)
None of this is free, and there are both hardware and software solutions for mitigating various categories of risk. It is explicitly modeled as an economics problem, i.e., how does the cost of not mitigating a risk, should it materialize, compare to the cost of minimizing or eliminating it? In many cases the optimal solution is unintuitive, such as computing everything twice or thrice and comparing the results rather than using error correction.
As for adding ECC within the CPU, I think that would require you to essentially have a second CPU in parallel to compare against.
2 will tell you if they diverge, but you lose both results if they do. 3 lets you keep 2 in operation if one diverges.
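The 2-vs-3 trade-off above can be sketched as a simple majority vote (a toy illustration, not any real system's implementation):

```python
from collections import Counter

def majority_vote(results):
    """Compare results from replicated computations.

    With 2 copies you can only detect divergence; with 3 copies a
    single bad replica is out-voted and you still get a trusted result.
    """
    winner, votes = Counter(results).most_common(1)[0]
    if votes <= len(results) // 2:
        # 2 copies that disagree, or 3 copies that all disagree:
        # divergence detected, but no result can be trusted.
        raise RuntimeError("no majority: divergence detected, result unusable")
    return winner

majority_vote([42, 42, 41])  # one of three diverged; the other two win
# majority_vote([42, 41])    # would raise: detected, but both results lost
```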
From my point of view, this technology problem may be interesting academically (and good for pretending to be important in the hierarchy at those companies) but a non-issue at scale business-wise in modern data centers.
Have a blade that once in a while acts funny? Trash and replace. Who cares what particular hiccup the CPU had.
I've worked on similar stuff in the past at Google and you couldn't be more wrong. For example, if your CPU screwed up an AES calculation involved in wrapping an encryption key, you might end up with fairly large amounts of data that can't be decrypted anymore. Sometimes the failures are symmetric enough that the same machine might be able to decrypt the data it corrupted, which means a single machine might not be able to easily detect such problems.
We used to run extensive crypto self testing as part of the initialization of our KMS service for that reason.
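The usual shape of such a self-test is a known-answer test: compare the output on a fixed input against a value precomputed on trusted hardware. A round-trip alone is not enough, since (as noted above) a symmetric fault can corrupt and then "un-corrupt" the same data. A minimal sketch, using a toy XOR "cipher" as a stand-in for AES (`toy_encrypt`, `self_test`, and the vectors are all illustrative names, not any real KMS's code):

```python
TEST_KEY = bytes(range(16))
TEST_PLAINTEXT = b"sixteen byte blk"

# Precomputed on a trusted machine and baked into the binary; real
# services would use published AES test vectors here.
EXPECTED_CIPHERTEXT = bytes(p ^ k for p, k in zip(TEST_PLAINTEXT, TEST_KEY))

def toy_encrypt(key, plaintext):
    # Toy XOR stand-in for the real cipher.
    return bytes(p ^ k for p, k in zip(plaintext, key * (len(plaintext) // len(key) + 1)))

def self_test():
    """Run at service startup: refuse to serve keys if the CPU/crypto
    path does not reproduce the known answer."""
    if toy_encrypt(TEST_KEY, TEST_PLAINTEXT) != EXPECTED_CIPHERTEXT:
        raise RuntimeError("crypto self-test failed: refusing to serve keys")
```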
Compressing data, with the increased information density & greater number of CPU instructions involved, seems certain to increase the exposure to corruption/bitflips.
What I did wonder was why represent the filesize as an exponent? One would imagine that representing it as a floating-point exponent would take lots of cycles, use nearly as many bits, and have nasty precision inaccuracies at larger sizes.
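The precision worry is easy to demonstrate: an IEEE 754 double has a 53-bit significand, so distinct integer byte counts above 2**53 (roughly 9 PB) collapse to the same float:

```python
import sys

# Doubles represent every integer exactly only up to 2**53.
assert sys.float_info.mant_dig == 53

size = 2**53
# Two different file sizes become indistinguishable as floats.
assert float(size) == float(size + 1)

# Below that threshold, adjacent sizes are still distinct.
assert float(2**52) != float(2**52 + 1)
```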
That's not a technical error; they mean SRAM in the CPU itself. You're right about gcj, but that's kind of a moot point when investigating a reproducible CPU bug like this. The paper mentions all the acrobatics they went through when trying to find the root cause, but if gcj had been practical, it also would have been immediately clear whether the gcj output reproduced the error or not. If it didn't reproduce, no big deal, try another approach. You might be right about it being easier to root-cause with gdb directly, but I'm not so sure. Starting out, you have no idea which instructions under what state are triggering the issue, so you'd be looking for a needle in a haystack. A crashdump or gdb doesn't let you bisect that haystack, so good luck finding your needle.
It all depends on how good you are with x64 assembly. If you're good enough, you can easily deduce what the instructions at that location do, and potentially just copy-paste them into an asm file, assemble it, and check the result. That would be much faster for me.
Bluntly speaking, people who are not familiar with low-level debugging made an honest and successful attempt to investigate a low-level issue. A seasoned kernel developer or reverse engineer would have just used gdb straight away.