DEFCON Talk: https://www.youtube.com/watch?v=aT7mnSstKGs
https://blog.mozilla.org/data/2022/04/13/this-week-in-glean-...
It was something about being more likely to be a human typo or a config change that rolled out to a bunch of machines. The statistics didn't add up, and it wasn't plausible that bit flips caused it.
I have run queries at large companies and found mistakes most easily explained as bit flips in domain names written to disk. Imagine an environment variable configuring the use of a proxy without proper whitelists; it's not unimaginable to me that a production machine would be able to speak to machines on the internet at large.
I am open to the idea that what I think is happening might not be the mechanics of what is happening, but I find the talk believable, not based on theory, but actually seeing persisted (and non-persisted) bit flips in domain names queried from data warehoused logs at world scale companies.
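For the curious, the bitsquatting premise from the talk is easy to reproduce: enumerate every string that is exactly one bit flip away from a domain and see which of them are themselves plausible hostnames. A rough sketch in Python (example.com stands in for any real domain; this is illustrative, not the talk's actual tooling):

```python
def bitflip_variants(domain: str) -> list[str]:
    """Return every distinct string reachable from `domain` by
    flipping exactly one bit of one character."""
    variants = set()
    for i, ch in enumerate(domain):
        for bit in range(8):
            flipped = chr(ord(ch) ^ (1 << bit))
            variants.add(domain[:i] + flipped + domain[i + 1:])
    return sorted(variants)

# Only some variants are themselves registrable hostnames; those are
# the ones that can silently receive a bit-flipped machine's traffic.
plausible = [v for v in bitflip_variants("example.com")
             if all(c.isalnum() or c in "-." for c in v)]
```

An 11-character name yields 88 one-bit variants, which is why a popular domain sees a steady trickle of this traffic at scale.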
https://en.wikipedia.org/wiki/Soft_error
I was in the audience of the talk. All devices should be required to use ECC, because the lack of it is a security risk. Not as large a one as in the http:// era, but silent corruption across networks and systems is a real thing.
Space weather is solar activity.
From the link in your comment: "The average rate of cosmic-ray soft errors is inversely proportional to sunspot activity. That is, the average number of cosmic-ray soft errors decreases during the active portion of the sunspot cycle and increases during the quiet portion."
On the other hand, there are many, many systems out there that don't have ECC, nor even the option of it. While every video on YouTube wants us to believe that the difference between 580 and 585 frames per second in some silly game or another makes all the difference in the world, for me the difference between a system that runs 10% slower and one that crashes in the middle of the night is what's actually significant. I test all my systems at a certain memory frequency, then back off to the next slower frequency just to be sure.
That doesn't stop memory errors from happening, but most of my systems have lived their entire lives without random crashes or random segfaults. I consider that worthwhile.
I'd like to think that as NAND continues to scale up in capacity and down in cost, we'll see a real shakeup in filesystems and storage, where self-healing mass storage can be genuinely commoditized, not something that's only accessible to businesses (and computing enthusiasts) due to cost and complexity.
Once I coded a shell script that verified all my photos, but I don’t bother with that anymore. I just back everything up, and if there’s ever a problem, the parity files provide an additional safety net.
I built a home server last year with an ASRock X570M Pro4 [0] with a Ryzen 4750 PRO (which I had to source OEM from Aliexpress as it's not sold direct). I'm not sure what's the current situation, but the only RAM I could find for it was the Kingston Server Premier KSM32ED8 [1], and the ECC premium was not fun to pay.
[0] https://www.asrock.com/MB/AMD/X570M%20Pro4/index.asp
[1] https://www.kingston.com/en/memory/server-premier/ddr4-3200m...
Got an HP that has both an AMD PRO APU and DDR5 slots, with no soldered RAM, i.e. all the requirements.
It was $500 to $1,500 depending on configuration. Then 16 or 32 GB of ECC SODIMM runs over $2,000 for regular consumers! And that's if you can find it in stock!
Like my desktop just froze, and then it never happened again. It only makes sense if it was a random bit-flip.
The actual RAM speed never mattered; you can't tell the difference between 150 FPS and 165 FPS (even though my screen's refresh rate is 280 Hz).
Additionally, the on-die ECC of DDR5 won't report errors to your OS. With true ECC memory, corrected errors can be handled by the OS, and you'll even be informed of uncorrectable 2-bit errors.
Want to protect against 2-bit errors? Make sure your platform supports chipkill ECC.
HN readers seem to have a skewed idea of how useful ECC is while pretending the downsides don't exist. Not everyone is primarily using their system as a workstation.
Back then it was recommended to run a defragger every so often, so I set up a cron job to run it every Saturday night or something like that. The net result was that every file block that got moved made a trip through memory with some small probability of getting corrupted. Often the errors were in files that weren't used that often so I didn't immediately notice. The net result is that after many months of this, I started noticing PDF files that were corrupted, or mp3 files that would hiccup in the middle even though it used to play perfectly before. Sadly, I had ripped my 500-ish CD collection and then had gotten rid of the physical CDs.
I noticed (after some Windows bluescreens) in memtest that the memory was showing some errors. Ordered another 16 GB pair, replaced it, and... the problem persisted.
Suspecting the motherboard, I pretty much said "well, I'm not replacing the mobo now; it will have to wait for the next hardware refresh." Gaming PC, so no big deal. And now I had 32 GB of RAM in the PC.
Weirdly enough, problem only happened when running on multi-core memory test.
Cue ~1 year later, and my power supply just... died. Guessing bad caps, I just ordered another and thought nothing of it. On a whim I ran memtest and...
nothing. All fixed. Repeated few times and it was just fine, no bluescreen for ~ 2 years now too.
I definitely want my next machine to have ECC, but the DDR5 consumer ECC situation looks... weird. I'm not sure whether I should be happy with on-die ECC; I'd really prefer to have the whole CPU-to-memory path ECCed.
Secondly, once a good list of known faulty memory addresses had been created by memtest, one can tell the operating system not to use them. Then you can keep using your old hardware without the reliability problems. Although, it is possible that further areas of memory will subsequently fail, and without ECC, you'll still be vulnerable to random (cosmic ray-induced) bit flips.
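As a sketch of that workaround: Linux accepts a `memmap=<size>$<start>` boot parameter that marks a region reserved so the kernel never allocates it (in a GRUB config the `$` has to be escaped, and GRUB also offers `GRUB_BADRAM` for mask-based exclusion). A small helper to turn a memtest-reported address into such a parameter, assuming 4 KiB pages:

```python
def memmap_param(bad_addr: int, page_size: int = 4096) -> str:
    """Format a Linux `memmap=<size>$<start>` boot parameter that
    reserves the whole page containing a bad address reported by
    memtest. In /etc/default/grub the `$` must be escaped as `\\$`."""
    start = bad_addr & ~(page_size - 1)   # align down to the page boundary
    return f"memmap={page_size // 1024}K${start:#x}"
```

For example, a failing address of `0x1a2b3c4d8` yields `memmap=4K$0x1a2b3c000`, which excludes that one 4 KiB page at the cost of negligible capacity.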
The same ones, or random new machines every time?
https://blog.codinghorror.com/building-a-computer-the-google...
Gigabyte really did mean that DDR4-3200 was the limit for Pinnacle Ridge and older AMD CPUs.
PSU is something I never cheap out on. Always pays for itself in the end. A bad PSU can kill your whole system.
And if there was cause for alarm, I would think long and hard about imaging from the original computer at all. With certain failure modes in drives, just reading could cause more corruption; each failed attempt could lose data.
But yeah, happy you did it this way in the end, because I learned a ton from the resulting blog post!
My iMac Pro has it as well.
Even with ECC, it's incredibly hard to know that a given one-off issue isn't a memory error, because even ECC can't detect 100% of memory issues. But without ECC, it's also nearly impossible to know if something is a memory error. If it's bad RAM, the same address will likely continue to exhibit bad behavior, but if it's a solar flare, you're never going to know the difference; you will just get incorrect behavior that may or may not crash, and it will be completely impossible to reproduce.
One big reason you don't hear it as much is there are not nearly as many data centers filled with Macs. There are definitely a few, and I bet if you got an experience report from them, they could give some idea of how visible memory errors are on Macs (although it's hard, because again, if you don't have ECC, there's not really a good way to know if something is a memory error; you can only really postulate.)
Without it the corruption is silent. Then this kind of thing happens:
https://news.ycombinator.com/item?id=35026440
Which is another reason not to solder the storage either.
Suppose you have a system board with bad soldered memory and you want to copy your data off of it onto the new one. Well, the memory is flipping random bits as it's copying, but the flash chips are permanently attached to the same board as the bad memory.
Otherwise it would have been just a support ticket; now it's something worse.
To be clear, I do not believe that the tools are at fault - rather, the SATA/SAS/IDE controllers have a different design goal, and software tools can only do so much.
Tools like DeepSpar (HW+SW) and PC-3000 (also HW+SW) allow a scary level of nitty-gritty access to the hardware, including flashing SSD/HDD controller firmware in case it went pear-shaped. For data recovery, be it in a forensic context or for retrieving important, irreplaceable data, I have always had a nerd-lust for those tools. Used them at a previous job, but can't ever justify the price for personal and very infrequent use. :)
I just got through a round of overclocking my memory. Yes, heat does matter.
>tRFC is the number of cycles for which the DRAM capacitors are "recharged" or refreshed. Because capacitor charge loss is proportional to temperature, RAM operating at higher temperatures may need substantially higher tRFC values.
https://github.com/integralfx/MemTestHelper/blob/oc-guide/DD...
If anyone has the link, it's missing from my collection...
Upgrade to DDR5 RAM, the latest standard, which has on-die ECC but is not as good at spotting bit flips as proper ECC memory with a separate, extra error-correction chip.
https://en.wikipedia.org/wiki/DDR5_SDRAM#:~:text=Unlike%20DD....
While proper ECC RAM and motherboards exist, I'm surprised that a cheaper solution equally as good as proper ECC doesn't exist, although I know some would argue that DDR5 is a step in the right direction of a marathon.
I guess the markets know best and chase the numbers; assuming, that is, they're also using proper ECC memory and binary-coded decimal rather than floating-point arithmetic (which introduces errors), something central banks have been doing for decades?
https://en.wikipedia.org/wiki/Floating-point_error_mitigatio...
“There still exist non-ECC and ECC DDR5 DIMM variants; the ECC variants have extra data lines to the CPU to send error-detection data, letting the CPU detect and correct errors that occurred in transit.”
https://www.anandtech.com/show/18732/asrock-industrial-nucs-...
It happens quite often as a result of dust in the contacts when the memory was installed, weak solder on the chips or sockets, bad capacitors, etc.
None of which is that likely on machines in good working order, but many are not. And you can go from one to the other at any time as a result of a power spike or a cooling failure.
> This is really unlikely, though, and anything not mission-critical will no longer need the extra ECC computation on the CPU-side.
ECC computation is done in hardware anyway
I think you can say that because people are not routinely monitoring their surroundings for ionizing radiation.
If this were to change, I think we could start to identify some of those military locations that could be interfering with equipment, which would then expose the weakness of DDR5.
Unpopular-opinion counterpoint: the odds of this actually happening are vanishingly small. Many file formats have built-in integrity checks and tons of redundancy and waste. I wouldn't want to risk handling extremely valuable private keys or conducting high-value cryptocurrency transactions, I suppose, on a machine without ECC memory, but that just doesn't come up in most knowledge-worker or end-consumer scenarios.
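As one concrete instance of those built-in integrity checks: PNG stores a CRC-32 after every chunk, so a flipped bit in the file is at least detectable even without ECC anywhere. A minimal sketch of walking a PNG and flagging chunks whose stored CRC no longer matches:

```python
import struct
import zlib

def check_png_chunks(data: bytes) -> list[str]:
    """Walk a PNG's chunks and report the types of any whose stored
    CRC-32 does not match a CRC computed over the chunk type + payload."""
    assert data[:8] == b"\x89PNG\r\n\x1a\n", "not a PNG"
    bad, pos = [], 8
    while pos < len(data):
        # Each chunk: 4-byte big-endian length, 4-byte type,
        # payload, then a CRC-32 over type + payload.
        (length,) = struct.unpack(">I", data[pos:pos + 4])
        ctype = data[pos + 4:pos + 8]
        payload = data[pos + 8:pos + 8 + length]
        (stored,) = struct.unpack(">I", data[pos + 8 + length:pos + 12 + length])
        if zlib.crc32(ctype + payload) != stored:
            bad.append(ctype.decode("ascii"))
        pos += 12 + length
    return bad
```

A corrupted chunk is detected, though not repaired; that's where backups and parity files come back in.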
The odds of actually getting bit by this in a way that matters to you are really low, which is why nobody cares.