Google's initial strategy (c. 2000) around this was to save a few bucks on hardware, get non-ECC memory, and then compensate for it in software. It turns out this is a terrible idea, because if you can't count on memory being robust against cosmic rays, you also can't count on the software being stored in that memory being robust against cosmic rays. And when you have thousands of machines with petabytes of RAM, those bitflips do happen. Google wasted many man-years tracking down corrupted GFS files and index shards before they finally bit the bullet and just paid for ECC.
Did they (Google) or he (Craig Silverstein) ever officially admit it on record? I did a Google search and the results that came up were all on HN. Did they at least put out a few PR pieces saying that they're using ECC memory now? Because I don't see any when searching. Admitting they made a mistake without officially saying so?
I mean, the whole "servers (or computers) might not need ECC" insanity was started entirely because of Google [1][2], with news and articles published even in the early 00s [3]. After that it spread like wildfire and became a commonly accepted fact that even Google doesn't need ECC. Just like "Apple uses custom ARM instructions to achieve their fast JS VM performance" became a "fact". (For the last time: no, they didn't.) And proponents of ECC memory have been fighting this misinformation like mad for decades, to the point of giving up and only ranting about it every now and then. [3]
[1] https://blog.codinghorror.com/building-a-computer-the-google...
https://semiengineering.com/what-designers-need-to-know-abou...
There has been a lot of debate regarding this that was summarised in this post -
On-die ECC is going to be a standard feature for DDR5. I'm not aware of any indication that anyone has implemented on-die ECC for DDR4 DRAM, and Hynix at least has made clear statements that on-die ECC is new for their DDR5 and was not present in their DDR4.
There are many various DRAMs in a server (say, for disk cache). Has Google or anyone who operates at a similar scale seen single bit errors in these components?
When America Online was buying EV6 servers as fast as DEC could produce them, they used to see about 1 double-bit error per day across their server farm, each of which would reboot the whole machine.
DRAM has only gotten worse--not better.
Bit flips (for all reasons) occur in buses, registers, caches, etc. Anything that has state can have state changed incorrectly.
This is why filesystems like ZFS exist and storage formats have pervasive checksums.
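The idea is simple enough to sketch: pair every block with a checksum at write time and verify it at read time. A minimal illustration in Python, using CRC32 purely as a stand-in for the stronger hashes real filesystems use (ZFS uses fletcher4 or SHA-256):

```python
import zlib

def store_block(data: bytes) -> bytes:
    # Prepend a CRC32 of the payload so corruption is detectable on read.
    return zlib.crc32(data).to_bytes(4, "big") + data

def load_block(blob: bytes) -> bytes:
    stored = int.from_bytes(blob[:4], "big")
    data = blob[4:]
    if zlib.crc32(data) != stored:
        raise ValueError("checksum mismatch: block is corrupt")
    return data

blob = store_block(b"hello")
assert load_block(blob) == b"hello"

# A single flipped bit in the payload is caught on read.
corrupted = blob[:5] + bytes([blob[5] ^ 0x01]) + blob[6:]
try:
    load_block(corrupted)
except ValueError:
    pass  # detected, as expected
```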
https://www.newyorker.com/magazine/2018/12/10/the-friendship...
Edit: found one with a quick search. https://nakedsecurity.sophos.com/2011/08/10/bh-2011-bit-squa...
And https://www.researchgate.net/publication/262273269_Bitsquatt...
Random bit flips happen on client machines and on routers.
If there are enough requests for a domain name, some of those requests will be subject to one of those bit-flips.
> I've never thought of defensive programming in terms of adversarial memory.
Generally this goes against software engineering principles. You don't try to eliminate the chances of failure and hope for the best. You need to create these failures constantly (within reasonable bounds) and make sure your software is able to handle them. Using ECC RAM is the opposite: you make errors so unlikely that you will generally not encounter them even at scale, but nonetheless they can still happen, and now you will be completely unprepared to deal with them, since you chose to ignore this class of errors and sweep it under the rug.
Another interesting side effect of quorum is that it also makes certain attacks more difficult to pull off, since now you have to make sure that a quorum of machines gives the same "wrong" answer for an attack to work.
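A toy sketch of the read side of that idea, assuming each replica simply reports its value (real quorum systems like Paxos or Raft are far more involved):

```python
from collections import Counter

def quorum_read(replica_values, quorum):
    # Accept a value only if at least `quorum` replicas agree on it.
    value, count = Counter(replica_values).most_common(1)[0]
    if count < quorum:
        raise RuntimeError("no quorum reached")
    return value

# A single bit-flipped (or maliciously altered) replica is outvoted.
assert quorum_read([b"x", b"x", b"y"], quorum=2) == b"x"
```

An attacker now has to corrupt a majority of replicas identically, not just one.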
Using consumer hardware and making up for its reliability with redundancy and software was not a bad idea for early Google, but it did come with an unforeseen cost. Even a thousand machines in a cosmic-ray-proof bunker will end up with memory errors that ECC would correct for free. ECC is just reducing the surface area of "potential problems".
1. Single-bitflip correction, along with Google's metrics, could help them identify algorithms they've got, customers' VMs that are causing bitflips via rowhammer, and machines which have errors regardless of workload.
2. Double bitflip detection lets Google decide if they say, want to panic at that point and take the machine out of service, and they can report on what software was running or why. Their SREs are world-class and may be able to deduce if this was a fluke (orders of magnitude less likely than a single bit flip), if a workload caused it, or if hardware caused it.
The advantage the 3 major cloud providers have is scale. If a Fortune 500 were running their own datacenters, how likely would it be that they have the same level of visibility into their workloads, the quality of SREs to diagnose, and the sheer statistical power of scale?
I sincerely hope Google is not simply silencing bitflip corrections and detections. That would be a profound waste.
There is an OS that pretty much fits the bill here. There was a show where Andrew Tanenbaum had a laptop running Minix 3 hooked up to a button that injected random changes into module code while it was running, to demonstrate its resilience to random bugs. Quite fitting that this discussion was initiated by Linus!
Although it was intended to protect against bad software I don't see why it wouldn't also go a long way in protecting the OS against bitflips. Minix 3 uses a microkernel with a "reincarnation server" which means it can automatically reload any misbehaving code not part of the core kernel on the fly (which for Minix is almost everything). This even includes disk drivers. In the case of misbehaving code there is some kind of triple redundancy mechanism much like the "quorum" you suggest, but that is where my crude understanding ends. AFAIR Userland software could in theory also benefit provided it was written in such a way to be able to continue gracefully on reloading.
The Apollo missions (or was it the Space Shuttle?) did this. They had redundant computers that would work with each other to determine the “true” answer.
It’s so easy to chalk these kinds of errors up to other issues: a little corruption here, a running program going berserk there. Could be a buggy program or a little accidental memory overwrite; a reboot will fix it.
But I ran many thousands of physical machines with petabytes of RAM. I tracked memory bit-flip errors and they were _common_. Common even with less dense memory, in thick metal enclosures surrounded by mesh, and density and shielding impact bitflips a lot.
My own experience tracking bitflips across my fleet led me to buy a Xeon laptop with ECC memory (precision 5520) and it has (anecdotally) been significantly more reliable than my desktop.
I love that AMD doesn't intentionally break ECC on its consumer desktop platforms; I upgraded to a Threadripper in 2017.
We only use Xeons on developer desktops and production machines here precisely because of ECC. It's about 1 bit flip/month/gigabyte. That's too much risk when doing something critical for a client.
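Back-of-envelope arithmetic with the rate quoted above (treat the figure itself as rough; published estimates vary widely):

```python
ram_gb = 32                 # hypothetical developer workstation
flips_per_gb_per_month = 1  # the rate quoted above
months = 12

expected_flips = ram_gb * flips_per_gb_per_month * months
assert expected_flips == 384  # hundreds of flips per year on one box
```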
I’ve always believed that, ECC aside, DRAM made intentionally with big cells would be less prone to spurious bit-flips (and that this is one of the things NASA means when they talk about “radiation hardening” a computer: sourcing memory with ungodly-large DRAM cells, willingly trading off lower memory capacity for higher per-cell level-shift activation-energy.)
If that’s true, then that would mean that the per-cell error rate would have actually been increasing over the years, as DRAM cell-size decreased, in the same way cell-size decrease and voltage-level tightening have increased error rate for flash memory. Combined with the fact that we just have N times more memory now, you’d think we’d be seeing a quadratic increase in faults compared to 40 years ago. But do we? It doesn’t seem like it.
I’ve also heard a counter-effect proposed, though: maybe there really are far more “raw” bit-flips going on — but far less of main memory is now in the causal chain for corrupting a workload than it used to be. In the 80s, on an 8-bit micro, POKEing any random address might wreck a program, since there’s only 64k addresses to POKE and most of the writable ones are in use for something critical. Today, most RAM is some sort of cache or buffer that’s going to be used once to produce some ephemeral IO effect (e.g. the compressed data for a video frame, that might decompress incorrectly, but only cause 16ms of glitchiness before the next frame comes along to paper over it); or, if it’s functional data, it’s part of a fault-tolerant component (e.g. a TCP packet, that’s going to checksum-fail when passed to the Ethernet controller and so not even be sent, causing the client to need to retry the request; or, even if accidentally checksums correctly, the server will choke on the malformed request, send an error... and the client will need to retry the request. One generic retry-on-exception handler around your net request, and you get memory fault-tolerance for free!)
If both effects are real, this would imply that regular PCs without ECC should still seem quite stable — but that it would be a far worse idea to run a non-ECC machine as a densely-packed multitenant VM hypervisor today (i.e. to tile main memory with OS kernels), than it would have been ~20 years ago when memory densities were lower. Can anyone attest to this?
(I’d just ask for actual numbers on whether per-cell per-second errors have increased over the years, but I don’t expect anyone has them.)
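The "one generic retry handler" point above can be made concrete. A minimal sketch (the function names here are made up for illustration):

```python
import time

def with_retries(fn, attempts=3, delay=0.01):
    # Retry on any exception: a request corrupted in a buffer will
    # usually fail a checksum or parse step and succeed on retry.
    for i in range(attempts):
        try:
            return fn()
        except Exception:
            if i == attempts - 1:
                raise
            time.sleep(delay)

calls = []
def flaky_request():
    calls.append(1)
    if len(calls) < 2:
        raise ConnectionError("garbled request")  # e.g. a bit-flipped buffer
    return "ok"

assert with_retries(flaky_request) == "ok"
```

Wrap every network call in something like this and a one-off memory fault in a request buffer degrades into a small latency blip instead of a failure.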
Think of the number of events that can flip a bit. If you make bits smaller, you get a modestly larger number of events in a given area capable of flipping a bit, spread across a larger number of bits in that area.
That is, it's flip event rate * memory die area, not flip event rate * number of memory bits.
In recent generations, I understand it's even been a bit paradoxical-- smaller geometries mean less of the die is actual memory bits, so you can actually end up with fewer flips from shrinking geometries.
And sure, your other effect is true: there's a whole lot fewer bitflips that "matter". Flip a bit in some framebuffer used in compositing somewhere-- and that's a lot of my memory-- and I don't care.
Otherwise, I would think that an unlikely event becoming 1000x more likely by sheer numbers would have warped your perception.
I believe that hardware reliability is mostly irrelevant, because software reliability is already far worse. It doesn't matter whether a bitflip (unlikely) or some bug (likely) causes a node to spuriously fail, what matters is that this failure is handled gracefully.
The libraries we maintain (1) are responsible for a non-trivial part of Facebook's overall compute footprint, (2) should basically never fail of their own accord, and (3) have pretty good error monitoring. So my team is operating what is effectively (among other things) a very sensitive detector for hardware failure.
And indeed we see examples all the time of blobs that fail to decompress, and usually when we dig in we find that the blob is only a single bit-flip away from a blob that decompresses successfully into a syntactically correct message. I can't share numbers, but, off the top of my head, I think it's the largest source of failures we see. It happens frequently enough that I wrote a tool to automate checking [0].
So yes. It happens. Pretty frequently, in the sense that if you're doing xillions of operations a day, a one-in-a-xillion failure happens all the time.
[0] https://github.com/facebook/zstd/tree/dev/contrib/diagnose_c...
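The core trick such a tool can use is easy to sketch: try every single-bit flip of the corrupt blob and see whether any variant decompresses cleanly (shown here with zlib rather than zstd, purely for illustration):

```python
import zlib

def find_single_bitflip_fix(blob: bytes):
    # Brute-force every single-bit flip; return the first variant
    # that decompresses cleanly, or None if there isn't one.
    for i in range(len(blob)):
        for bit in range(8):
            candidate = bytearray(blob)
            candidate[i] ^= 1 << bit
            try:
                return bytes(candidate), zlib.decompress(bytes(candidate))
            except zlib.error:
                continue
    return None

original = zlib.compress(b"some payload " * 10)
corrupted = bytearray(original)
corrupted[4] ^= 0x10  # simulate a bit flip in RAM
assert find_single_bitflip_fix(bytes(corrupted)) is not None
```

If a "random" corruption is reliably one bit-flip away from a valid blob, the hardware is the prime suspect, not the software.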
Nevertheless, anyone who uses the computer for anything else besides games or movie watching, will greatly benefit from having ECC memory, because that is the only way to learn when the memory modules become defective.
Modern memory modules have shorter lifetimes than older ones, and they frequently begin to show occasional bit errors long before breaking down completely.
Without ECC, you will become aware that a memory module is defective only when the computer crashes or no longer boots and severe data corruption in your files could have happened some months before that.
For myself, this was the most obvious reason why ECC was useful, because I was able in several cases to replace memory modules that began to have frequent correctable errors, after many years with little or no errors, without losing any precious data and without downtime.
> It doesn't matter whether a bitflip (unlikely) or some bug (likely) causes a node to spuriously fail
Except that a bitflip can go undetected. It may crash your software or system, but it also may simply leak errors into your data, which can be far more catastrophic.
I can't give my source, but it's far higher than most people think. Just pay the money.
Looks like `mcelog --client` might be a starting place? Feed that into your metrics pipeline and alert on it like anything else...
Screen, Wi-Fi, and to a much lesser extent (unless under load) the CPU are the biggest culprits for poor battery life.
Of all the things to be worried about, like OS bugs, bad hardware configuration, etc., bad memory is one of the really troubling ones. You look at the code and say "it can't get here, because this was set", but when you can't trust your memory you can't trust anything.
And as the timeline goes to infinity, you may also get one of these reports and be asked to fix it... good luck.
See "Your computer is broken". They essentially inserted a stress test into the game that verified if the hardware was still doing calculations correctly, and if not, inform the user.
It is incomprehensible that there are still NAS devices being sold without ECC support.
Synology took a step in the right direction to offer prosumer devices with ECC but it is not really advertised as such. It is actually difficult to find which do have ECC and which ones don't.
I just looked it up, because if that were true it would have been news to me. Synology has been known to be stingy with hardware specs, but none of what I'd call prosumer (the Plus series) have ECC memory by default. And there are the "Value" and "J" series below that.
Edit: only two models from the new xx21 series, using the AMD Ryzen V, have ECC memory by default.
- Edit -
Also, bit flips in non-ECC memory are _the_ cause of the "bitrot" phenomenon. That is when you write X to a storage device but get Y when you read it back. A common explanation is that the corruption happens _at rest_. However, all drives from the last 30+ years have FEC support, so in reality the only way bit rot can happen is if the data is damaged _in transit_, while in RAM, on the way to/from the storage media.
So, if you're ever deciding whether to get ECC RAM: get it. It's very much worth it.
Problems can definitely happen in the IO controller, RAID controller, cable, and disk controller. AFAIK all of these were seen in practice and motivated the existence of ZFS. One of its creators' strongest convictions was that drives are universally lying bastards and should not be trusted any further than they can be thrown.
https://www.anandtech.com/show/15912/ddr5-specification-rele...
https://media-www.micron.com/-/media/client/global/documents...
When the value add feature becomes a necessity, it’s not a value add any more.
It seems redundant to have every module come with its own checking hardware.
For the memory controller, parity/ECC/chipkill/RAIM usually just involves adding additional memory planes to store correction data. I believe the rare exceptions are fully buffered memories, where you effectively have a separate memory controller on each module (or add-in card with DIMMs).
And now you have 8 bits of ecc per 32 data versus older DDR having 8 bits of ecc per 64 data. Hence the cost for dimm-wide ecc is going up.
But these days with the RAM density being so high and bitflipping attacks being more than a theoretical threat it seems like there's really no good reason not to switch to ECC everywhere.
Because ECC means Error Correcting Code, by definition, any board that claims ECC support must actually correct the errors. The ECC codes used now, with 8 extra bits for each 64 data bits, correct any 1-bit error and detect any 2-bit errors.
Very old computers (25 years old, or more) used parity instead of ECC and they just detected any 1-bit error (and any errors with an odd number of flipped bits), without being able to correct the errors.
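Parity is just the XOR of all the bits. A small sketch showing why it detects single flips but is blind to double flips:

```python
def parity(word: int) -> int:
    # Even-parity bit over a 64-bit word: XOR of all its bits,
    # folded down with shifts.
    word ^= word >> 32
    word ^= word >> 16
    word ^= word >> 8
    word ^= word >> 4
    word ^= word >> 2
    word ^= word >> 1
    return word & 1

stored = 0xDEADBEEF
p = parity(stored)

assert parity(stored ^ (1 << 7)) != p             # 1 flip: detected
assert parity(stored ^ (1 << 7) ^ (1 << 3)) == p  # 2 flips: missed
```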
It's faster too.
But I don't know how relevant these metrics from 2009 are. Did memory get better or worse for bit flips compared to 2009?
When I first tried to replicate the row hammer attack I was not getting any results. Turns out I was doing this on ECC. On non ECC memory the same test easily replicated the row hammer attack.
Its.
There, I finally corrected Linus Torvalds in something. :))
Now the point about internally doing ECC is an interesting one, could be a way out of this mess. And apparently ECC is more available in AMD land
The first time I wrote "your" instead of "you're" in English I thought it was quite a milestone!
Yes it is. The problem is they dont really advertise it. I'm not certain but it might even be standard on AMD chips, but if they dont say so and board makers are also unclear, who knows...
(Note: EdDSA is still much much better than ECDSA, most notably because it's easier to implement correctly.)
Support was only interested if their built-in memory tester showed errors, which it wouldn't; even on its most thorough setting, it would only run for ~3 hours. IIRC, the BMC was logging "correctable memory errors", but I may be misremembering that.
"We've run this test on every server we've gotten from you, including several others that were exactly the same config as this, this is the only one that's ever thrown errors". Usually support is really great, but they really didn't care in this case.
We finally contacted sales. "Uh, how long do we have to return this server for a refund?" All of a sudden support was willing to ship us out a replacement memory module (memtest86 identified which slot was having the problem), which resolved the problem.
They were all too willing to have us go to production relying on ECC to handle the memory error.
Good call in not accepting this. Even ignoring the possibility you have a double-bit error that causes a crash, or a triple-bit error that maybe can't be detected, frequent ECC errors are problematic. I've encountered machines that consistently ran my software horribly slowly. I don't remember specifics, but let's say at least 100X latency of other machines for similar operations. When I dug in, I found these machines had a huge amount of correctable memory errors. The correction apparently degrades performance significantly. I'm not sure exactly why, but I guess there's an MCE trap to report the memory error, and perhaps that path is slow.
But you’ve got it backwards about the incentives. A manufacturer has less incentive to deliberately ship a defective part in the case of ECC modules. If the modules consistently log ECC errors, they can easily be identified and returned under warranty to the manufacturer. A consumer is much less likely to identify an intermittent problem with a non-ECC part.
Not if you're using a typical 72-bit SECDED code[0].
You have two error indicators: a summary parity bit (even number of errors: 0, 2, etc. vs odd number of errors: 1, etc.), and an error index: 0 for no errors, or the bitwise XOR of the locations of each bit error.
For a triple error at bits a, b, and c, you'll have a summary parity of 1 (odd number of errors, assumed to be 1), and an error index of a^b^c, in the range 0..127, of which 0..71 [1] (56.25%, a clear albeit not overwhelming majority) will correspond to legitimate single-bit errors.
0: https://en.wikipedia.org/wiki/Hamming_code#Hamming_codes_wit...
1: or 72 out of 128 anyway; the active bits might not all be assigned contiguous indexes starting from zero, but it doesn't change the probability and it's simpler to analyse if summary is bit 0 and index bit i is substrate bit 2^i.
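The arithmetic in that comment is easy to model. A toy version (positions only, not a full Hamming implementation):

```python
WORD_BITS = 72  # 64 data bits + 8 check bits, as in DIMM-wide SECDED

def decode_syndrome(flipped_positions):
    # Summary parity distinguishes odd from even error counts;
    # the error index is the XOR of all flipped positions.
    summary = len(flipped_positions) % 2
    index = 0
    for pos in flipped_positions:
        index ^= pos
    return summary, index

# Triple error at bits 3, 17 and 40:
summary, index = decode_syndrome([3, 17, 40])
assert (summary, index) == (1, 58)
# 58 < 72, so the decoder sees what looks like a single-bit error at
# position 58 and "corrects" it, turning a 3-bit error into a 4-bit one.
```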
However I don't remember if there are provisions for ECC checking in case there are some dedicated refresh commands. I hope so, but I'm not sure.
https://www.tomshardware.com/reviews/ecc-memory-ram-glossary...
It would be a lot slower than real ECC, but it could be used just for operations that are especially vulnerable to bit flips. It would also not know for certain whether the data segment or the segment holding the checksum was corrupted, beyond their relative sizes (the checksum is much smaller, so a bit flip in its memory region is less likely).
There's also the obvious tactic of just storing every logical 64-bit word as 128 bits of physical memory, which gives you room for all kinds of crap[1], at the expense of halving your effective memory and memory bandwidth.
0: This is extremely cheap since you're loading a 64- vs 128-bit value, with no extra round trip time and still fits in a cache line, so you're likely just paying extra memory use from larger page tables.
1: Offhand, I think you could fit triple or even quadruple error correction into that kind of space (there's room for eight layers of SECDED, but I don't remember how well bit-level ECC scales).
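The crudest version of that 128-bits-per-word idea is detection by redundancy: store each word alongside its bitwise complement and check consistency on read. A sketch (real schemes would spend those bits on actual correction):

```python
MASK = (1 << 64) - 1

def encode(word: int):
    # Store the word with its bitwise complement as the second half.
    return word, ~word & MASK

def is_consistent(word: int, comp: int) -> bool:
    # Any single bit flip in either half breaks the invariant.
    return (word ^ comp) == MASK

w, c = encode(0x0123456789ABCDEF)
assert is_consistent(w, c)
assert not is_consistent(w ^ (1 << 13), c)  # flip detected
```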
Bit flips happen and are real. I really wish ECC was plentiful and not brutally expensive!
Finding a memory upgrade seems difficult though.
It is filled to the gunwales with ECC RAM.
Cost him the equivalent of $7k or so. Eeek.
0: https://www.lenovo.com/us/en/laptops/thinkpad/thinkpad-p/Thi...
Xeon with ECC are not that overpriced compared with similar Core without. Likewise, RAM sticks with ECC are cheap to produce (basically just one more chip to populate per side per module). Likewise soldered RAM would simply add maybe $10 or $20 of extra chips.
I'm guessing you won't find any.
Now back to ECC. I'll probably be corrected, but I don't think ECC gains more than two orders of magnitude, so we still need incredibly reliable RAM. If we move to ECC RAM by default everywhere, aren't we simply going to get less reliable RAM in the end?
So I'd say ECC is not only important but insanely impactful. There's a reason why many organizations don't even want to hear about getting rigs with non-ECC memory.
I understand altitude is roughly proportional to cosmic ray exposure, and the number of bits multiplies the probability of an error. I'm presuming there is also an inherent error rate to DRAM separate from the environment. But what are those numbers?
"33 to 600 days to get a 96% chance of getting a bit error." Still, it seems way too high. I guess anyone with ECC RAM could confirm that they are getting those sort of recovered error rates?
My point is, when you say there is a "96% chance of having an error in THREE DAYS", one would EXPECT to be having issues like.. all the time? So I'm not disagreeing with you, but with the amount of non-ECC machines all over the world and how insanely stable modern machines are, it still seems like a very low risk.
Now of course I agree that if you want to take every precaution, go ECC; but simple observation proves that this "problem" can't be as bad as the numbers say.
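How alarming the headline numbers sound depends entirely on the assumed per-bit error rate, which is plausibly why casual observation and the quoted figures disagree. A quick Poisson-style estimate (the rates below are assumptions; published measurements span orders of magnitude):

```python
import math

bits = 8 * 2**30 * 8       # 8 GiB of RAM, in bits
rate_per_bit_hour = 1e-13  # assumed; real figures vary enormously
hours = 3 * 24             # three days

# Probability of at least one bit error in the window.
p_error = 1 - math.exp(-bits * rate_per_bit_hour * hours)
assert 0.3 < p_error < 0.5  # roughly 40% with these assumptions

# Drop the assumed rate by 100x and the same machine looks rock solid.
p_low = 1 - math.exp(-bits * 1e-15 * hours)
assert p_low < 0.01
```

Two machines that "feel" equally stable can differ by a couple of orders of magnitude in underlying flip rate, which is exactly what ECC logging would reveal.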
As far as we know, our computer has never had an undetected error. -- Weisert
If it needs ECC memory to do that, then fit it with ECC memory. If there are other ways to achieve that (for example deeper dram cells to be more robust to cosmic rays) that's fine too.
Just meet the reliability spec - I don't care how.
That's why I've always been on the fence with this ECC thing. For servers it's vital because you need stability and security.
For desktops, I think that for a long time it was fine without ECC. If I have to choose between having, say, 30% more RAM or avoiding a potential crash once a year, I'll probably take the additional RAM.
The problem is that now these problem can be exploited by malicious code instead of just merely happening because of cosmic rays. That's the main argument in favour of ECC IMO, the rest is just a tradeoff to consider.
Basically you can register domains that differ from popular domains by a single bit and start getting email and such for those domains.
If I recall correctly the example given was a variation of microsoft.com
All because so much equipment doesn't use ECC
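Generating the candidate "bitsquat" domains is only a few lines. A sketch:

```python
import string

HOST_CHARS = set(string.ascii_lowercase + string.digits + "-")

def bitsquats(domain: str):
    # All single-bit-flip variants that are still plausible hostnames.
    variants = set()
    for i, ch in enumerate(domain):
        if ch == ".":
            continue
        for bit in range(8):
            flipped = chr(ord(ch) ^ (1 << bit))
            if flipped in HOST_CHARS and flipped != ch:
                variants.add(domain[:i] + flipped + domain[i + 1:])
    return variants

# 'o' ^ 0x01 == 'n', 'r' ^ 0x40 == '2', etc.
squats = bitsquats("microsoft.com")
assert "micrnsoft.com" in squats
assert "mic2osoft.com" in squats
```

Register a handful of these and a small fraction of the world's non-ECC machines will eventually send traffic your way.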
At Google even with ECC everywhere there wasn't enough systematic error detection and correction to prevent the global database of monitoring metrics from filling up with garbage. /rpc/server/count was supposed to exist but also in there would be /lpc/server/count and /rpc/sdrver/count and every other thing. Reminded me daily of the terrors of flipped bits.
I run some large ML models on my home PC and I get NaNs and some out-of-range floats every month or so. I have spent hours debugging, but doing the same computation with the same random seeds does not recreate the problem.
How about GPUs and their GDDR SDRAM? Do they have parity bits?
[1]: https://twitter.com/catfish_man/status/1335373029245775872?l...
If you think it doesn't matter: how do you know? If you don't run with ECC memory, you'll never know if memory was corrupted (and recovered).
That blue screen, that sudden reboot, that program crashing. That corrupted picture of your kid.
Who knows.
I'll tell you who knows: every goddamn sysadmin (or the modern equivalent) can tell you how often they get ECC errors. Even at a small scale you'll encounter them. I have, on servers and even on a SAN storage controller, for crying out loud.
If you care about your data, use ECC memory in your computers.
But.. in all my time operating servers over 3 decades, it's always been bad drivers, bad code and problematic hardware that's caused most of my headaches.
Have I seen ECC error corrections in logs? Yeah. I don't advocate against it, but I've found that for most people you design around multiple failure scenarios more than you design around preventing specific ones.
Take the average web app - you run it on 10 commodity systems and distribute the load.. if one crashes, so what. Chances are, a node will crash for many more reasons other than memory issues.
If you have an app that requires massive amounts of RAM, or you do put all of your eggs in one basket, then ECC makes sense...
I just know i like going horizontal and I avoid vertical monoliths.
It’s a tradeoff between money/performance and the frequency of crashes, corruption etc.
Bit rot is just one of many threats to my data. Backups take care of that as well as other threats like theft, fire, accidental deletion.
This is similar to my reasoning around the recent side channel attacks on intel CPUs. If I had a choice I’d like to run with max performance without the security fixes even though it would be less secure. Not because I don’t care about security but because 1% or 5% perf is a lot and I’d rather simply avoid doing anything security critical on the machine entirely than take that hit.
I would have bought computers when I "wanted one". Now I buy them when I need one. Because buying a non-ECC computer just feels like buying a defective product.
In the last 10 years I would have bought TWICE as many computers if they hadn't segmented their market.
Fuck Intel. I sense that Linus censored himself in this post and, like me, is even angrier than the text implies.
Price is not nice though.
I am trying to get a laptop with dual NVMe (for ZFS) and ECC RAM. I can't get that, at all - even without the other fancy things I would like such as a 4k OLED with pen/touchscreen.
In 2020, even the Dell XPS stopped shipping OLED (goodbye dear 7390!)
I will gladly give my money to anyone who sells AMD laptop with ECC. Hopefully, it will show there's demand for "high end yet non bulky laptops"
I hope AMD will create a better market for the ECC laptop memory (right now it's hard to find + expensive).
If in doubt, get ECC. Do your own research on how it works and why. This post won’t explain it, just will blame Intel (probably rightfully so).
> We have decades of odd random kernel oopses that could never be explained and were likely due to bad memory. And if it causes a kernel oops, I can guarantee that there are several orders of magnitude more cases where it just caused a bit-flip that just never ended up being so critical.
It might be false, but I think it's a reasonable assumption.
The industry has convinced the average consumer-hardware user that PPA (power, performance, area) is all that needs to get better with generational improvements. Here's hoping that the security and reliability concerns that have come to light in the recent past change this.
Naively, I can understand why error reporting has dependencies on other parts of the system, but it would seem possible for error correction to work transparently.
Modern CPUs have integrated memory controllers, so that's why the CPU needs to support it.
Correction without reporting isn't great; anyway, you need a reporting mechanism for uncorrectable errors, or all you've done is ensure any memory errors you do experience are worse.
This is in line with all technical parameters of DRAM: everything must be as cheap as possible, and all the difficult parts are moved to the memory controller.
Which is the right thing to do, because you can share one memory controller with multiple DRAM chips.
I would think the only guaranteed solutions to Rowhammer are actually cryptographic digests and/or guard pages.
[1] https://www.zdnet.com/article/rowhammer-attacks-can-now-bypa...
However, flipping three bits simultaneously isn't trivial, and attempts that flip fewer bits will be detected and logged.
https://eclecticlight.co/2020/12/09/what-happens-when-an-m1-...
I have spent 36 years fielding embedded devices in core network (D1/E1, SONET, ROADM/MPLS, Cellular basestation) and I will tell you that large ECC covered memory arrays always show small numbers of correctable error events over the course of a year. I have seen, over the course of my career, exactly one controller card replaced early in the field, because it started throwing excessive recoverable ECC events over time, until it hit a threshold of 10x the average of a typical board. On the order of ten recoverable ECC events per month instead of one event per month. I have never observed a logged non-correctable ECC event in the field. In the lab, yes, but never in fielded equipment.
If you are fine with your PC experiencing one or two bits flipped in memory every month, then you really don't need ECC. That is the question you need to answer.
For mission critical systems? ECC is a requirement.
The phrase that strikes me is "horribly bad market segmentation". I agree 100%.
Remember when the Pentium/pro/2/3 could operate in single and dual socket configurations with ECC? The same CPU that plugged into your low end consumer board could also plug into a high end server/workstation board. All you needed was the right motherboard.
[1] https://web.archive.org/web/*/https://www.realworldtech.com/...
I am not talking about servers dealing with critical data.
Suppose that I maintain a repository (documents, audio and video), one copy in a ZFS-ECC system and one in an ext4-nonECC system.
Would I notice a difference between these two copies after 5-10 years?
That tells us if ECC matters for most people.
The most likely impact (other than nothing, if bits are flipped in unused memory) is program crashes or system lock-ups for no apparent reason.
https://www.wnycstudios.org/podcasts/radiolab/articles/bit-f...
I specifically was looking for bang for buck, low(er) wattage and ECC.
Off-topic, but I wonder if he trawls that site regularly. And then I wonder: is he here also? :)
Does ECC memory support dual channel??
ECC memory = memory with Error-Correcting Code
ECC encryption = Elliptic Curve Cryptography
If you care about ECC, you pay for Xeon. Majority of consumers don't run critical applications on their devices, so they are happy with a cheap device that may crash once in a while.
AMD is only changing the game because they are trying to undercut Intel. They have been putting pro features into all of their CPUs including over-clocking, extra PCIE lanes and ECC.
Honestly, what is the point of bullet-proof hardware when software reliability (at least on consumer devices) has gone down to two nines?
> AMD is only changing the game because they are trying to undercut Intel. They have been putting pro features into all of their CPUs including over-clocking, extra PCIE lanes and ECC.
You are correct to call them a corporation. AMD is not your friend, but they are the good actor in this fight.
It might present itself as a one-pixel colour difference, but it could be more damaging (incorrect finances in accounting software, for example). Software trusts memory; but memory can lie.
That’s dangerous.