Google's initial strategy (c. 2000) around this was to save a few bucks on hardware, get non-ECC memory, and then compensate for it in software. It turns out this is a terrible idea, because if you can't count on memory being robust against cosmic rays, you also can't count on the software being stored in that memory being robust against cosmic rays. And when you have thousands of machines with petabytes of RAM, those bitflips do happen. Google wasted many man-years tracking down corrupted GFS files and index shards before they finally bit the bullet and just paid for ECC.
Did they (Google) or he (Craig Silverstein) ever officially admit it on record? I did a Google search and the results that came up were all on HN. Did they at least put out a few PR pieces saying that they're using ECC memory now? Because I don't see any when searching. Admitting they made a mistake without officially saying so?
I mean, the whole "servers (or computers) might not need ECC" insanity was started entirely because of Google [1][2], with news and articles published even in the early 00s [3]. After that it spread like wildfire and became a commonly accepted fact that even Google doesn't need ECC. Just like "Apple uses custom ARM instructions to achieve their fast JS VM performance" became a "fact". (For the last time: no, they didn't.) And proponents of ECC memory have been fighting this misinformation like mad for decades, to the point of giving up and only ranting about it every now and then. [3]
[1] https://blog.codinghorror.com/building-a-computer-the-google...
https://semiengineering.com/what-designers-need-to-know-abou...
There has been a lot of debate regarding this that was summarised in this post -
On-die ECC is going to be a standard feature for DDR5. I'm not aware of any indication that anyone has implemented on-die ECC for DDR4 DRAM, and Hynix at least has made clear statements that on-die ECC is new for their DDR5 and was not present in their DDR4.
There are many various DRAMs in a server (say, for disk cache). Has Google or anyone who operates at a similar scale seen single bit errors in these components?
When America Online was buying EV6 servers as fast as DEC could produce them, they used to see about 1 double-bit error per day across their server farm, each of which would reboot the whole machine.
DRAM has only gotten worse--not better.
Bit flips (for all reasons) occur in buses, registers, caches, etc. Anything that has state can have state changed incorrectly.
This is why filesystems like ZFS exist and storage formats have pervasive checksums.
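The idea is simple enough to sketch: pair every block with a checksum at write time and verify it at read time. A minimal illustration in Python, using CRC32 purely as a stand-in for the stronger hashes real filesystems use (ZFS uses fletcher4 or SHA-256):

```python
import zlib

def store_block(data: bytes) -> bytes:
    # Prepend a CRC32 of the payload so corruption is detectable on read.
    return zlib.crc32(data).to_bytes(4, "big") + data

def load_block(blob: bytes) -> bytes:
    stored = int.from_bytes(blob[:4], "big")
    data = blob[4:]
    if zlib.crc32(data) != stored:
        raise ValueError("checksum mismatch: block is corrupt")
    return data

blob = store_block(b"hello")
assert load_block(blob) == b"hello"

# A single flipped bit in the payload is caught on read.
corrupted = blob[:5] + bytes([blob[5] ^ 0x01]) + blob[6:]
try:
    load_block(corrupted)
except ValueError:
    pass  # detected, as expected
```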
https://www.newyorker.com/magazine/2018/12/10/the-friendship...
Edit: found one with a quick search. https://nakedsecurity.sophos.com/2011/08/10/bh-2011-bit-squa...
And https://www.researchgate.net/publication/262273269_Bitsquatt...
Random bit flips happen on client machines and on routers.
If there are enough requests for a domain name, some of those requests will be subject to one of those bit-flips.
> I've never thought of defensive programming in terms of adversarial memory.
Generally this goes against software engineering principles. You don't try to eliminate the chances of failure and hope for the best. You need to create these failures constantly (within reasonable bounds) and make sure your software is able to handle them. Using ECC RAM is the opposite: you make errors so unlikely that you will generally not encounter them even at scale, but nonetheless they can still happen, and now you will be completely unprepared to deal with them, since you chose to ignore this class of errors and sweep it under the rug.
Another interesting side effect of quorum is that it also makes certain attacks more difficult to pull off, since now you have to make sure that a quorum of machines gives the same "wrong" answer for an attack to work.
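A toy sketch of the read side of that idea, assuming each replica simply reports its value (real quorum systems like Paxos or Raft are far more involved):

```python
from collections import Counter

def quorum_read(replica_values, quorum):
    # Accept a value only if at least `quorum` replicas agree on it.
    value, count = Counter(replica_values).most_common(1)[0]
    if count < quorum:
        raise RuntimeError("no quorum reached")
    return value

# A single bit-flipped (or maliciously altered) replica is outvoted.
assert quorum_read([b"x", b"x", b"y"], quorum=2) == b"x"
```

An attacker now has to corrupt a majority of replicas identically, not just one.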
Using consumer hardware and making up for its reliability with redundancy and software was not a bad idea for early Google, but it did come with an unforeseen cost. Even a thousand machines in a cosmic-ray-proof bunker will end up with memory errors that ECC would correct for free. ECC is just reducing the surface area of "potential problems".
1. Single-bitflip correction, along with Google's metrics, could help them identify algorithms they've got, customers' VMs that are causing bitflips via rowhammer, and machines which have errors regardless of workload.
2. Double bitflip detection lets Google decide if they say, want to panic at that point and take the machine out of service, and they can report on what software was running or why. Their SREs are world-class and may be able to deduce if this was a fluke (orders of magnitude less likely than a single bit flip), if a workload caused it, or if hardware caused it.
The advantage the 3 major cloud providers have is scale. If a Fortune 500 were running their own datacenters, how likely would it be that they have the same level of visibility into their workloads, the quality of SREs to diagnose, and the sheer statistical power of scale?
I sincerely hope Google is not simply silencing bitflip corrections and detections. That would be a profound waste.
There is an OS that pretty much fits the bill here. There was a show where Andrew Tanenbaum had a laptop running Minix 3 hooked up to a button that injected random changes into module code while it was running, to demonstrate its resilience to random bugs. Quite fitting that this discussion was initiated by Linus!
Although it was intended to protect against bad software I don't see why it wouldn't also go a long way in protecting the OS against bitflips. Minix 3 uses a microkernel with a "reincarnation server" which means it can automatically reload any misbehaving code not part of the core kernel on the fly (which for Minix is almost everything). This even includes disk drivers. In the case of misbehaving code there is some kind of triple redundancy mechanism much like the "quorum" you suggest, but that is where my crude understanding ends. AFAIR Userland software could in theory also benefit provided it was written in such a way to be able to continue gracefully on reloading.
The Apollo missions (or was it the Space Shuttle?) did this. They had redundant computers that would work with each other to determine the “true” answer.
It’s so easy to chalk these kinds of errors up to other issues: a little corruption here, a running program going berserk there. Could be a buggy program or a little accidental memory overwrite; a reboot will fix it.
But I ran many thousands of physical machines with petabytes of RAM. I tracked memory bit-flip errors and they were _common_. Common even with less dense memory, in thick metal enclosures surrounded by mesh, and density and shielding impact bitflips a lot.
My own experience tracking bitflips across my fleet led me to buy a Xeon laptop with ECC memory (precision 5520) and it has (anecdotally) been significantly more reliable than my desktop.
I love that AMD doesn't intentionally break ECC on its consumer desktop platforms; I upgraded to a Threadripper in 2017.
We only use Xeons on developer desktops and production machines here precisely because of ECC. It's about 1 bit flip/month/gigabyte. That's too much risk when doing something critical for a client.
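Back-of-envelope arithmetic with the rate quoted above (treat the figure itself as rough; published estimates vary widely):

```python
ram_gb = 32                 # hypothetical developer workstation
flips_per_gb_per_month = 1  # the rate quoted above
months = 12

expected_flips = ram_gb * flips_per_gb_per_month * months
assert expected_flips == 384  # hundreds of flips per year on one box
```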
I’ve always believed that, ECC aside, DRAM made intentionally with big cells would be less prone to spurious bit-flips (and that this is one of the things NASA means when they talk about “radiation hardening” a computer: sourcing memory with ungodly-large DRAM cells, willingly trading off lower memory capacity for higher per-cell level-shift activation-energy.)
If that’s true, then that would mean that the per-cell error rate would have actually been increasing over the years, as DRAM cell-size decreased, in the same way cell-size decrease and voltage-level tightening have increased error rate for flash memory. Combined with the fact that we just have N times more memory now, you’d think we’d be seeing a quadratic increase in faults compared to 40 years ago. But do we? It doesn’t seem like it.
I’ve also heard a counter-effect proposed, though: maybe there really are far more “raw” bit-flips going on — but far less of main memory is now in the causal chain for corrupting a workload than it used to be. In the 80s, on an 8-bit micro, POKEing any random address might wreck a program, since there’s only 64k addresses to POKE and most of the writable ones are in use for something critical. Today, most RAM is some sort of cache or buffer that’s going to be used once to produce some ephemeral IO effect (e.g. the compressed data for a video frame, that might decompress incorrectly, but only cause 16ms of glitchiness before the next frame comes along to paper over it); or, if it’s functional data, it’s part of a fault-tolerant component (e.g. a TCP packet, that’s going to checksum-fail when passed to the Ethernet controller and so not even be sent, causing the client to need to retry the request; or, even if accidentally checksums correctly, the server will choke on the malformed request, send an error... and the client will need to retry the request. One generic retry-on-exception handler around your net request, and you get memory fault-tolerance for free!)
If both effects are real, this would imply that regular PCs without ECC should still seem quite stable — but that it would be a far worse idea to run a non-ECC machine as a densely-packed multitenant VM hypervisor today (i.e. to tile main memory with OS kernels), than it would have been ~20 years ago when memory densities were lower. Can anyone attest to this?
(I’d just ask for actual numbers on whether per-cell per-second errors have increased over the years, but I don’t expect anyone has them.)
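The "one generic retry handler" point above can be made concrete. A minimal sketch (the function names here are made up for illustration):

```python
import time

def with_retries(fn, attempts=3, delay=0.01):
    # Retry on any exception: a request corrupted in a buffer will
    # usually fail a checksum or parse step and succeed on retry.
    for i in range(attempts):
        try:
            return fn()
        except Exception:
            if i == attempts - 1:
                raise
            time.sleep(delay)

calls = []
def flaky_request():
    calls.append(1)
    if len(calls) < 2:
        raise ConnectionError("garbled request")  # e.g. a bit-flipped buffer
    return "ok"

assert with_retries(flaky_request) == "ok"
```

Wrap every network call in something like this and a one-off memory fault in a request buffer degrades into a small latency blip instead of a failure.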
Think of the number of events that can flip a bit. If you make bits smaller, you get a modestly larger number of events in a given area capable of flipping a bit, spread across a larger number of bits in that area.
That is, it's flip event rate * memory die area, not flip event rate * number of memory bits.
In recent generations, I understand it's even been a bit paradoxical-- smaller geometries mean less of the die is actual memory bits, so you can actually end up with fewer flips from shrinking geometries.
And sure, your other effect is true: there's a whole lot fewer bitflips that "matter". Flip a bit in some framebuffer used in compositing somewhere-- and that's a lot of my memory-- and I don't care.
Otherwise, I would think that an unlikely event becoming 1000x more likely by sheer numbers would have warped your perception.
I believe that hardware reliability is mostly irrelevant, because software reliability is already far worse. It doesn't matter whether a bitflip (unlikely) or some bug (likely) causes a node to spuriously fail, what matters is that this failure is handled gracefully.
The libraries we maintain (1) are responsible for a non-trivial part of Facebook's overall compute footprint, (2) should basically never fail of their own accord, and (3) have pretty good error monitoring. So my team is operating what is effectively (among other things) a very sensitive detector for hardware failure.
And indeed we see examples all the time of blobs that fail to decompress, and usually when we dig in we find that the blob is only a single bit-flip away from a blob that decompresses successfully into a syntactically correct message. I can't share numbers, but, off the top of my head, I think it's the largest source of failures we see. It happens frequently enough that I wrote a tool to automate checking [0].
So yes. It happens. Pretty frequently, in the sense that if you're doing xillions of operations a day, a one-in-a-xillion failure happens all the time.
[0] https://github.com/facebook/zstd/tree/dev/contrib/diagnose_c...
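The core trick such a tool can use is easy to sketch: try every single-bit flip of the corrupt blob and see whether any variant decompresses cleanly (shown here with zlib rather than zstd, purely for illustration):

```python
import zlib

def find_single_bitflip_fix(blob: bytes):
    # Brute-force every single-bit flip; return the first variant
    # that decompresses cleanly, or None if there isn't one.
    for i in range(len(blob)):
        for bit in range(8):
            candidate = bytearray(blob)
            candidate[i] ^= 1 << bit
            try:
                return bytes(candidate), zlib.decompress(bytes(candidate))
            except zlib.error:
                continue
    return None

original = zlib.compress(b"some payload " * 10)
corrupted = bytearray(original)
corrupted[4] ^= 0x10  # simulate a bit flip in RAM
assert find_single_bitflip_fix(bytes(corrupted)) is not None
```

If a "random" corruption is reliably one bit-flip away from a valid blob, the hardware is the prime suspect, not the software.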
Nevertheless, anyone who uses the computer for anything else besides games or movie watching, will greatly benefit from having ECC memory, because that is the only way to learn when the memory modules become defective.
Modern memory modules have shorter lifetimes than older ones, and they frequently begin to show occasional bit errors long before breaking down completely.
Without ECC, you will become aware that a memory module is defective only when the computer crashes or no longer boots and severe data corruption in your files could have happened some months before that.
For myself, this was the most obvious reason why ECC was useful, because I was able in several cases to replace memory modules that began to have frequent correctable errors, after many years with little or no errors, without losing any precious data and without downtime.
> It doesn't matter whether a bitflip (unlikely) or some bug (likely) causes a node to spuriously fail
Except that a bitflip can go undetected. It may crash your software or system, but it also may simply leak errors into your data, which can be far more catastrophic.
I can't give my source, but it's far higher than most people think. Just pay the money.
Looks like `mcelog --client` might be a starting place? Feed that into your metrics pipeline and alert on it like anything else...
Screen, Wi-Fi, and to a much lesser extent (unless under load) the CPU are the biggest culprits for poor battery life.
Of all the things to be worried about, like OS bugs, bad hardware configuration, etc., bad memory is one of the really troubling ones. You look at the code and say "it can't get here, because this was set", but when you can't trust your memory you can't trust anything.
And as the timeline goes to infinity, you may also get one of these reports and be asked to fix it... good luck.
See "Your computer is broken". They essentially inserted a stress test into the game that verified if the hardware was still doing calculations correctly, and if not, inform the user.
It is incomprehensible that there are still NAS devices being sold without ECC support.
Synology took a step in the right direction to offer prosumer devices with ECC but it is not really advertised as such. It is actually difficult to find which do have ECC and which ones don't.
I just looked it up, because if that were true it would have been news to me. Synology has been known to be stingy with hardware specs, but none of what I'd call prosumer (the Plus series) have ECC memory by default. And there are the "Value" and "J" series below that.
Edit: only two models from the new xx21 series, using the AMD Ryzen V, have ECC memory by default.
- Edit -
Also, bit flips in non-ECC memory are _the_ cause of the "bitrot" phenomenon. That is when you write X to a storage device but get Y when you read it back. A common explanation is that the corruption happens _at rest_. However, all drives from the last 30+ years have FEC support, so in reality the only way bit rot can happen is if the data is damaged _in transit_, while in RAM, on the way to/from the storage media.
So, if you're ever deciding whether to get ECC RAM: get it. It's very much worth it.
Problems can definitely happen in the IO controller, RAID controller, cable, and disk controller. AFAIK all of these were seen in practice and motivated the existence of ZFS. One of its creators' strongest convictions was that drives are universally lying bastards and should not be trusted any further than they can be thrown.
https://www.anandtech.com/show/15912/ddr5-specification-rele...
https://media-www.micron.com/-/media/client/global/documents...
When the value add feature becomes a necessity, it’s not a value add any more.
It seems redundant to have every module come with its own checking hardware.
For the memory controller, parity/ECC/chipkill/RAIM usually just involves adding additional memory planes to store correction data. I believe the rare exceptions are fully buffered memories, where you effectively have a separate memory controller on each module (or add-in card with DIMMs).
And now you have 8 bits of ecc per 32 data versus older DDR having 8 bits of ecc per 64 data. Hence the cost for dimm-wide ecc is going up.
But these days with the RAM density being so high and bitflipping attacks being more than a theoretical threat it seems like there's really no good reason not to switch to ECC everywhere.
Because ECC means Error Correcting Code, by definition, any board that claims ECC support must actually correct the errors. The ECC codes used now, with 8 extra bits for each 64 data bits, correct any 1-bit error and detect any 2-bit errors.
Very old computers (25 years old, or more) used parity instead of ECC and they just detected any 1-bit error (and any errors with an odd number of flipped bits), without being able to correct the errors.
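Parity is just the XOR of all the bits. A small sketch showing why it detects single flips but is blind to double flips:

```python
def parity(word: int) -> int:
    # Even-parity bit over a 64-bit word: XOR of all its bits,
    # folded down with shifts.
    word ^= word >> 32
    word ^= word >> 16
    word ^= word >> 8
    word ^= word >> 4
    word ^= word >> 2
    word ^= word >> 1
    return word & 1

stored = 0xDEADBEEF
p = parity(stored)

assert parity(stored ^ (1 << 7)) != p             # 1 flip: detected
assert parity(stored ^ (1 << 7) ^ (1 << 3)) == p  # 2 flips: missed
```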
It's faster too.
But I don't know how relevant these metrics from 2009 are. Did memory get better or worse for bit flips compared to 2009?
When I first tried to replicate the row hammer attack I was not getting any results. Turns out I was doing this on ECC. On non ECC memory the same test easily replicated the row hammer attack.
Its.
There, I finally corrected Linus Torvalds in something. :))
Now the point about internally doing ECC is an interesting one, could be a way out of this mess. And apparently ECC is more available in AMD land
The first time I wrote "your" instead of "you're" in English I thought it was quite a milestone!
Yes it is. The problem is they dont really advertise it. I'm not certain but it might even be standard on AMD chips, but if they dont say so and board makers are also unclear, who knows...
(Note: EdDSA is still much much better than ECDSA, most notably because it's easier to implement correctly.)
Support was only interested if their built-in memory tester showed errors, which it wouldn't; even on its most thorough setting, it would only run for ~3 hours. IIRC, the BMC was logging "correctable memory errors", but I may be misremembering that.
"We've run this test on every server we've gotten from you, including several others that were exactly the same config as this, this is the only one that's ever thrown errors". Usually support is really great, but they really didn't care in this case.
We finally contacted sales. "Uh, how long do we have to return this server for a refund?" All of a sudden support was willing to ship us out a replacement memory module (memtest86 identified which slot was having the problem), which resolved the problem.
They were all too willing to have us go to production relying on ECC to handle the memory error.
Good call in not accepting this. Even ignoring the possibility you have a double-bit error that causes a crash, or a triple-bit error that maybe can't be detected, frequent ECC errors are problematic. I've encountered machines that consistently ran my software horribly slowly. I don't remember specifics, but let's say at least 100X latency of other machines for similar operations. When I dug in, I found these machines had a huge amount of correctable memory errors. The correction apparently degrades performance significantly. I'm not sure exactly why, but I guess there's an MCE trap to report the memory error, and perhaps that path is slow.
But you’ve got it backwards about the incentives. A manufacturer has less incentive to deliberately ship a defective part in the case of ECC modules. If the modules consistently log ECC errors, they can easily be identified and returned under warranty to the manufacturer. A consumer is much less likely to identify an intermittent problem with a non-ECC part.
Not if you're using a typical 72-bit SECDED code[0].
You have two error indicators: a summary parity bit (even number of errors: 0, 2, etc. vs odd number of errors: 1, etc.), and an error index: 0 for no errors, or the bitwise XOR of the locations of each bit error.
For a triple error at bits a, b, and c, you'll have a summary parity of 1 (odd number of errors, assumed to be 1), and an error index of a^b^c, in the range 0..127, of which 0..71 [1] (56.25%, a clear albeit not overwhelming majority) will correspond to legitimate single-bit errors.
0: https://en.wikipedia.org/wiki/Hamming_code#Hamming_codes_wit...
1: or 72 out of 128 anyway; the active bits might not all be assigned contiguous indexes starting from zero, but it doesn't change the probability and it's simpler to analyse if summary is bit 0 and index bit i is substrate bit 2^i.
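The arithmetic in that comment is easy to model. A toy version (positions only, not a full Hamming implementation):

```python
WORD_BITS = 72  # 64 data bits + 8 check bits, as in DIMM-wide SECDED

def decode_syndrome(flipped_positions):
    # Summary parity distinguishes odd from even error counts;
    # the error index is the XOR of all flipped positions.
    summary = len(flipped_positions) % 2
    index = 0
    for pos in flipped_positions:
        index ^= pos
    return summary, index

# Triple error at bits 3, 17 and 40:
summary, index = decode_syndrome([3, 17, 40])
assert (summary, index) == (1, 58)
# 58 < 72, so the decoder sees what looks like a single-bit error at
# position 58 and "corrects" it, turning a 3-bit error into a 4-bit one.
```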
However I don't remember if there are provisions for ECC checking in case there are some dedicated refresh commands. I hope so, but I'm not sure.
https://www.tomshardware.com/reviews/ecc-memory-ram-glossary...
It would be a lot slower than real ECC, but it could be used just for operations that are especially vulnerable to bit flips. It would also not know for certain whether the data segment or the segment holding the checksum was corrupted, beyond their relative sizes (the checksum is much smaller, so a bit flip in its memory region is less likely).
There's also the obvious tactic of just storing every logical 64-bit word as 128 bits of physical memory, which gives you room for all kinds of crap[1], at the expense of halving your effective memory and memory bandwidth.
0: This is extremely cheap since you're loading a 64- vs 128-bit value, with no extra round trip time and still fits in a cache line, so you're likely just paying extra memory use from larger page tables.
1: Offhand, I think you could fit triple or even quadruple error correction into that kind of space (there's room for eight layers of SECDED, but I don't remember how well bit-level ECC scales).
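The crudest version of that 128-bits-per-word idea is detection by redundancy: store each word alongside its bitwise complement and check consistency on read. A sketch (real schemes would spend those bits on actual correction):

```python
MASK = (1 << 64) - 1

def encode(word: int):
    # Store the word with its bitwise complement as the second half.
    return word, ~word & MASK

def is_consistent(word: int, comp: int) -> bool:
    # Any single bit flip in either half breaks the invariant.
    return (word ^ comp) == MASK

w, c = encode(0x0123456789ABCDEF)
assert is_consistent(w, c)
assert not is_consistent(w ^ (1 << 13), c)  # flip detected
```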
Bit flips happen and are real. I really wish ECC was plentiful and not brutally expensive!
Finding a memory upgrade seems difficult though.
It is filled to the gunwales with ECC RAM.
Cost him the equivalent of $7k or so. Eeek.
0: https://www.lenovo.com/us/en/laptops/thinkpad/thinkpad-p/Thi...
Xeon with ECC are not that overpriced compared with similar Core without. Likewise, RAM sticks with ECC are cheap to produce (basically just one more chip to populate per side per module). Likewise soldered RAM would simply add maybe $10 or $20 of extra chips.
I'm guessing you won't find any.
Now back to ECC. I'll probably be corrected, but I don't think ECC gains more than two orders of magnitude, so we still need incredibly reliable RAM. If we move to ECC RAM by default everywhere, aren't we simply going to get less reliable RAM in the end?
So I'd say ECC is not only important but insanely impactful. There's a reason why many organizations don't even want to hear about getting rigs with non-ECC memory.
I understand altitude is roughly proportional to cosmic ray exposure, and the number of bits multiplies the probability of an error. I'm presuming there is also an inherent error rate to DRAM separate from the environment. But what are those numbers?
"33 to 600 days to get a 96% chance of getting a bit error." Still, it seems way too high. I guess anyone with ECC RAM could confirm that they are getting those sort of recovered error rates?
My point is, when you say there is a "96% chance of having an error in THREE DAYS", one would EXPECT to be having issues like.. all the time? So I'm not disagreeing with you, but with the amount of non-ECC machines all over the world and how insanely stable modern machines are, it still seems like a very low risk.
Now of course I agree that if you want to take every precaution, go ECC; but simple observation proves that this "problem" can't be as bad as the numbers say.
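How alarming the headline numbers sound depends entirely on the assumed per-bit error rate, which is plausibly why casual observation and the quoted figures disagree. A quick Poisson-style estimate (the rates below are assumptions; published measurements span orders of magnitude):

```python
import math

bits = 8 * 2**30 * 8       # 8 GiB of RAM, in bits
rate_per_bit_hour = 1e-13  # assumed; real figures vary enormously
hours = 3 * 24             # three days

# Probability of at least one bit error in the window.
p_error = 1 - math.exp(-bits * rate_per_bit_hour * hours)
assert 0.3 < p_error < 0.5  # roughly 40% with these assumptions

# Drop the assumed rate by 100x and the same machine looks rock solid.
p_low = 1 - math.exp(-bits * 1e-15 * hours)
assert p_low < 0.01
```

Two machines that "feel" equally stable can differ by a couple of orders of magnitude in underlying flip rate, which is exactly what ECC logging would reveal.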
As far as we know, our computer has never had an undetected error. -- Weisert
If it needs ECC memory to do that, then fit it with ECC memory. If there are other ways to achieve that (for example deeper dram cells to be more robust to cosmic rays) that's fine too.
Just meet the reliability spec - I don't care how.
That's why I've always been on the fence with this ECC thing. For servers it's vital because you need stability and security.
For desktops, I think that for a long time it was fine without ECC. If I have to choose between having, say, 30% more RAM or avoiding a potential crash once a year, I'll probably take the additional RAM.
The problem is that now these problem can be exploited by malicious code instead of just merely happening because of cosmic rays. That's the main argument in favour of ECC IMO, the rest is just a tradeoff to consider.
Basically you can register domains that differ from popular domains by a single bit and start getting email and such for those domains.
If I recall correctly the example given was a variation of microsoft.com
All because so much equipment doesn't use ECC
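Generating the candidate "bitsquat" domains is only a few lines. A sketch:

```python
import string

HOST_CHARS = set(string.ascii_lowercase + string.digits + "-")

def bitsquats(domain: str):
    # All single-bit-flip variants that are still plausible hostnames.
    variants = set()
    for i, ch in enumerate(domain):
        if ch == ".":
            continue
        for bit in range(8):
            flipped = chr(ord(ch) ^ (1 << bit))
            if flipped in HOST_CHARS and flipped != ch:
                variants.add(domain[:i] + flipped + domain[i + 1:])
    return variants

# 'o' ^ 0x01 == 'n', 'r' ^ 0x40 == '2', etc.
squats = bitsquats("microsoft.com")
assert "micrnsoft.com" in squats
assert "mic2osoft.com" in squats
```

Register a handful of these and a small fraction of the world's non-ECC machines will eventually send traffic your way.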
At Google even with ECC everywhere there wasn't enough systematic error detection and correction to prevent the global database of monitoring metrics from filling up with garbage. /rpc/server/count was supposed to exist but also in there would be /lpc/server/count and /rpc/sdrver/count and every other thing. Reminded me daily of the terrors of flipped bits.
I run some large ML models on my home PC and I get NaNs and some out-of-range floats every month or so. I have spent hours debugging, but doing the same computation with the same random seeds does not recreate the problem.
How about GPUs and their GDDR SDRAM? Do they have parity bits?
[1]: https://twitter.com/catfish_man/status/1335373029245775872?l...
If you think it doesn't matter: how do you know? If you don't run with ECC memory, you'll never know if memory was corrupted (and recovered).
That blue screen, that sudden reboot, that program crashing. That corrupted picture of your kid.
Who knows.
I'll tell you who knows: every goddamn sysadmin (or the modern equivalent) can tell you how often they get ECC errors. Even at a small scale you'll encounter them. I have, on servers and even on a SAN storage controller, for crying out loud.
If you care about your data, use ECC memory in your computers.
But.. in all my time operating servers over 3 decades, it's always been bad drivers, bad code and problematic hardware that's caused most of my headaches.
Have I seen ECC error corrections in logs? Yeah. I don't advocate against it, but I've found that for most people you design around multiple failure scenarios more than you design around preventing specific ones.
Take the average web app - you run it on 10 commodity systems and distribute the load.. if one crashes, so what. Chances are, a node will crash for many more reasons other than memory issues.
If you have an app that requires massive amounts of RAM, or you do put all of your eggs in one basket, then ECC makes sense...
I just know i like going horizontal and I avoid vertical monoliths.
It’s a tradeoff between money/performance and the frequency of crashes, corruption etc.
Bit rot is just one of many threats to my data. Backups take care of that as well as other threats like theft, fire, accidental deletion.
This is similar to my reasoning around the recent side channel attacks on intel CPUs. If I had a choice I’d like to run with max performance without the security fixes even though it would be less secure. Not because I don’t care about security but because 1% or 5% perf is a lot and I’d rather simply avoid doing anything security critical on the machine entirely than take that hit.
I would have bought computers when I "wanted one". Now I buy them when I need one. Because buying a non-ECC computer just feels like buying a defective product.
In the last 10 years I would have bought TWICE as many computers if they hadn't segmented their market.
Fuck Intel. I sense that Linus censored himself in this post and, like me, is even angrier than the text implies.
Price is not nice though.
I am trying to get a laptop with dual NVMe (for ZFS) and ECC RAM. I can't get that, at all - even without the other fancy things I would like such as a 4k OLED with pen/touchscreen.
In 2020, even the Dell XPS stopped shipping OLED (goodbye dear 7390!)
I will gladly give my money to anyone who sells AMD laptop with ECC. Hopefully, it will show there's demand for "high end yet non bulky laptops"
I hope AMD will create a better market for the ECC laptop memory (right now it's hard to find + expensive).
If in doubt, get ECC. Do your own research on how it works and why. This post won’t explain it, just will blame Intel (probably rightfully so).
> We have decades of odd random kernel oopses that could never be explained and were likely due to bad memory. And if it causes a kernel oops, I can guarantee that there are several orders of magnitude more cases where it just caused a bit-flip that just never ended up being so critical.
It might be false, but I think it's a reasonable assumption.
The industry has convinced the average consumer-hardware user that PPA (power, performance, area) is all that needs to get better with generational improvements. Here's hoping that the security and reliability concerns that have come to light in the recent past change this.
Naively, I can understand why error reporting has dependencies on other parts of the system, but it would seem possible for error correction to work transparently.
Modern CPUs have integrated memory controllers, so that's why the CPU needs to support it.
Correction without reporting isn't great; anyway, you need a reporting mechanism for uncorrectable errors, or all you've done is ensure any memory errors you do experience are worse.
This is in line with all technical parameters of DRAM: everything must be as cheap as possible, and all the difficult parts are moved to the memory controller.
Which is the right thing to do, because you can share one memory controller with multiple DRAM chips.
I would think the only guaranteed solutions to Rowhammer are actually cryptographic digests and/or guard pages.
[1] https://www.zdnet.com/article/rowhammer-attacks-can-now-bypa...
However, flipping three bits simultaneously isn't trivial, and attempts that flip fewer bits will be detected and logged.
https://eclecticlight.co/2020/12/09/what-happens-when-an-m1-...
I have spent 36 years fielding embedded devices in core network (D1/E1, SONET, ROADM/MPLS, Cellular basestation) and I will tell you that large ECC covered memory arrays always show small numbers of correctable error events over the course of a year. I have seen, over the course of my career, exactly one controller card replaced early in the field, because it started throwing excessive recoverable ECC events over time, until it hit a threshold of 10x the average of a typical board. On the order of ten recoverable ECC events per month instead of one event per month. I have never observed a logged non-correctable ECC event in the field. In the lab, yes, but never in fielded equipment.
If you are fine with your PC experiencing one or two bits flipped in memory every month, then you really don't need ECC. That is the question you need to answer.
For mission critical systems? ECC is a requirement.
The phrase that strikes me is "horribly bad market segmentation". I agree 100%.
Remember when the Pentium/pro/2/3 could operate in single and dual socket configurations with ECC? The same CPU that plugged into your low end consumer board could also plug into a high end server/workstation board. All you needed was the right motherboard.
[1] https://web.archive.org/web/*/https://www.realworldtech.com/...
I am not talking about servers dealing with critical data.
Suppose that I maintain a repository (documents, audio and video), one copy in a ZFS-ECC system and one in an ext4-nonECC system.
Would I notice a difference between these two copies after 5-10 years?
That tells us if ECC matters for most people.
The most likely impact (other than nothing, if bits are flipped in unused memory) is program crashes or system lock-ups for no apparent reason.
https://www.wnycstudios.org/podcasts/radiolab/articles/bit-f...
I specifically was looking for bang for buck, low(er) wattage and ECC.
Off-topic, but I wonder if he trawls that site regularly. And then I wonder: is he here also? :)
Does ECC memory support dual channel??
ECC memory = memory with Error-Correcting Code
ECC encryption = Elliptic Curve Cryptography
If you care about ECC, you pay for Xeon. Majority of consumers don't run critical applications on their devices, so they are happy with a cheap device that may crash once in a while.
AMD is only changing the game because they are trying to undercut Intel. They have been putting pro features into all of their CPUs including over-clocking, extra PCIE lanes and ECC.
Honestly, what is the point of bullet-proof hardware when software reliability (at least on consumer devices) has gone down to two nines?
> AMD is only changing the game because they are trying to undercut Intel. They have been putting pro features into all of their CPUs including over-clocking, extra PCIE lanes and ECC.
You are correct to call them a corporation. AMD is not your friend, but they are the good actor in this fight.
It might present itself as a one-pixel colour difference, but it could be more damaging (incorrect finances in accounting software, for example). Software trusts memory; but memory can lie.
That’s dangerous.