Correlated Failures in Storage Systems (opens in new tab)

(blog.dshr.org)

41 pointslogicalstack5y ago6 comments

6 comments

We thought a lot about correlated storage failures - especially with regard to SSDs - as we rebuilt our infrastructure circa 2012/2013.

In the end, the low hanging fruit - or, the biggest actionable takeaway - was that when we build boot mirrors out of SSDs, they should not be identical SSDs.

This was a hunch I had, personally, and I think experience and, now, results like these, bear it out.

Consider: an SSD can fail in a logical way. Not because of physical stress or mechanical wear, which has all kinds of random noise in the results - but due to a particular sequence of usage. If the two SSDs are mirrored, it is possible that they receive identical usage sequences over their lifetime.

... which means they can fail identically - perhaps simultaneously.

Nothing fancy or interesting about the solution: all rsync.net storage arrays have boot mirrors that mix either the current generation Intel SSD with the previous generation Intel SSD or mix an Intel SSD with a Samsung SSD.

vidarh5y ago

I've seen highly correlated failures on regular hard-drives too. We had a large array of IBM DeathStars that failed approximately one every couple of weeks until the entire array had been replaced, for example.

But nothing like SSDs.

They absolutely can and do fail near simultaneously, but it doesn't even need to be with identical use. I've had multiple SSDs from the same batch fail the same week despite being in different arrays hosting different data, albeit similar usage patterns. If you're unlucky and get a bad firmware revision, suddenly you may face a cascade of failing drives before you have time to upgrade (I particularly remember a bad time dealing with failing OCZ SSDs...)

It's terrifying. My home NAS has four different brands for that reason. And of course I never trust a single array.

Dealing with storage has done more than anything else to make me worry about hardware risks... I really don't envy you running a storage service...

EDIT: IBM DeathStar refers to this, btw: https://en.m.wikipedia.org/wiki/Deskstar - see particularly the images. It was grim.

renox5y ago

> I've seen highly correlated failures on regular hard-drives too.

Yes, very mysterious too, until you discover that when someone made a big hole in the wall of the room containing the HDD storage bay and that the HDDs are covered by dust!

It was a looong time ago but I think I'll never forget opening the door and looking at the mess..

bluetwo5y ago

So the same reason you don't marry your cousin is the same reason why you don't backup your primary data to a second drive from the same batch: It amplifies the defects.

waterhouse5y ago

This is a good thing to do.

For higher-hanging fruit, if you don't have enough different models of drives to make them all unique, then you might still try to protect against a run of manufacturing defects. Suppose there was a slightly defective machine making a series of drives with a certain problem. If you do things like buy drives in different groups from different middlemen or at different times, and either take one from each group or put them into a big pool and grab them at random, then that decreases the likelihood of having multiple drives from a single defective run end up in the same array.

Psychlist5y ago

Also: different NAS hosts, RAID cards, etc. Those have correlated failure modes too.

My personal backup strategy of buying a different backup drive every time seems wiser the more I learn.

At work we have two different NAS setups, each full of a different brand of near-identical drives. But what we have been doing is buying a few new drives every quarter and rotating them in to the NAS boxes. So they're all WD 6TB Black or whatever, but of 12 drives we now have 4 original ones, then a pair 3 months newer than that, a pair 6 months newer and so on. The "old" drives go into random stuff round the office because we employ engineers and they seem to all like to have their own little 2-4 drive NAS boxes for "important stuff" (which is in many ways fine, we just have to regularly coach them on making sure the stuff they're actually working on is on our NAS where it gets backed up. We host a gitlab instance, for example, so their code and project docs are in that).

j / k navigate · click thread line to collapse