Another fun one was the quad redundant power supply with all four plugs going into the same power strip.
And there was the power supply that blew out every drive in the box simultaneously. I suppose it's not any worse than a lightning strike, except that the onsite tech assumed it was a cooling failure: he replaced the dusty fans and all the drives, thus destroying an entire second set of drives upon powerup (and fans, I would guess).
Then there was the poison drive tray that bent the backplane pins of every slot you jammed it into. That turned into a huge, expensive disaster.
Which also makes for very nice RAID failures, like this one that has happened to me on an HP controller:
A drive fails because of some SCSI electronics problem and when you replace it, the controller gives it a different SCSI ID. Now, the controller maps RAID arrays to drives, and it is now impossible to add the replacement drive to the degraded array because SCSI IDs in these controllers aren't user defined and the controller doesn't allow the degraded array to be modified.
And since the controller has now happily written its configuration onto the drives, shuffling the drives around to try to force it to give up its internal configuration gets you nowhere.
Oh, and the controller is an onboard controller, so you can't just replace it with another one (which would also read the configuration on disk and put itself in the same stupid state, I suppose).
I'd add that current "Advanced Format" drives are tremendously better than most older drives. If your drive is an older 1 or 2 TB model with 512-byte sectors, use it only for backups or some other menial duty with unimportant data.
If you're willing to take a capacity hit for improved write performance, RAID-1+0 is great, though you can only survive two disk failures if they land in different mirror pairs.
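A quick back-of-envelope calculation (a sketch, assuming independent failures and a plain stripe of mirrored pairs) shows why larger RAID-1+0 sets handle a second failure better: once one disk has died, only its mirror partner among the remaining drives is fatal.

```shell
# With n mirrored pairs (2n disks), after the first disk fails only the
# dead disk's partner is fatal: 1 out of the remaining 2n-1 drives.
for pairs in 2 3 4 6; do
    awk -v n="$pairs" 'BEGIN {
        printf "%d pairs (%2d disks): P(second failure is fatal) = 1/%d = %4.1f%%\n",
               n, 2 * n, 2 * n - 1, 100 / (2 * n - 1)
    }'
done
```

So a 4-disk array loses everything on one in three second failures, while a 12-disk array is down to roughly one in eleven.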
You should also not treat RAID as infallible; if the data is important, it should be mirrored in multiple locations.
Right. Because it bears repeating still: RAID is not for backup, RAID is for high availability.
Also, only one hot spare is in the set. Another cost saving.
Yet another decision: RAID5 across 7 disks plus a hot spare, instead of, say, across 6 or 5 disks. That's two more chances for a disk to go bust and have to be rebuilt from parity.
What if the disks are OK but the server host adapter card gets fried? Or the cable between the server and array? Some disk arrays allow for redundant access to the array, and some OS's can handle that failover.
Before I read the article, I thought it might discuss heat. Excessive heat is usually the cause when disk arrays start melting down one after another. Usually the meltdown happens in an on-site server closet/room which was never properly set up for running servers 24/7. The straw that breaks the camel's back is usually a combination of more equipment being added and hot summer days. Then portable ACs are purchased to temporarily mitigate things, but if their condensation reservoirs are not regularly emptied, they stop helping. This situation occurs more often than you would imagine; luckily, I have not been the one who had to deal with it every time I have seen it (although sometimes I have). Usually the servers involved are non-production ones which didn't make the cut to go into the data center.
The heat problem happens in data centers as well, believe it or not. A cheap thermometer is worth buying if you sense too much heat around your servers. Usually the problem there is less severe: the general data center temperature is just a few degrees higher than it should be, but that still leads to more equipment failures.
"Overall our experiments can confirm previously reported temperature effects only for the high end of our temperature range and especially for older drives. In the lower and middle temperature ranges, higher temperatures are not associated with higher failure rates. This is a fairly surprising result, which could indicate that datacenter or server designers have more freedom than previously thought when setting operating temperatures for equipment that contains disk drives. We can conclude that at moderate temperature ranges it is likely that there are other effects which affect failure rates much more strongly than temperatures do."
I'm a fan of the temperature@lert USB sensors; $130 gets you peace of mind: http://www.temperaturealert.com/
The mirror drives that also have bad sectors? Then you go even longer without noticing you have problems.
About a month ago, I had 4 drives scattered around the house, each in its own enclosure, and I wanted to consolidate them into one unit. Money was an issue, so I wanted to recycle as many of them as possible instead of buying new ones. A Synology NAS along with a single extra drive allowed me near-optimal use of space with 1-drive redundancy. Of course, I have weekly backups to an external drive, so even if the array fails during a drive swap, I'll still have all my important files.
Any other solution would either require me to buy more drives (a significant expense at $100+ a pop), sacrifice redundancy, or build my own NAS with ZFS (which would have significant administration overhead, cost more, and be larger than my Synology unit).
Synology's devices support automatic backup to S3; use it.
Desktop drives will drop out of RAID arrays frequently, so if you go that route you have to use RAID6. If a disk drops into deep error-recovery mode after hitting a physical error, it won't respond to the RAID controller fast enough and will be considered a dead drive. It will subsequently be re-detected, and then the array has to be rebuilt.
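Part of this is the drive's internal error-recovery timeout. Many drives expose SCT Error Recovery Control, which smartctl can query and set so the drive gives up on a bad sector quickly and reports the error instead of stalling long enough to get kicked from the array. A sketch (the device name is just an example, and the setting typically resets on power cycle):

```shell
# Check whether the drive supports SCT Error Recovery Control
smartctl -l scterc /dev/sda

# Cap read and write error recovery at 7.0 seconds (units of 100 ms),
# so the drive reports an unreadable sector before the RAID layer
# times out and drops the whole disk from the array.
smartctl -l scterc,70,70 /dev/sda
```

Many cheap desktop drives don't support SCT ERC at all, which is exactly why they behave so badly in arrays.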
If you're using low capacity, enterprise class SAS drives, mostly, yes. However when using large capacity SATA drives, it most definitely isn't.
SATA drives (even "enterprise" SATA drives) have an official unrecoverable error rate of 1/10^14. From my experience, the truth is more like 1/10^13.
10 TB is about 8×10^13 bits. At an error rate of 1/10^13, every full read of a 10 TB array should therefore produce several unrecoverable bit errors on average; even at the official 1/10^14 rate, the expected count is close to one. When rebuilding a 10 TB RAID-5 array (just a few 3 or 4 TB drives), that means you're quite likely to hit an unrecoverable read error that prevents the rebuild from ever completing without corruption.
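To put rough numbers on it (a sketch assuming independent bit errors, which is optimistic): the chance of reading N bits cleanly is (1 - rate)^N ≈ e^(-rate·N).

```shell
# Expected unrecoverable errors, and odds of a clean 10 TB read,
# at the observed (1e-13) and official (1e-14) URE rates.
awk 'BEGIN {
    bits = 10 * 8e12                      # 10 TB = 8e13 bits
    for (e = 13; e <= 14; e++) {
        rate = 1 / (10 ^ e)
        printf "URE rate 1e-%d: expected errors = %.1f, P(clean 10 TB read) = %.1f%%\n",
               e, rate * bits, 100 * exp(-rate * bits)
    }
}'
```

Even at the manufacturer's quoted rate, a full read of a 10 TB array completes cleanly less than half the time; at 1/10^13 it essentially never does.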
You might want to look at this... http://hardforum.com/showthread.php?t=1285254
btw, I am currently working on recovering an 8-bay DroboPro that's in a reboot loop...
RAID-6 offers different compromises relative to RAID-5 (for one, twice the parity space), so it isn't quite like one is the successor of the other. And once you're talking about multiple disk failures, you're at the existential point where you should probably be talking about whole array failures (e.g. your controller has quietly been writing junk for the last hour), and how to deal with that scenario.
Given the current price of hard drives, I don't get how "twice the parity space" can even matter. Furthermore, modern RAID controllers perform almost exactly the same using RAID-5 or RAID-6 (verified on most 3Ware, LSI, Adaptec and Areca controllers).
So yes, RAID-6 definitely is RAID-5's successor.
> how to deal with that scenario.
RAID is not an alternative to backup and never was. You deal with that scenario through proper backup or replication.
As the author realizes, hardware RAID, or naive software RAID, is becoming more and more useless given the size of volumes and the bit densities (and thus error rates) of those drives.
The only solution to this is a proper file system and volume manager that can proactively discover bit rot and give you time to do something about it. At the moment, the only real solution is ZFS.
But there's a word beginning with "O" and ending with "racle": they are so focused on the short-term buck that they are massacring their potential revenues with their short-sighted approach of keeping Solaris out of everyone's hands.
Answering my own question: http://git.kernel.org/?p=linux/kernel/git/mason/btrfs-progs....
... and Valerie Aurora's less active on that than I'd thought. Hrm.
I hear TRIM support for ZFS is now in the HEAD branch and in the works for FreeBSD 10.
Which one: the FUSE one, or the native ZFS on Linux under CDDL?
Durability is like a diamond: it is forever.
Took the disks out to find that they had sequential serial numbers.
Called vendor for replacement only to have them tell me that they had issues with that batch, yet did not make any attempt to inform me.
Spent the day restoring from tape backup.
TLDR: If you buy a pre-built server check that the disks aren't all from the same batch.
It used to be worse: all the drives in a RAID setup had to have exactly the same specifications or the thing wouldn't work, which pretty much guaranteed near-simultaneous failure of multiple drives. Even today, with somewhat more flexible software RAID setups, it's still a problem.
At a place I used to work we used to joke that a drive failure warning from a RAID controller was nothing more than a signal to get out the backup tapes and start building a new server.
A good RAID controller won't let a drive with bad sectors continue to operate silently in an array. Once an unreadable sector is detected, the drive is failed immediately, period.
The problem is in the detection, but good RAID controllers "scrub" the whole array periodically. If yours doesn't, or if you are paranoid like me, the same can be accomplished by having "smartd" initiate a long SMART self-test on the drives every week.
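For reference, a smartd.conf line that does this (path and device name are examples; see `man smartd.conf` for the scheduling regex):

```shell
# /etc/smartd.conf
# -a                        monitor all SMART attributes and log changes
# -s S/../.././02           short self-test every day at 02:00
# -s L/../../7/03           long (full surface) self-test every Sunday at 03:00
# -m root                   mail warnings to root
/dev/sda -a -s (S/../.././02|L/../../7/03) -m root
```

The long test forces a full surface read, which is what flushes out latent bad sectors before a rebuild needs them.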
Good controllers will even fail drives when a transient error happens, for example one which triggers a bad-block reallocation by the drive. This is what leads some people to "fix" failed drives by taking them out and putting them back in. After a rebuild the drive will operate normally without any errors, but you are putting yourself at serious risk of losing the array if another drive happens to fail during the rebuild, so DON'T do this.
Some others will react differently to these transient errors. EMC arrays, for instance, will copy a suspicious drive to the hot-spare and call home for a replacement. This is much faster than a full rebuild, but also much safer because it doesn't increase the risk of a second drive failing while doing it.
Oh, and did I mention that cheap drives lie?
Avoid using desktop drives for important data on production servers, even in a RAID configuration, unless you have some kind of replicated storage layer above the RAID layer (meaning you can afford to recover one node from backup, for speed, and resync it with the master to make it current).
I also found that higher-end drives lie: I used nearline SAS drives that failed easily and often, and standard SATA drives that were more resilient. It depends on the vendor and make. It may also depend on the batch, but I never found proof of that in my work.
A bad block reallocation can be seen as a transient error from the controller's perspective, but it isn't silent provided the drive doesn't lie about it (and one would expect that a particular storage system vendor doesn't choose - and brand - drives that lie to their own controllers).
The storage system may ignore medium errors that force a repeated read (below a certain threshold), but they shouldn't ignore a medium error where the bad sector reallocation count increases afterwards (which is just another medium error threshold being hit, this time by the drive itself).
I'm not saying that higher-end drives are more or less reliable. Given that most standard SATA errors go undetected for longer, one could even argue that higher-end drives merely seem to fail much more frequently... I've had more FC drives replaced in a single EMC storage array than in all the rest of the servers (which have a mix of internal 2.5in SAS and older 3.5in SCSI-320 drives), and we certainly replace more drives in servers than in desktops.
But that's another topic entirely.
Bottom line is that reliably storing data is more complicated than just writing it on to a disk.
The author is also wrong when saying "a non-recoverable read error [is] a function of disk electronics not a bad block". An NRE can happen for different reasons; one of them is when the (data and error-correction) bits in the block get corrupted in a way that prevents the error-correction logic from detecting the error. So the block is technically bad, just not bad enough to cause the drive logic to declare a read failure.
I have actually used 'defective' enterprise disks in consumer systems for years after they were labeled defective by storage system vendors. About a decade ago, I used to buy such defective enterprise disks in bulk at auction from server and storage manufacturers and sold them as refurbished disks to consumers after testing.
I really don't understand this skewed perception of consumer- vs enterprise-grade harddrives. Do you believe that enterprise CPUs are more reliable than consumer CPUs? How about enterprise NICs vs consumer NICs?
Consumer-grade drives are sold in volumes so much larger than enterprise-grade drives that vendors have strong incentives to make them as reliable as possible. I would even say they have incentives to make them more reliable than enterprise-grade drives, because a single percentage point improvement in reliability will drastically reduce the costs associated with warranty claims and repairs.
My own experience confirms the CMU study. I have worked at 2 companies, each selling about 2-5 thousand drives as part of appliances, to customers across the world. One company was using SCSI drives, the other IDE/SATA. And the replacement rates were similar.
I can see your point that the usage being different could invalidate the CMU findings about consumer vs. enterprise drive reliability, but I don't personally believe it explains them. The CMU study, plus my anecdotal evidence on 2-5 thousand drives, plus the fact that no study has ever shown data suggesting enterprise drives are more reliable, makes me think that they are not.
[1] http://static.googleusercontent.com/external_content/untrust...
One outfit I worked at decided to stick a brand new APC UPS in the bottom of the rack, as it was in their office. It promptly caught fire and burned out the entire rack. The fire protection system did fuck all as well, other than scare the shit out of the staff. Scraping molten cat5 cables off with a paint scraper was not fun.
Fortunately it was all covered by the DR procedure. Tip: write one and test it. That's more important than anything.
Basically, you keep multiple copies of the same data across different clusters of hardware. If a drive or two (or ten) go bad, you just replace them; there is no rebuild time. Sure, it costs some disk space to keep n copies of the data, but drives are only getting cheaper, and de-duplication schemes are being developed to help with this. It's not like RAID-6 is super efficient either.
Just my two cents...
for raid in /sys/block/md*/md/sync_action; do
    echo "check" > ${raid}
done

does that fix the issue? i run that once a week. i thought i was ok. am i not? if i am, isn't this old news?

The open-source version of it, the BSD version, is only up to v28. It seems that after that, Oracle is no longer putting out updates as open source, and what happens then? Disparity between the Oracle version and the BSD version? Are features still being developed? Most of the limitations listed in the wiki haven't changed at all over the past years and are still listed as under development.
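It helps, but only if you also look at the result: after a "check" pass completes, md records how many inconsistent sectors it found. A small sketch (the overridable sysfs root is just there so it can be exercised without real arrays):

```shell
# Report the check state and mismatch count for every md array.
# mismatch_cnt = 0 after a completed "check" means the array verified clean;
# a nonzero value on a parity RAID level warrants investigation.
sys="${1:-/sys/block}"            # sysfs root, overridable for testing
for md in "$sys"/md*/md; do
    [ -e "$md/mismatch_cnt" ] || continue     # no md arrays present
    printf '%s: sync_action=%s mismatch_cnt=%s\n' \
        "$md" "$(cat "$md/sync_action")" "$(cat "$md/mismatch_cnt")"
done
```

Scheduling the check plus reading mismatch_cnt afterwards is essentially what distro "raid-check" cron jobs do for you.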
Raid Is Not A Backup!
Oh, and your RAID controller should monitor for SMART errors, and you should look to replace disks when you start seeing sector rewrites.
But from the beginning of TFA, after reading this:
"Bad blocks. Two of them. However, as the blocks aren't anywhere near the active volumes they go largely undetected."
The FIRST thing that came to my mind was: "What!? Isn't that a long-solved problem!? Aren't disks / controllers / RAID setups much better now at detecting such problems right away?"
I've got a huge issue with the "largely undetected" part. I may, at some point, need storage for a gig I'm working on, and I certainly don't want problems like that to go "largely undetected".
So quickly skipping most of the article and going to the comments:
"It's worth pointing out that many hardware RAID controller support a periodic "scrubbing" operation ("Patrol Read" on Dell PERC controllers, "background scrub" on HP Smart Array controllers), and some software RAID implementations can do something similar (the "check" functionality in Linux md-style software RAID, for example). Running these kinds of tasks periodically will help maintain the "health" of your RAID arrays by forcing disks to perform block-level relocations for media defects and to accurately report uncorrectable errors up to the RAID controller or software RAID in a timely fashion."
To which the author of TFA himself replies:
"Yes, that is something I should have made clearer. This is the very reason that RAID systems have background processes that scan all the blocks."
Which leaves me all a bit confused about TFA, despite all the shiny graphs.
Basically, I don't really understand the premises of "bad blocks going largely undetected" in 2013...
A quick self test every day for all disks, and a long (i.e. full read) self test once a week.
The RAID is then checked on top of that once a month (although that slows things down a bit).
The combination of BMS and disk scrubbing at the RAID level should handle almost all of the issues that are pointed by the original post.
RAID scrubs can and do take a long time to complete, though; depending on the performance impact you are willing to suffer on a continuous basis, a proper scrub can take a week or two.
Proper scrubbing includes not just reading the RAID chunk on a disk but also reading the associated chunks from the other disks and verifying that the parity is still intact. With RAID5 you will not be able to recover if the parity check fails, as you won't know which chunk has gone bad.
I've been coding such systems for a while now and as a shameless plug would point to http://disksurvey.com/blog/ if there are things of interest I'd be happy to take requests and write about them as well.
The article assumes no scrubbing, which, as the article itself details, is a stupid thing to run without. So it's basically "why pointing a gun at your foot and pulling the trigger is bad": "because you're going to shoot yourself in the foot".
daily_status_zfs_enable="YES"
daily_scrub_zfs_enable="YES"
daily_scrub_zfs_default_threshold="6" # in days
and it will scrub the pools every 6 days (and send you a report in the daily run output).
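You can also kick off a check on demand in between (the pool name "tank" is just an example):

```shell
# Start a scrub by hand and inspect its progress and results
zpool scrub tank
zpool status -v tank    # shows scan progress, repaired data, and any
                        # files with unrecoverable errors
```

With redundancy in the pool, the scrub repairs checksum failures in place rather than just reporting them.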