Another fun one was the quad redundant power supply with all four plugs going into the same power strip.
And there was the power supply that blew out every drive in the box simultaneously. I suppose it's not any worse than a lightning strike, except that the onsite tech assumed it was a cooling failure: he replaced the dusty fans and all the drives, thus destroying an entire second set of drives upon powerup (and fans, I would guess).
Then there was the poison drive tray that bent the backplane pins of every slot you jammed it into. That turned into a huge, expensive disaster.
Which also makes for very nice RAID failures, like this one that has happened to me on an HP controller:
A drive fails because of some SCSI electronics problem and when you replace it, the controller gives it a different SCSI ID. Now, the controller maps RAID arrays to drives, and it is now impossible to add the replacement drive to the degraded array because SCSI IDs in these controllers aren't user defined and the controller doesn't allow the degraded array to be modified.
And since the controller has now happily written its configuration onto the drives, shuffling the drives around to try to force it to give up its internal configuration gets you nowhere.
Oh, and the controller is an onboard controller, so you can't just replace it with another one (which would also read the configuration on disk and put itself in the same stupid state, I suppose).
I'd add that current "Advanced Format" drives are tremendously better than most older drives. If your drive is an older 1 or 2 TB model with 512-byte sectors, use it only for backups or some other menial duty with unimportant data.
If you're willing to take a capacity hit for improved write performance, RAID-1+0 is great, though you can only survive two disk failures if they land in different mirror pairs.
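A quick back-of-envelope calculation (a sketch, assuming independent failures and a plain stripe of mirrored pairs) shows why larger RAID-1+0 sets handle a second failure better: once one disk has died, only its mirror partner among the remaining drives is fatal.

```shell
# With n mirrored pairs (2n disks), after the first disk fails only the
# dead disk's partner is fatal: 1 out of the remaining 2n-1 drives.
for pairs in 2 3 4 6; do
    awk -v n="$pairs" 'BEGIN {
        printf "%d pairs (%2d disks): P(second failure is fatal) = 1/%d = %4.1f%%\n",
               n, 2 * n, 2 * n - 1, 100 / (2 * n - 1)
    }'
done
```

So a 4-disk array loses everything on one in three second failures, while a 12-disk array is down to roughly one in eleven.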
You should also not treat RAID as infallible; if the data is important, it should be mirrored in multiple locations.
Right. Because it bears repeating still: RAID is not for backup, RAID is for high availability.
Also, only one hot spare is in the set. Another cost saving.
Yet another decision: RAID5 across 7 disks plus a hot spare, instead of, say, across 6 or 5 disks. That's two more chances for a disk to go bust and have to be rebuilt from parity.
What if the disks are OK but the server host adapter card gets fried? Or the cable between the server and array? Some disk arrays allow for redundant access to the array, and some OS's can handle that failover.
Before I read the article, I thought it might discuss heat. Excessive heat is usually the cause when disk arrays start melting down one after another. Usually the meltdown happens in an on-site server closet/room which was never properly set up for running servers 24/7. The straw that breaks the camel's back is usually a combination of more equipment being added and hot summer days. Then portable ACs are purchased to temporarily mitigate things, but if their condensation reservoirs are not regularly emptied, they stop helping. This situation occurs more often than you would imagine; luckily, I have not been the one who had to deal with it every time I have seen it (although sometimes I have). Usually the servers involved are non-production ones which didn't make the cut to go into the data center.
The heat problem happens in data centers as well, believe it or not. A cheap thermometer is worth buying if you sense too much heat around your servers. Usually the problem there is less severe: the general data center temperature is just a few degrees higher than it should be, but that still leads to more equipment failures.
"Overall our experiments can confirm previously reported temperature effects only for the high end of our temperature range and especially for older drives. In the lower and middle temperature ranges, higher temperatures are not associated with higher failure rates. This is a fairly surprising result, which could indicate that datacenter or server designers have more freedom than previously thought when setting operating temperatures for equipment that contains disk drives. We can conclude that at moderate temperature ranges it is likely that there are other effects which affect failure rates much more strongly than temperatures do."
I'm a fan of the temperature@lert USB sensors; $130 gets you peace of mind: http://www.temperaturealert.com/
The mirror drives that also have bad sectors? Then you go even longer without noticing you have problems.
About a month ago, I had 4 drives scattered around the house, each in its own enclosure, and I wanted to consolidate them into one unit. Money was an issue, so I wanted to recycle as many of them as possible instead of buying new ones. A Synology NAS along with a single extra drive allowed me near-optimal use of space with 1-drive redundancy. Of course, I have weekly backups to an external drive, so even if the array fails during a drive swap, I'll still have all my important files.
Any other solution would either require me to buy more drives (a significant expense at $100+ a pop), sacrifice redundancy, or build my own NAS with ZFS (which would have significant administration overhead, cost more, and be larger than my Synology unit).
Synology's devices support automatic backup to S3; use it.
Desktop drives will drop out of RAID arrays frequently, so if you go that route you have to use RAID6. If a disk drops into deep error-recovery mode after hitting a physical error, it won't respond to the RAID controller fast enough and will be considered a dead drive. It will subsequently be re-detected, and then the array has to be rebuilt.
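Part of this is the drive's internal error-recovery timeout. Many drives expose SCT Error Recovery Control, which smartctl can query and set so the drive gives up on a bad sector quickly and reports the error instead of stalling long enough to get kicked from the array. A sketch (the device name is just an example, and the setting typically resets on power cycle):

```shell
# Check whether the drive supports SCT Error Recovery Control
smartctl -l scterc /dev/sda

# Cap read and write error recovery at 7.0 seconds (units of 100 ms),
# so the drive reports an unreadable sector before the RAID layer
# times out and drops the whole disk from the array.
smartctl -l scterc,70,70 /dev/sda
```

Many cheap desktop drives don't support SCT ERC at all, which is exactly why they behave so badly in arrays.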
If you're using low capacity, enterprise class SAS drives, mostly, yes. However when using large capacity SATA drives, it most definitely isn't.
SATA drives (even "enterprise" SATA drives) have an official unrecoverable error rate of 1/10^14. From my experience, the truth is more like 1/10^13.
10 TB is about 8×10^13 bits. At an error rate of 1/10^13, every full read of a 10 TB array should therefore produce several unrecoverable bit errors on average; even at the official 1/10^14 rate, the expected count is close to one. When rebuilding a 10 TB RAID-5 array (just a few 3 or 4 TB drives), that means you're quite likely to hit an unrecoverable read error that prevents the rebuild from ever completing without corruption.
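To put rough numbers on it (a sketch assuming independent bit errors, which is optimistic): the chance of reading N bits cleanly is (1 - rate)^N ≈ e^(-rate·N).

```shell
# Expected unrecoverable errors, and odds of a clean 10 TB read,
# at the observed (1e-13) and official (1e-14) URE rates.
awk 'BEGIN {
    bits = 10 * 8e12                      # 10 TB = 8e13 bits
    for (e = 13; e <= 14; e++) {
        rate = 1 / (10 ^ e)
        printf "URE rate 1e-%d: expected errors = %.1f, P(clean 10 TB read) = %.1f%%\n",
               e, rate * bits, 100 * exp(-rate * bits)
    }
}'
```

Even at the manufacturer's quoted rate, a full read of a 10 TB array completes cleanly less than half the time; at 1/10^13 it essentially never does.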
You might want to look at this... http://hardforum.com/showthread.php?t=1285254
btw, I am currently working on recovering an 8-bay DroboPro that's in a reboot loop...
RAID-6 offers different compromises relative to RAID-5 (for one, twice the parity space), so it isn't quite like one is the successor of the other. And once you're talking about multiple disk failures, you're at the existential point where you should probably be talking about whole array failures (e.g. your controller has quietly been writing junk for the last hour), and how to deal with that scenario.
Given the current price of hard drives, I don't get how "twice the parity space" can even matter. Furthermore, modern RAID controllers perform almost exactly the same using RAID-5 or RAID-6 (verified on most 3Ware, LSI, Adaptec and Areca controllers).
So yes, RAID-6 definitely is RAID-5's successor.
> how to deal with that scenario.
RAID is not an alternative to backup and never was. You deal with that scenario through proper backup or replication.
As the author realizes, hardware RAID, or naive software RAID, is becoming more and more useless given the size of volumes and the bit densities (and thus error rates) of those drives.
The only solution to this is a proper file system and volume manager that can proactively discover bit rot and give you time to do something about it. At the moment, the only real solution is ZFS.
But there's a word beginning with "O" and ending with "racle": they are so focused on the short-term buck that they are massacring their potential revenues with their short-sighted approach of keeping Solaris out of everyone's hands.
Answering my own question: http://git.kernel.org/?p=linux/kernel/git/mason/btrfs-progs....
... and Valerie Aurora's less active on that than I'd thought. Hrm.
I hear TRIM support for ZFS is now in the HEAD branch and in the works for FreeBSD 10.
Which one: the FUSE one, or the native ZFS on Linux under CDDL?
Durability is like a diamond: it is forever.
Took the disks out to find that they had sequential serial numbers.
Called vendor for replacement only to have them tell me that they had issues with that batch, yet did not make any attempt to inform me.
Spent the day restoring from tape backup.
TLDR: If you buy a pre-built server check that the disks aren't all from the same batch.
It used to be worse: all the drives in a RAID setup had to have exactly the same specifications or the thing wouldn't work, which pretty much guaranteed near-simultaneous failure of multiple drives. Even today, with somewhat more flexible software RAID setups, it's still a problem.
At a place I used to work we used to joke that a drive failure warning from a RAID controller was nothing more than a signal to get out the backup tapes and start building a new server.
A good RAID controller won't let a drive with bad sectors continue to operate silently in an array. Once an unreadable sector is detected, the drive is failed immediately, period.
The problem is in the detection, but good RAID controllers "scrub" the whole array periodically. If yours doesn't, or if you are paranoid like me, the same can be accomplished by having "smartd" initiate a long SMART self-test on the drives every week.
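For reference, a smartd.conf line that does this (path and device name are examples; see `man smartd.conf` for the scheduling regex):

```shell
# /etc/smartd.conf
# -a                        monitor all SMART attributes and log changes
# -s S/../.././02           short self-test every day at 02:00
# -s L/../../7/03           long (full surface) self-test every Sunday at 03:00
# -m root                   mail warnings to root
/dev/sda -a -s (S/../.././02|L/../../7/03) -m root
```

The long test forces a full surface read, which is what flushes out latent bad sectors before a rebuild needs them.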
Good controllers will even fail drives when a transient error happens, for example one which triggers a bad-block reallocation by the drive. This is what leads some people to "fix" failed drives by taking them out and putting them back in. After a rebuild the drive will operate normally without any errors, but you are putting yourself at serious risk of losing the array if another drive happens to fail during the rebuild, so DON'T do this.
Some others will react differently to these transient errors. EMC arrays, for instance, will copy a suspicious drive to the hot-spare and call home for a replacement. This is much faster than a full rebuild, but also much safer because it doesn't increase the risk of a second drive failing while doing it.
Oh, and did I mention that cheap drives lie?
Avoid using desktop drives for important data on production servers, even in a RAID configuration, unless you have some kind of replicated storage layer above the RAID layer (meaning you can afford to recover one node from backup, for speed, and resync it with the master to make it current).
I also found that higher-end drives lie: I used nearline SAS drives that failed easily and often, and standard SATA drives that were more resilient. It depends on the vendor and make. It may also depend on the batch, but I never found proof of that in my work.
A bad block reallocation can be seen as a transient error from the controller's perspective, but it isn't silent provided the drive doesn't lie about it (and one would expect that a particular storage system vendor doesn't choose - and brand - drives that lie to their own controllers).
The storage system may ignore medium errors that force a repeated read (below a certain threshold), but they shouldn't ignore a medium error where the bad sector reallocation count increases afterwards (which is just another medium error threshold being hit, this time by the drive itself).
I'm not saying that higher-end drives are more or less reliable. Given that most standard SATA errors go undetected for longer, one could even argue that higher-end drives merely seem to fail much more frequently... I've had more FC drives replaced in a single EMC storage array than in all the rest of the servers (which have a mix of internal 2.5in SAS and older 3.5in SCSI-320 drives), and we certainly replace more drives in servers than in desktops.
But that's another topic entirely.
Bottom line is that reliably storing data is more complicated than just writing it on to a disk.
The author is also wrong when saying "a non-recoverable read error [is] a function of disk electronics not a bad block". An NRE can happen for different reasons; one of them is when the (data and error-correction) bits in the block get corrupted in a way that prevents the error-correction logic from detecting the error. So the block is technically bad, just not bad enough to cause the drive logic to declare a read failure.
I have actually used 'defective' enterprise disks in consumer systems for years after they were labeled defective by storage system vendors. About a decade ago, I used to buy such defective enterprise disks in bulk at auction from server and storage manufacturers and sold them as refurbished disks to consumers after testing.
I really don't understand this skewed perception of consumer- vs enterprise-grade harddrives. Do you believe that enterprise CPUs are more reliable than consumer CPUs? How about enterprise NICs vs consumer NICs?
Consumer-grade drives are sold in volumes so much larger than enterprise-grade drives that vendors have strong incentives to make them as reliable as possible. I would even say they have incentives to make them more reliable than enterprise-grade drives, because a single percentage point improvement in reliability will drastically reduce the costs associated with warranty claims and repairs.
My own experience confirms the CMU study. I have worked at 2 companies, each selling about 2-5 thousand drives as part of appliances, to customers across the world. One company was using SCSI drives, the other IDE/SATA. And the replacement rates were similar.
I can see your point that the usage being different could invalidate the CMU findings about consumer vs. enterprise drive reliability, but I don't personally believe it explains them. The CMU study, plus my anecdotal evidence on 2-5 thousand drives, plus the fact that no study has ever shown data suggesting enterprise drives are more reliable, makes me think that they are not.
[1] http://static.googleusercontent.com/external_content/untrust...
One outfit I worked at decided to stick a brand new APC UPS in the bottom of the rack, as it was in their office. It promptly caught fire and burned out the entire rack. The fire protection system did fuck all as well, other than scare the shit out of the staff. Scraping molten cat5 cables off with a paint scraper was not fun.
Fortunately it was all covered by the DR procedure. Tip: write one and test it. That's more important than anything.
Basically, you keep multiple copies of the same data across different clusters of hardware. If a drive or two (or ten) go bad, you just replace them; there is no rebuild time. Sure, it costs some disk space to keep n copies of the data, but drives are only getting cheaper, and de-duplication schemes are being developed to help with this. It's not like RAID-6 is super efficient either.
Just my two cents...
for raid in /sys/block/md*/md/sync_action; do
    echo "check" > ${raid}
done

does that fix the issue? i run that once a week. i thought i was ok. am i not? if i am, isn't this old news?

The open-source version of it, the BSD version, is only up to v28. It seems that after that, Oracle is no longer putting out updates as open source, and what happens then? Disparity between the Oracle version and the BSD version? Are features still being developed? Most of the limitations listed in the wiki haven't changed at all over the past years and are still listed as under development.
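It helps, but only if you also look at the result: after a "check" pass completes, md records how many inconsistent sectors it found. A small sketch (the overridable sysfs root is just there so it can be exercised without real arrays):

```shell
# Report the check state and mismatch count for every md array.
# mismatch_cnt = 0 after a completed "check" means the array verified clean;
# a nonzero value on a parity RAID level warrants investigation.
sys="${1:-/sys/block}"            # sysfs root, overridable for testing
for md in "$sys"/md*/md; do
    [ -e "$md/mismatch_cnt" ] || continue     # no md arrays present
    printf '%s: sync_action=%s mismatch_cnt=%s\n' \
        "$md" "$(cat "$md/sync_action")" "$(cat "$md/mismatch_cnt")"
done
```

Scheduling the check plus reading mismatch_cnt afterwards is essentially what distro "raid-check" cron jobs do for you.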
Raid Is Not A Backup!
Oh, and your RAID controller should monitor for SMART errors, and you should look to replace disks when you start seeing sector rewrites.
But from the beginning of TFA, after reading this:
"Bad blocks. Two of them. However, as the blocks aren't anywhere near the active volumes they go largely undetected."
The FIRST thing that came to my mind was: "What!? Isn't that a long-solved problem!? Aren't disks / controllers / RAID setups much better now at detecting such problems right away?"
I've got a huge issue with the "largely undetected" part. I may, at some point, need storage for a gig I'm working on, and I certainly don't want problems like that to go "largely undetected".
So quickly skipping most of the article and going to the comments:
"It's worth pointing out that many hardware RAID controller support a periodic "scrubbing" operation ("Patrol Read" on Dell PERC controllers, "background scrub" on HP Smart Array controllers), and some software RAID implementations can do something similar (the "check" functionality in Linux md-style software RAID, for example). Running these kinds of tasks periodically will help maintain the "health" of your RAID arrays by forcing disks to perform block-level relocations for media defects and to accurately report uncorrectable errors up to the RAID controller or software RAID in a timely fashion."
To which the author of TFA himself replies:
"Yes, that is something I should have made clearer. This is the very reason that RAID systems have background processes that scan all the blocks."
Which leaves me all a bit confused about TFA, despite all the shiny graphs.
Basically, I don't really understand the premises of "bad blocks going largely undetected" in 2013...
A quick self test every day for all disks, and a long (i.e. full read) self test once a week.
The RAID is then checked on top of that once a month (although that slows things down a bit).
The combination of BMS and disk scrubbing at the RAID level should handle almost all of the issues that are pointed by the original post.
RAID scrubs can and do take a long time to complete, though; depending on the performance impact you are willing to suffer on a continuous basis, a proper scrub can take a week or two.
Proper scrubbing includes not just reading the RAID chunk on a disk but also reading the associated chunks from the other disks and verifying that the parity is still intact. With RAID5 you will not be able to recover if the parity check fails, as you won't know which chunk has gone bad.
I've been coding such systems for a while now and as a shameless plug would point to http://disksurvey.com/blog/ if there are things of interest I'd be happy to take requests and write about them as well.
The article assumes no scrubbing, which, as the article itself details, is a stupid thing to run without. So it's basically "why pointing a gun at your foot and pulling the trigger is bad": "because you're going to shoot yourself in the foot".
daily_status_zfs_enable="YES"
daily_scrub_zfs_enable="YES"
daily_scrub_zfs_default_threshold="6" # in days
and it will scrub the pools every 6 days (and send you a report in the daily run output).
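You can also kick off a check on demand in between (the pool name "tank" is just an example):

```shell
# Start a scrub by hand and inspect its progress and results
zpool scrub tank
zpool status -v tank    # shows scan progress, repaired data, and any
                        # files with unrecoverable errors
```

With redundancy in the pool, the scrub repairs checksum failures in place rather than just reporting them.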