* my company installed about 4500 Seagate Barracuda ES.2 drives (500 GB, 750 GB and mostly 1 TB) between 2008 and 2010. These drives are utter crap: in 5 years we got about 1500 failures, much worse than the previous generation (250 GB to 750 GB Barracuda ES).
* After replacing several hundred drives, we decided to jump ship in 2010 and went with Hitachi (nowadays HGST). Of the roughly 3000 Hitachi/HGST drives used in the past 3 years, we have had about 20 failures. Only one of the 200 Hitachi drives shipped between 2007 and 2009 has failed. Most of the failed drives were 3 TB models, ergo the 3 TB HGST HUA drives are less reliable than the 2 TB, themselves less reliable than the 1 TB model (which is, by all measures, absolutely rock solid).
* Of the few WD drives we installed, we replaced about 10% in the past 3 years. Not exactly impressive, but the sample isn't significant either.
* We replaced a number of Seagate Barracudas with Constellations, and these seem reliable so far; however, the numbers aren't significant yet (only about 120 deployed in the past 2 years).
* About SSDs: SSDs are quite a hit-and-miss game. We started back in 2008 with M-Tron (now defunct). M-Tron drives were horribly expensive, but my main compilation server still runs on a bunch of them. Of all the M-Tron SSDs we had (from 16 GB to 128 GB), not one has ever failed. They are 5 years old now, and still fast.
We've tried some other brands: Intel, SuperTalent... Some SuperTalent SSDs had terrible firmware, and the drives would crash under heavy load! They disappeared from the bus when stressed, but came back OK after a power cycle. Oh my...
So far, unfortunately, SSDs seem to be about as reliable as spinning rust. The latest generations fare better, and may actually best current hard drives (we'll see in a few years how they hold up).
http://venturebeat.com/2011/03/07/western-digital-buys-hitac...
1. Infant mortality. Drives fail after a couple months of use.
2. The 3-year mark. This is where failures begin for typical workloads.
3. The 4-6 year mark. This is when you can expect the drives that haven't failed earlier to fail. By this point, we're looking at roughly 33% failed.
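The timeline above can be turned into a rough cumulative survival curve by compounding annualized failure rates. The rates below are illustrative guesses chosen to match this commenter's rough phases (infant mortality, then a rise after year 3), not measured data from the article:

```python
# Cumulative drive survival from per-year annualized failure rates.
# These rates are illustrative assumptions, not measured data.
annual_failure_rate = [0.05, 0.015, 0.015, 0.10, 0.10, 0.10]  # years 1..6

surviving = 1.0
for year, afr in enumerate(annual_failure_rate, start=1):
    surviving *= (1.0 - afr)          # compound each year's survival
    print(f"after year {year}: {surviving:.1%} surviving")
```

With these numbers, about 33% of drives have failed by the end of year 6, which is in the ballpark of the figure quoted above.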
Interesting that my experiences roughly match up with Chart 1.
My experience is with 10 to 15k RPM SAS drives. Slower 7200 RPM drives? No idea; I haven't used them in servers in a while. They seem more of a crapshoot to me. SSDs, thus far, are even more of a crapshoot: we don't use them in servers, and only hesitantly in desktops/laptops, and then only Intel.
It is very disappointing how flaky and unreliable SSDs have been, when their lack of moving parts promised just the opposite.
Back in 1999/2000 I had a habit of building personal as well as commercial servers in datacenters with compact flash parts (plain old consumer CF cards) as boot devices, with fault tolerance in mind. There was a price to be paid: these devices needed to be mounted, and run, read-only.
But they ran forever. I never had one part fail. Just plain old CF drives mated directly to the IDE interface.
Now fast forward to 2013 and new servers we deploy for rsync.net have a boot mirror made of two SSDs ... things have gone well, but our general experience and anecdotal evidence from other parties gives us pause.
One thought: an SSD mirror, if it fails from some weird device bug or strange "wear" pattern, would fail entirely, since both members of the mirror get the exact same treatment. For that reason, when we build SSD boot mirrors, we do so with two different parts: either one current-gen and one previous-gen Intel part, or one Intel part and one Samsung part. That way, if there is some strange behavior or defect or wearing issue, they both won't experience it.
If you still followed up on your idea of using a read-only root like you did with the CF cards, and found a safe place for the logs, you could run the SSDs in the same mode. Why not go that route?
It was a huge win for uptime.
I'd echo the sentiment seen elsewhere in the comments about Seagate drives vs. Hitachi drives. Both for SATA and NL-SAS. Hitachi 1TB were rock solid compared to Seagate.
* Most consumer drives over 2 TB have extremely poor reliability. Just check any Amazon or Newegg review page (DOA and early mortality show up with increasing frequency). Yes, I know reviews are not accurate, but since there is no public information on drive failure rates, there is not much else to go on.
* The reduction of manufacturer warranties since the Thailand floods. Surprise: they never changed them back to the original 3-year warranty.
If you have a large array of disks, there is nothing really to worry about. If you have a small set of drives, spend a little extra and get the "Black" or RE drives with a 5-year warranty. Avoid any "Green" drive.
Check your S.M.A.R.T. data. Look at the head-park number (Load_Cycle_Count, I think it is called; I can't look it up now). If it is a six-digit number, you are in trouble. For a server you want it to be on the same order as the number of power-ups. Anything else and you have to ask yourself "why?"
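The check described above can be scripted against the text output of `smartctl -A`. A minimal sketch, assuming the typical smartmontools attribute names (`Load_Cycle_Count`, `Power_Cycle_Count`) and that the raw value is the last column, both of which vary by drive firmware:

```python
# Sketch: flag a worrying head-park count in `smartctl -A` output.
# Attribute names and the 100:1 ratio threshold are assumptions;
# the rule of thumb above is "same order of magnitude as power-ups".
def raw_value(smart_output: str, attribute: str) -> int:
    """Return the RAW_VALUE column (last field) for a SMART attribute."""
    for line in smart_output.splitlines():
        if attribute in line:
            return int(line.split()[-1])
    raise KeyError(attribute)

def head_parking_suspicious(smart_output: str, factor: int = 100) -> bool:
    cycles = raw_value(smart_output, "Load_Cycle_Count")
    power_ups = raw_value(smart_output, "Power_Cycle_Count")
    # Six digits, or wildly out of proportion to power-ups: ask "why?"
    return cycles >= 100_000 or cycles > power_ups * factor

sample = """\
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       54
193 Load_Cycle_Count        0x0032   001   001   000    Old_age   Always       -       234861
"""
print(head_parking_suspicious(sample))  # True: this drive is parking itself to death
```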
Edit: adding. The 1 TB and smaller Greens were disasters; I ruined a lot of them. I was told the 2 TB and larger Greens didn't have the head-park issue, but I spent part of last week replacing a storage unit populated with 2 TB Greens after a spindle failed (>200 unrecoverable blocks), and found that some of the 2 TB Greens were load-cycling into the 200,000 range while others weren't racking them up. They were all identical models purchased at the same time. Maybe they had different firmware? I replaced them with Reds. Reds aren't supposed to park, and they won't try to recover a bad sector for more than a few seconds, so they don't hang your RAID when they hit bad sectors.
Some of them were also crippled in firmware so you couldn't use them in RAID1 arrays, but this might have changed.
http://www.newegg.com/Product/Product.aspx?Item=N82E16822236...
http://www.amazon.com/WD-Red-NAS-Hard-Drive/dp/B008JJLZ7G
I've been running 3 of these in a RAID-5 NAS, no issues so far (not that that's any kind of indicator on a system which idles as a backup target all day).
1) select the make and model of drive you want
2) buy the same model of drive from multiple vendors so you get different serial and build numbers; even if you're buying only two drives, buy each from a separate location or vendor.
3) mix up the drives across arrays so they don't die together. Place stickers with the purchase date and invoice number on each drive to keep them straight.
All this because, when one drive dies from a defect or by hitting a similar MTBF, other drives with nearby serial or build numbers tend to die around the same time for similar reasons.
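The mixing step can be automated once drives are tagged with a batch. A minimal sketch, assuming "batch" means vendor plus invoice (serial-number prefixes would work just as well), that deals drives out round-robin so no array ends up with two drives from the same batch (as long as a batch has no more members than there are arrays):

```python
# Sketch: spread same-batch drives across different arrays so a bad
# batch doesn't take out two members of one RAID set at once.
from collections import defaultdict

def spread_batches(drives, num_arrays):
    """drives: list of (serial, batch) pairs. Returns num_arrays lists.
    Members of one batch land in consecutive arrays (round-robin),
    so they only collide if a batch is larger than num_arrays."""
    by_batch = defaultdict(list)
    for serial, batch in drives:
        by_batch[batch].append(serial)
    arrays = [[] for _ in range(num_arrays)]
    slot = 0
    for members in by_batch.values():
        for serial in members:
            arrays[slot % num_arrays].append(serial)
            slot += 1
    return arrays

arrays = spread_batches([("s1", "A"), ("s2", "A"), ("s3", "B"), ("s4", "B")], 2)
print(arrays)  # each array mixes batches A and B
```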
From owning hard drives across 8 or 9 generations of replacing or upgrading since the '90s, on all types of servers, desktops and laptops: the day you buy a new piece of equipment is the day you buy its death. Manage that death proactively, because dealing with it gets more and more tiring each time.
Drives have died for me both in 24/7 powered systems and through power cycles. Drives have reported intermittent failures for many months, yet still lived for years without any actual data loss. The oldest drive I still have spinning is a 200 GB IDE disk containing the OS for my old OpenSolaris ZFS NAS; it must be getting on for 9 years.
I advise having a backup of every drive you own, preferably two. I built a new NAS last week, 12x 4 TB drives in a raidz2 configuration; with ZFS snapshots, it fulfills 2 of the 3 requirements for backup (redundancy and versioning), while I use CrashPlan for cloud backup (distribution, the third requirement). The nice thing about CrashPlan is that my PCs can back up to my NAS as well, so restores are nice and quick; pulling from the internet is a last resort.
Incidentally, about "consumer-grade drives": the last time I looked into this, I was led to believe that if it's SATA and 7200 RPM (or less), there's no hardware distinction; it's just firmware. Consumer drives try very hard to recover data from a bad sector, while enterprise/RAID drives have a recovery time limit to prevent them being unnecessarily dropped from an array (which will have its own recovery mechanisms). That's it.
There is a long feature comparison [1] that mentions things like higher RPM, higher build quality, larger magnets, air turbulence control, dual processors, etc.
I'm no specialist in hard drives; I just remember reading this stuff when trying to figure out whether I needed it. In the end, for my small-scale corporate file server, I chose ZFS raidz with consumer-grade disk drives.
[1] Enterprise-class versus Desktop-class Hard Drives: http://download.intel.com/support/motherboards/server/sb/ent...
They even admit to the problem themselves at the end:
"Some hard drive manufactures may differentiate enterprise from desktop drives by not testing certain enterprise-class features, validate the drives with different test criteria, or disable enterprise-class features on a desktop class hard drives so they can market and price them accordingly. Other manufacturers have different architectures and designs for the two drive classes. It can be difficult to get detailed information and specifications on different drive modes."
That PDF tells me nothing interesting. It's marketing crap for clueless executives, not a technical analysis. (Given their absurd obsession with "Higher RPM" as some sort of defining characteristic, it's not even relevant to the statement I made in the first place.)
Certainly the old 9.1 GB SCSI disks that were so popular 10 years ago are well past being worth the power to spin them up now.
But these drives would still be useful. What about, say, shipping them to NGOs located in Africa?
But there are other considerations:
* This would also result in a big pile of waste in Africa, as their recycling infrastructure is limited.
* They need food, shelter, stable politics and functional education before they can make any use of computers.
* They have limited energy supply. Low powered tablets / laptops are much more useful.
Hard drive space per dollar grows exponentially, and drives are big, weighty things. The window of time in which reuse would be economical is short, and the value dubious.
http://research.microsoft.com/pubs/144888/eurosys84-nighting...
I've been hit-and-miss: I've gotten a few drives replaced, and had a few warranties expire. But pretty much every disk drive fails eventually.
Think about it: it's a commodity. If it lasted much longer than the warranty, they spent too much on robustness for the price.
Proper statistical analysis would help you there.
Yes, if you know the probability distribution. If you don't know the distribution, you cannot calculate your confidence, and thus cannot do a proper statistical analysis.
And, guess what, nobody knows the probability distribution of hard drive failures. That's exactly what they are trying to find out.
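To be fair, survival analysis has tools that assume no particular distribution. A sketch of the Kaplan-Meier estimator, which also handles drives that are still alive (right-censored), with made-up data and ties between event ages ignored for brevity:

```python
# Sketch: Kaplan-Meier survival estimate. Needs no assumed failure
# distribution; censored entries (failed=False) are drives still
# running when observation stopped. Ages in years; data is made up.
def kaplan_meier(observations):
    """observations: list of (age, failed). Returns [(age, survival)]."""
    events = sorted(observations)
    at_risk = len(events)
    survival = 1.0
    curve = []
    for age, failed in events:
        if failed:
            # at each failure, multiply by the fraction surviving it
            survival *= (at_risk - 1) / at_risk
            curve.append((age, survival))
        at_risk -= 1  # censored drives still leave the risk set
    return curve

data = [(0.5, True), (1.0, False), (2.0, True), (3.0, False), (4.0, True)]
curve = kaplan_meier(data)
print(curve)
```

Backblaze's published survival curves are essentially this idea applied at scale: most of their drives haven't failed yet, so the censoring matters.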
Amazon could use some competition in this space, IMNHO.
80% of drives surviving after 5 years seems right; this is what we're seeing as well. The hardware is decommissioned faster than the drives fail.
I'm not sure the information would be all that valuable anyway. Google's data-center environments, workloads and requirements are likely pretty different from yours, so how useful would the numbers really be?
We are constantly looking at new hard drives, evaluating them for reliability and power consumption. The Hitachi 3TB drive (Hitachi Deskstar 5K3000 HDS5C3030ALA630) is our current favorite for both its low power demand and astounding reliability. The Western Digital and Seagate equivalents we tested saw much higher rates of popping out of RAID arrays and drive failure. Even the Western Digital Enterprise Hard Drives had the same high failure rates. The Hitachi drives, on the other hand, perform wonderfully.
In the second article they say that the WD Red is 2nd in reliability (the WD Red did not exist in 2011). I'm happy that I've got a cheap Hitachi Ultrastar. But who knows.
As a personal anecdote: WD Green failure rates are huge here. Across 240 drives in 24/7 desktop machines, I've replaced at least 20 drives in the last 12 months.
We replaced the drives with a different brand and the 'failures' went away.
Odd question, but I've always wondered: these things just seem to last forever.
I don't have any hardware to read my pile of 5.25" Atari 800 disks.
http://static.googleusercontent.com/external_content/untrust...
Also, no hardware raid, battery, or cap.
Source: worked at Eye-Fi, built 2PB storage
It is not true that the pod team must remove the 4U server from the rack. It slides out like a drawer (no tools required; it takes maybe 10 seconds). The drive or motherboard is then replaced, and you slide the drawer back in. So the 4U server must slide 18 inches one way, but zero cables have to be unplugged or replugged. This takes only one technician and no "server lift"; the drawer supports all the weight.
I'm not defending this design, just correcting a mistake. Backblaze frankly "makes do" with this design because nobody will step up and make anything that fits our needs better. The number 1 criterion is total system cost over the lifetime of the system, INCLUDING all the time spent on salaries of datacenter techs dealing with the pods. "Raw I/O performance" is not that important for backup, so trying to sell us an awesome EMC or NetApp that costs 10x as much and has 10x the raw I/O performance is not very compelling to us. But if you came up with a design that let our datacenter technicians replace a drive faster while not significantly increasing overall costs elsewhere, we SURELY would listen.
While I don't recommend them outright, we settled on 3U boxes from SuperMicro. http://www.supermicro.com/products/chassis/3u/837/sc837e26-r...
We somewhat affectionately dubbed them "mullets" as in business in the front, party in the rear.
They make 4U devices as well. Cost was about $1000. We added LSI MegaRAID 9280 controllers, about another $1500, and ran mini-SAS back to a controller node responsible for 4 JBODs.
1. you have to muck around with more firmware and sometimes reboot in order for changes to take effect
2. if a controller dies, you have to replace it with (almost) the exact same controller in order to read the data
3. Datacenters rarely lose power, take the HW raid money and instead put servers on true A+B power feeds.
CPUs are so fast these days that they can easily handle in software all the "stuff" that HW raid used to do.
Their hardware design is specifically geared towards their use-case and I applaud them for knowing how to optimize for their use-case. I wouldn't use it for mine but only because it's not a good fit.
They can open-source the hardware because the real secret sauce is the software and the hardware open sourcing gives them a nice edge in marketing.
Edited to add: they've optimized for hardware purchase price and given up reliability (HW RAID, battery, cap), performance, and maintainability. The strange thing is that the overall cost of a storage system is driven by power, not purchase price. Smarter RAID controllers, like the one I link above, let you manage power by spinning down disks while they are unused, reducing your power draw. I've never seen SW RAID do that. Take a look at Amazon Glacier, which I suspect uses this power-off strategy to drastically reduce costs.
So SpinRite may be handy, but throw the drive away after use.
Not everyone lives in the EU. In fact, the majority of people don't.
Outside your regulation-happy haven, warranty periods aren't arbitrary and do indicate durability under normal use.
Care to back that up with any real data instead of baseless consumer speculation relying on time travel?
Am I unaware that there are new paid spots on the first page of HN? (it would make sense I guess, from a business perspective)
TIA to anyone that can be of help on this, cheerio, (and good luck to Blackblaze, backblaze a path to a backblazing success!)
Don't like it? Don't vote for it.
I suspect the reason people do "burn-in" tests on hard drives is to make drives that suffer from early failure ("infant mortality," as described in the article) fail early enough that you can still RMA them with the manufacturer. Apart from that, I don't think there's much you can do to improve your chances.
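The core of such a burn-in is a write/read/verify pass. A minimal sketch against a file-backed stand-in for a device; a real burn-in would run a tool like `badblocks` against the raw disk and then check SMART counters, but the verify loop looks the same. The seeded PRNG makes the data stream replayable for verification:

```python
# Sketch: one write/read/verify burn-in pass over a scratch file.
# Seeded PRNG data, so the verify step can replay the exact stream.
# Needs Python 3.9+ for random.Random.randbytes.
import os
import random
import tempfile

def burn_in_pass(path: str, size_mb: int = 4, seed: int = 0) -> bool:
    chunk = 1 << 20  # 1 MiB
    rng = random.Random(seed)
    with open(path, "wb") as f:
        for _ in range(size_mb):
            f.write(rng.randbytes(chunk))
    rng = random.Random(seed)  # replay the same stream to verify
    with open(path, "rb") as f:
        for _ in range(size_mb):
            if f.read(chunk) != rng.randbytes(chunk):
                return False  # mismatch: the medium corrupted data
    return True

scratch = os.path.join(tempfile.gettempdir(), "burnin.img")
print(burn_in_pass(scratch))  # True unless bits got flipped
```

Several such passes over a drive's full surface, over a few days, is what gives infant-mortality failures a chance to show up inside the return window.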
The article actually makes this point (about anecdote), but their data suggests that failure rates do rise substantially after three years.
The optical drives I've had, on the other hand, are actually unreliable. They all seem to break down after about four years, and I don't use them all that often!
There is an inherent bias in the reviews, which is what makes the Backblaze report so interesting: they have less of a bias. Though, since they do not report actual disk vendors and models, you can't draw direct inferences, only the general trend.
I think this is most evident in the reduced warranty periods, compared to before, when 5 years was quite normal.
Can't seem to find relevant information on the website anywhere for this.
A little statistics is a dangerous thing.