Their systems involved shipping a server (effectively an appliance) to the customer with all of the working components on it. However, there was no build or deployment process for these components - so the only way to create a new server was to take an existing one and create a copy.
This was done by opening up a working server running RAID 1, removing one of its disks, and installing it in a new server alongside a blank disk. Let the RAID rebuild the data onto the blank, then put the original disk back in the first server with another blank and let that array rebuild too... result: a copied server!
It is amazing how even fairly technically-savvy people get sucked into the "RAID=backup" mentality. This story (in the above link) ended up costing the business owner tens of thousands of dollars.
Duds are hardware that goes bad: a disk drive, network adapter, NAS, or server. There is an effectively infinite number of ways and combinations in which things can break in a moderate-sized IT shop. How much money and effort are you willing to spend to make sure your weekend isn't ruined by a failed drive?
Floods are catastrophic events, not limited to acts of God. Your datacenter goes bankrupt and drops offline, not letting you access your servers. Fire sprinklers go off in your server room. Do you have a recent copy of your data somewhere else?
Bud is an accident-prone user. He accidentally deleted some files... the accounting files... three weeks ago. Or he downloaded a virus which has slowly been corrupting files on the fileserver. Or Bud's a sysadmin who ran a script meant for the dev server on the production database. How can we get that data back in place quickly before the yelling and firing begins?
There are more possible scenarios (hackers, thieves, auditors, the FBI), but if you're thinking about Dud, Flood, & Bud, you're in better shape than most people are.
Backup and disaster-recovery strategies seem really easy until you think through all the failure modes and realize the old axiom "you don't know what you don't know" exists to make your life full of pain and suffering.
Years ago my customers would literally restore their entire environments onto new metal to verify they had a working disaster recovery plan. Today most clients think having a "cloud backup" is awesome... until they realize, in the moment of disaster, that they are missing little things like software license keys, network settings, local admin passwords on Windows boxes, etc.
This is a feature of Oracle: the redo logs are replicated to the standbys as normal, so you have an up-to-date copy of them on the standby, but they are only applied after an x-hour delay. You can roll the standby forward to any intervening point in time and open it read-only to copy data out.
Less need of it these days with Flashback, of course, but it saved a lot of bacon.
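The delayed-apply idea described above can be sketched in a toy model: redo is shipped to the standby as soon as the primary generates it, but applied only once it is older than the configured delay, which is what lets you roll forward to a point just before a mistake. This is a conceptual sketch only, with hypothetical names; it is not Oracle's actual API.

```python
from dataclasses import dataclass, field

@dataclass
class RedoRecord:
    timestamp: float    # when the change was made on the primary (seconds)
    key: str
    value: str

@dataclass
class DelayedStandby:
    """Toy standby: receives redo immediately, applies it after a delay."""
    delay_seconds: float
    received: list = field(default_factory=list)  # shipped, not yet applied
    data: dict = field(default_factory=dict)      # applied state

    def ship(self, record: RedoRecord) -> None:
        # Redo arrives as soon as the primary generates it.
        self.received.append(record)

    def apply_pending(self, now: float) -> None:
        # Normal operation: apply only redo older than the delay window.
        self.roll_forward_to(now - self.delay_seconds)

    def roll_forward_to(self, point_in_time: float) -> None:
        # Recovery: apply everything up to an arbitrary point in time,
        # e.g. just before someone ran the wrong script.
        remaining = []
        for rec in self.received:
            if rec.timestamp <= point_in_time:
                self.data[rec.key] = rec.value
            else:
                remaining.append(rec)
        self.received = remaining
```

With a 4-hour delay, a corruption written at hour 5 is still sitting unapplied at hour 6, so the standby can be opened with the data as it was before the mistake.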
In those same 15+ years, mostly working for startups, there have been numerous drive failures. Unfortunately, failing (a) to verify backups before there's a failure, and (b) to practice restoring from them, has often meant that a dead drive costs several days' worth of work. In one instance, the VCS admin corrupted the entire repo, there were no backups, that admin was shown the door, and we had to restart from "commit 0" with code pieced together from engineers' individual workstations. That was when I got religious about making and testing backups for my work and the systems I was responsible for...
Not to say that it's the best solution for everyone, but simply that it leaves people no excuse for doing nothing.
http://en.wikipedia.org/wiki/Experience_good
Meaning that even while you're using it, you have no idea if it works.
My contention is that it's not a RAID array if it can silently stop being redundant without telling you.
At best it's a Possibly Redundant Array of Inexpensive Disks.
(The below is how my comment first read.)
(sarcastic) Yeah, it's only prudent to grab a drive out from time to time and make a surprise inspection of whether it's actually filled up a full 4/5th of the way (or whatever) with the actual data the volume is supposed to contain! And the remaining fifth had better look a damn sight like parity information!
Seriously though, a controller that fails like this isn't a RAID controller, because what separates it from a paper plate and a cardboard box? On the paper plate you write "RAID controller" and tape it to an already attached hard drive, and you put the remaining members of the redundant array into the cardboard box. No setup or even connection required!
Seriously seriously though, what you're suggesting is unacceptable. That's not a RAID controller, that's a scam.
My current setup goes as follows:
Servers in colocation get backed up daily to a server in the office. That office server then gets backed up daily to an iosafe.com fireproof and waterproof hard drive in the office, which, when I get a chance, will be bolted to the desk for further security. Clones of that server (which are bootable) are then made biweekly; one is kept in the office and one is taken offsite.
So the office server is the offsite for the colo server and the clone of that is the backup for the office.
The clones allow you to test the backup (hook it up and it boots basically).
Added: Geographically the office is about 3 miles from where the backup of the office is kept. But the office is about 40 miles from where the colo servers are kept.
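A chain like the one above can be written down as data and sanity-checked, for instance to confirm that every dataset has at least one offsite copy (the gap Flood exploits). All names here are hypothetical, a sketch of the setup as I read it:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class BackupCopy:
    location: str        # where the copy lives
    interval_days: int   # how often it is refreshed
    offsite: bool        # geographically separate from the primary?

# Hypothetical model of the chain described above:
chain = {
    "colo-servers": [
        BackupCopy("office-server", 1, offsite=True),   # ~40 miles away
    ],
    "office-server": [
        BackupCopy("iosafe-drive", 1, offsite=False),   # same room
        BackupCopy("offsite-clone", 14, offsite=True),  # ~3 miles away
    ],
}

def unprotected(chain: dict) -> list:
    """Datasets with no offsite copy at all."""
    return [name for name, copies in chain.items()
            if not any(c.offsite for c in copies)]
```

In this model both tiers pass the offsite check, which matches the point of the setup: each server's offsite copy is somebody else's local one.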
So: back up your data.
If I ever heard an SA working for me advocate that position, I would probably get them off of my team ASAP.
You still want off-site backups as well of course, in case of something more extreme, but they're usually going to be slower to recover from than nearby backups.
Even if they don't fail simultaneously, the mirror drive may fail or (even more likely) have read errors or flipped bits that will corrupt the restore or render it impossible.
Personally, I don't place much trust in any RAID configuration other than RAIDZ2 (ZFS; you can lose two drives and still recover all your data; every block is checksummed to avoid reading or restoring corrupted data).
But even ZFS can't protect you against accidental deletion, fire, theft, or earthquake.
You just have to structure your redundancy to survive multiple threat models.
In which case, the redundancy offered by RAID alone is grossly insufficient.