You should never, ever provide an environment that stores people's hard work without having professionals who know how to safeguard it.
If it makes you feel any better, I recently had to clean up a mess in a huge enterprise IT shop (if I named the organization you would immediately know them), involving hundreds of thousands of man-hours of work lost due to a lazy, incompetent DBA and the clueless management above her.
This "DBA" was the kind of person who came in at 9:45AM, took a 2 hour lunch at noon, and left at 3:30. Did I mention she refused a work from home option?
She didn't know how to set up cron jobs, so all of her backup scripts had to be run manually. If she was on vacation, they didn't get run. Surprise, surprise: the DB died after her long pre-Christmas vacation. Zero backups for the first three weeks of December.
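For what it's worth, the automation she was missing isn't much work. Something roughly like this, run from cron, would have done it (the dump command, paths, and offsite target below are placeholders for illustration, not details from that shop):

    #!/usr/bin/env python3
    """Nightly backup sketch -- the point is that cron runs it, not a person.
    The dump command, paths, and offsite target are placeholders.

    Example crontab entry (02:30 every night):
        30 2 * * * /usr/local/bin/nightly_backup.py >> /var/log/nightly_backup.log 2>&1
    """
    import datetime
    import pathlib
    import subprocess

    BACKUP_DIR = pathlib.Path("/var/backups/db")            # local staging area
    OFFSITE = "backup@offsite.example.com:/srv/backups/db"  # hypothetical remote target

    def main() -> None:
        BACKUP_DIR.mkdir(parents=True, exist_ok=True)
        stamp = datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
        dump_file = BACKUP_DIR / f"db-{stamp}.dump"

        # 1. Dump the database (pg_dump used purely as an example).
        subprocess.run(
            ["pg_dump", "--format=custom", "--file", str(dump_file), "appdb"],
            check=True,
        )

        # 2. Copy it off the machine -- a backup that only lives on the host it
        #    protects isn't a backup.
        subprocess.run(["rsync", "-a", str(dump_file), OFFSITE], check=True)

    if __name__ == "__main__":
        main()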
Even "professionals" can be suspect sometimes.
Automated backups need automated backup restoration and testing. Otherwise, the backups might not be created properly, or they might be perfect backups that have some hidden error that will cause them to fail when they're put to use.
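Even a crude scheduled restore test beats trusting that the files exist. Something along these lines would do (the paths, database, and table names are made up for illustration):

    #!/usr/bin/env python3
    """Restore-test sketch: restore the newest dump into a scratch database and
    run a trivial sanity query. Paths, database and table names are illustrative."""
    import pathlib
    import subprocess

    BACKUP_DIR = pathlib.Path("/var/backups/db")
    SCRATCH_DB = "restore_test"

    def main() -> None:
        dumps = sorted(BACKUP_DIR.glob("db-*.dump"))
        if not dumps:
            raise SystemExit("no dumps found -- that alone is worth an alert")
        latest = dumps[-1]

        # Recreate a throwaway database and restore into it.
        subprocess.run(["dropdb", "--if-exists", SCRATCH_DB], check=True)
        subprocess.run(["createdb", SCRATCH_DB], check=True)
        subprocess.run(["pg_restore", "--dbname", SCRATCH_DB, str(latest)], check=True)

        # A minimal "does it actually contain data" check.
        out = subprocess.run(
            ["psql", "-At", SCRATCH_DB, "-c", "SELECT count(*) FROM users;"],
            check=True, capture_output=True, text=True,
        )
        if int(out.stdout.strip()) == 0:
            raise SystemExit("restore succeeded but the users table is empty")
        print(f"restore test OK for {latest.name}")

    if __name__ == "__main__":
        main()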
As an example, see Jeremiah Wilton's first-hand case study of Amazon's Oracle database outage in 1997: http://www.bluegecko.net/download/disaster-diary.pdf
Other than the one missed backup, the backup procedures were fine. An Oracle bug, triggered by a database format/schema change made weeks earlier, caused Oracle to refuse to start. TESTING the backups would have caught the error and allowed them to fix it before they took down their production database and hit the bug on the next attempt to start it.
Even if they do know how to safeguard the data, that doesn't mean that everything else is going to work properly.
I had recently taken over IT after working for six years as a developer. In fact, this happened only a month or so into my new role.
Our mail server died. Three out of four drives in the hardware RAID 10 failed. I'd been seeing bounces to root@localhost from root@localhost in the nightly reports, but the way things were configured made it nearly impossible to figure out where the mails were coming from. Thanks, Zimbra. We speculate that these were constant alerts from our RAID card notifying us of the impending disaster.
Oh, and the only backups for the mail store were on the machine itself, and in the local Thunderbird installs that half the company used instead of the Zimbra web interface. The machine was in a colo downtown, not local, and running backups over our pathetic little DSL connection was unmanageable.
Both of these things were known problems, both marked high priority, but both months away from being addressed when things went south.
This happened on a Friday. By Monday morning, I'd moved us over to a hosted service and manually sorted all of the mail that hit a catch-all mailbox on a VM I'd set up. By Tuesday, I'd audited every one of our other machines to make sure that mail to root was deliverable (it wasn't on about a dozen machines) and that every machine with hardware RAID had both local and remote monitoring.
Some people, including Directors and C-levels, lost up to ten years of mail. It was the worst IT disaster the company ever faced. But that's not the worst part. No, the worst part is that we're in the IT industry, and we knew the entire time that what we were doing was wrong... fixing it had just never been prioritized, because it was never seen as urgent.
That lesson has been learned.
EDIT: Too many to respond to below, so just editing in here. The author mentioned that the primary hard disk had failed over a year ago - but he didn't know about it (the host informed him of this... now?). That points to a RAID setup where the mirror was basically keeping things running all this while. That's what I'm talking about in this post.
It's true that a RAID failure may go unnoticed by a sysadmin for a year or more if they don't have proper checks set up for themselves.
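If it's Linux software RAID (mdadm), even a tiny scheduled check of /proc/mdstat will surface a degraded mirror; a hardware RAID card needs the controller vendor's CLI instead, but the idea is the same. A rough sketch (the alert address is a placeholder):

    #!/usr/bin/env python3
    """Degraded-array check for Linux software RAID (mdadm). A hardware RAID
    card would need the vendor's CLI instead; the idea is the same.
    The alert address is a placeholder."""
    import pathlib
    import re
    import smtplib
    from email.message import EmailMessage

    ALERT_TO = "ops@example.com"  # hypothetical address

    def main() -> None:
        mdstat = pathlib.Path("/proc/mdstat").read_text()
        # Healthy members show up as "U", missing ones as "_", e.g. "[U_]"
        # for a two-disk mirror that has lost a drive.
        statuses = re.findall(r"\[[U_]+\]", mdstat)
        if any("_" in s for s in statuses):
            msg = EmailMessage()
            msg["Subject"] = "RAID array degraded"
            msg["From"] = "raid-check@localhost"
            msg["To"] = ALERT_TO
            msg.set_content(mdstat)
            with smtplib.SMTP("localhost") as smtp:
                smtp.send_message(msg)

    if __name__ == "__main__":
        main()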
I guess the only thing that could've been done in this case was to have a backup cron job, or to use a provider that takes care of this stuff.
I cannot imagine any other practical situation in which a 'backup hard disk' would automatically kick in - apart from a RAID setup.
I know that budget hosts do not backup data off site, but they do tend to maintain their hardware, RAID arrays, etc.
However, I admit that I'm all too unfamiliar with shared hosting environments in this day and age; it's either cloud or a dedicated server for me - and for my budget dedicated servers, hosts have been pretty proactive about replacing bad hard drives.
From your blog post, I'd assume you don't have the knowledge to attempt recovery yourself, so call in an expert to handle the data recovery for you. At this stage, it is a matter of what the information is worth to you, compared to the cost of recovery. Almost any intervention is possible, for a price.
But.
1) All of this could have been solved with money, specifically money used to pay professionals. You got 30 THOUSAND signups and you didn't think of trying to get funding? I'm surprised VCs weren't pounding at your door. At the very least, that might even be enough for a bank loan from a savvy lender. Hell, with those numbers you could probably find a recently graduated ('tis the season) CS student willing to work for sweat equity alone. This is especially frustrating for me, as I currently have a startup that recently garnered a whopping 400 (count 'em!) hits on its signup page, and yet I still got emails from people trying to invest. Not nearly platinum tier, and thus far none have panned out, but still!
2) You claim to have worked in web design/development for a while, and you didn't hear about 1&1's horrific reputation? That's hard for me to believe. In fact, of any community, the PHP/JS crowd is probably most familiar with being burned by 1&1. (Not even going into the slimy overselling).
I hate to say it, but you should have known better. That said, I sincerely wish you the best of luck. You've succeeded pretty spectacularly thus far, and in the big scheme of things this is a pretty minor setback. Just keep shipping and you'll get it eventually.
Edit: I realize that it might seem foolish to some to go after funding when it's not needed, but I would argue that if you are making it up as you go along (not an indictment, it's how we learn) and you get these kinds of numbers, you should feel at least a little obligated to your users to secure your product. If that requires money that you don't have, get funding.
Logic board failures are common and replacements cheap (the cost of a new HD of the same model); data is often highly recoverable from soft failures. Mechanical failure is the worst case, but as long as the platter(s) are intact, it's not insurmountable.
That sounds more like RAID than a backup HD.
You probably expected something like RAID 5; well, you're right, it's not.
You know how you messed up? By not using something like AWS EC2 snapshots, or even S3 or Glacier. What is this trend of devs doing operations? As a sysadmin with a CompSci/dev background, it blows my mind constantly.
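To be concrete, a nightly EBS snapshot is only a few lines with boto3 -- something like this (the region and volume ID are placeholders):

    #!/usr/bin/env python3
    """Sketch: nightly EBS snapshot via boto3, run from cron or a scheduled job.
    The region and volume ID are placeholders."""
    import datetime
    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")

    def snapshot_volume(volume_id: str) -> str:
        desc = f"nightly-{datetime.date.today().isoformat()}"
        resp = ec2.create_snapshot(VolumeId=volume_id, Description=desc)
        return resp["SnapshotId"]

    if __name__ == "__main__":
        print(snapshot_volume("vol-0123456789abcdef0"))  # hypothetical volume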
Great, you know how to move around the CLI, but are you versed in how to maintain a proper and robust system?
Also, why weren't you using something like SES for your email alerts?
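Pushing alerts through SES instead of relying on local mail delivery is just as small -- roughly this (the addresses are placeholders and would need to be verified in SES first):

    #!/usr/bin/env python3
    """Sketch: alert mail through SES with boto3 instead of relying on local
    delivery to root. Addresses are placeholders and must be verified in SES."""
    import boto3

    ses = boto3.client("ses", region_name="us-east-1")

    def send_alert(subject: str, body: str) -> None:
        ses.send_email(
            Source="alerts@example.com",
            Destination={"ToAddresses": ["ops@example.com"]},
            Message={
                "Subject": {"Data": subject},
                "Body": {"Text": {"Data": body}},
            },
        )

    if __name__ == "__main__":
        send_alert("backup failed", "nightly dump did not complete")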
b. It takes a particular kind of personality to be good at sysadmin work. (And a lot of trial-and-error -- I just recently had to do an emergency server build due to a Debian update whoops, and I've been doing this stuff for a while.)
c. I usually recommend BackupPC (http://backuppc.sourceforge.net/) for easy set-it-and-forget-it backup infrastructure. It's compatible with everything, it will notify you if there are problems, it does pooling and de-duplication and compression, it's fast and reliable, and you can usually store months of backups on a small offsite server. I store 12 months of all hosted and customer data with it, and we've used it to meet other clients' needs too.
d. If you need affordable help, let me know. I'm way too cheap, and I do this stuff all day, every day. I opened a business specifically to address this kind of problem: you need something, but money is a problem.
That goes for anybody else too. If your lack of backups is keeping you awake at night, or if you've suddenly outgrown your infrastructure, or if looking at config files gives you an ulcer, get in touch with me. I'll help you out.
Hire a proper system administration company early to work with you on these types of things. There are many companies out there that do this. I happen to run a company that does this, so I know that you can add an expert admin to your team for $100-200/mo.
You're absolutely right though, for a company like OP's, if they are so short on cash, it makes a lot of sense to get someone in even if just for the week to address these types of fundamental problems.
And ongoing:
- 24x7 monitoring and response to outages
- Server patch management
- Ad-hoc system admin time, available on demand
Many more details, and capabilities, but you get the idea ;)
Hahaha.
I wish I had said: if you ignore the advertising, it's a great resource. If you apply it to proposed backup solutions, it's an effective way to find out whether they're viable.
I tend to think they're safer due to no moving parts.