You should never, ever provide an environment that stores people's hard work without having professionals who know how to safeguard it.
If it makes you feel any better, I recently had to clean up a mess in a huge enterprise IT shop (if I named the organization you would immediately know them), involving hundreds of thousands of man-hours of work lost due to a lazy, incompetent DBA and the clueless management above her.
This "DBA" was the kind of person who came in at 9:45AM, took a 2 hour lunch at noon, and left at 3:30. Did I mention she refused a work from home option?
She didn't know how to set up cron jobs, so all of her backup scripts had to be run manually. If she was on vacation, they didn't get run. Surprise, surprise: the DB died after her long pre-Christmas vacation. Zero backups for the first three weeks of December.
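For what it's worth, the automation she was missing isn't much work. Something roughly like this, run from cron, would have done it (the dump command, paths, and offsite target below are placeholders for illustration, not details from that shop):

    #!/usr/bin/env python3
    """Nightly backup sketch -- the point is that cron runs it, not a person.
    The dump command, paths, and offsite target are placeholders.

    Example crontab entry (02:30 every night):
        30 2 * * * /usr/local/bin/nightly_backup.py >> /var/log/nightly_backup.log 2>&1
    """
    import datetime
    import pathlib
    import subprocess

    BACKUP_DIR = pathlib.Path("/var/backups/db")            # local staging area
    OFFSITE = "backup@offsite.example.com:/srv/backups/db"  # hypothetical remote target

    def main() -> None:
        BACKUP_DIR.mkdir(parents=True, exist_ok=True)
        stamp = datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
        dump_file = BACKUP_DIR / f"db-{stamp}.dump"

        # 1. Dump the database (pg_dump used purely as an example).
        subprocess.run(
            ["pg_dump", "--format=custom", "--file", str(dump_file), "appdb"],
            check=True,
        )

        # 2. Copy it off the machine -- a backup that only lives on the host it
        #    protects isn't a backup.
        subprocess.run(["rsync", "-a", str(dump_file), OFFSITE], check=True)

    if __name__ == "__main__":
        main()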
Even "professionals" can be suspect sometimes.
Automated backups need automated backup restoration and testing. Otherwise, the backups might not be created properly, or they might be perfect backups that have some hidden error that will cause them to fail when they're put to use.
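Even a crude scheduled restore test beats trusting that the files exist. Something along these lines would do (the paths, database, and table names are made up for illustration):

    #!/usr/bin/env python3
    """Restore-test sketch: restore the newest dump into a scratch database and
    run a trivial sanity query. Paths, database and table names are illustrative."""
    import pathlib
    import subprocess

    BACKUP_DIR = pathlib.Path("/var/backups/db")
    SCRATCH_DB = "restore_test"

    def main() -> None:
        dumps = sorted(BACKUP_DIR.glob("db-*.dump"))
        if not dumps:
            raise SystemExit("no dumps found -- that alone is worth an alert")
        latest = dumps[-1]

        # Recreate a throwaway database and restore into it.
        subprocess.run(["dropdb", "--if-exists", SCRATCH_DB], check=True)
        subprocess.run(["createdb", SCRATCH_DB], check=True)
        subprocess.run(["pg_restore", "--dbname", SCRATCH_DB, str(latest)], check=True)

        # A minimal "does it actually contain data" check.
        out = subprocess.run(
            ["psql", "-At", SCRATCH_DB, "-c", "SELECT count(*) FROM users;"],
            check=True, capture_output=True, text=True,
        )
        if int(out.stdout.strip()) == 0:
            raise SystemExit("restore succeeded but the users table is empty")
        print(f"restore test OK for {latest.name}")

    if __name__ == "__main__":
        main()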
As an example, see Jeremiah Wilton's first-hand case study of Amazon's Oracle database outage in 1997: http://www.bluegecko.net/download/disaster-diary.pdf
Other than the one missed backup, the backup procedures were fine. An Oracle bug, triggered by a database format/schema change made weeks earlier, caused Oracle to refuse to start. TESTING the backups would have caught the error and allowed them to fix it before they took down their production database and hit the bug on the next attempt to start it.
Even if they do know how to safeguard the data, that doesn't mean that everything else is going to work properly.
I had recently taken over IT after working for six years as a developer. In fact, this happened only a month or so into my new role.
Our mail server died. Three out of four drives in the hardware RAID 10 failed. I'd been seeing bounces to root@localhost from root@localhost in the nightly reports, but the way things were configured made it nearly impossible to figure out where the mails were coming from. Thanks, Zimbra. We speculate that these were constant alerts from our RAID card notifying us of the impending disaster.
Oh, and the only backups for the mail store were on the machine itself, and in the local Thunderbird installs that half the company used instead of the Zimbra web interface. The machine was in a colo downtown, not local, and running backups over our pathetic little DSL connection was unmanageable.
Both of these things were known problems, both marked high priority, but both months away from being addressed when things went south.
This happened on a Friday. By Monday morning, I'd moved us over to a hosted service and manually sorted all of the mail that hit a catch-all mailbox on a VM I'd set up. By Tuesday, I'd audited every one of our other machines to make sure that mail to root was deliverable (it wasn't on about a dozen machines) and that every machine with hardware RAID had both local and remote monitoring.
Some people, including Directors and C-levels, lost up to ten years of mail. It was the worst IT disaster the company ever faced. But that's not the worst part. No, the worst part is that we're in the IT industry, and we knew the entire time that what we were doing was wrong... fixing it had just never been prioritized, because it was never seen as urgent.
That lesson has been learned.
EDIT: Too many to respond to below, so just editing in here. The author mentioned that the primary hard disk had failed over a year ago - but he didn't know about it (the host informed him of this... now?). That points to a RAID setup where the mirror was basically keeping things running all this while. That's what I'm talking about in this post.
It's true that a RAID failure may go unnoticed by a sysadmin for a year or more if they don't have proper checks set up for themselves.
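If it's Linux software RAID (mdadm), even a tiny scheduled check of /proc/mdstat will surface a degraded mirror; a hardware RAID card needs the controller vendor's CLI instead, but the idea is the same. A rough sketch (the alert address is a placeholder):

    #!/usr/bin/env python3
    """Degraded-array check for Linux software RAID (mdadm). A hardware RAID
    card would need the vendor's CLI instead; the idea is the same.
    The alert address is a placeholder."""
    import pathlib
    import re
    import smtplib
    from email.message import EmailMessage

    ALERT_TO = "ops@example.com"  # hypothetical address

    def main() -> None:
        mdstat = pathlib.Path("/proc/mdstat").read_text()
        # Healthy members show up as "U", missing ones as "_", e.g. "[U_]"
        # for a two-disk mirror that has lost a drive.
        statuses = re.findall(r"\[[U_]+\]", mdstat)
        if any("_" in s for s in statuses):
            msg = EmailMessage()
            msg["Subject"] = "RAID array degraded"
            msg["From"] = "raid-check@localhost"
            msg["To"] = ALERT_TO
            msg.set_content(mdstat)
            with smtplib.SMTP("localhost") as smtp:
                smtp.send_message(msg)

    if __name__ == "__main__":
        main()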
I guess the only thing that could've been done in this case was to have a backup cron job, or to use a provider that takes care of this stuff.
I cannot imagine any other practical situation in which a 'backup hard disk' would automatically kick in - apart from a RAID setup.
I know that budget hosts do not backup data off site, but they do tend to maintain their hardware, RAID arrays, etc.
However, I admit that I'm all too unfamiliar with shared hosting environments in this day and age; it's either cloud or a dedicated server for me - and for my budget dedicated servers, hosts have been pretty proactive about replacing bad hard drives.
From your blog post, I'd assume you don't have the knowledge to attempt recovery yourself, so call in an expert to handle the data recovery for you. At this stage, it is a matter of what the information is worth to you, compared to the cost of recovery. Almost any intervention is possible, for a price.
But.
1) All of this could have been solved with money, specifically money used to pay professionals. You got 30 THOUSAND signups and you didn't think of trying to get funding? I'm surprised VCs weren't pounding at your door. At the very least, that might even be enough for a bank loan from a savvy lender. Hell, with those numbers you could probably find a recently graduated ('tis the season) CS student willing to work for sweat equity alone. This is especially frustrating for me, as I currently have a startup that recently garnered a whopping 400 (count 'em!) hits on its signup page, and yet I still got emails from people trying to invest. Not nearly platinum tier, and thus far none have panned out, but still!
2) You claim to have worked in web design/development for a while, and you didn't hear about 1&1's horrific reputation? That's hard for me to believe. In fact, of any community, the PHP/JS crowd is probably most familiar with being burned by 1&1. (Not even going into the slimy overselling).
I hate to say it, but you should have known better. That said, I sincerely wish you the best of luck. You've succeeded pretty spectacularly thus far, and in the big scheme of things this is a pretty minor setback. Just keep shipping and you'll get it eventually.
Edit: I realize that it might seem foolish to some to go after funding when it's not needed, but I would argue that if you are making it up as you go along (not an indictment, it's how we learn) and you get these kinds of numbers, you should feel at least a little obligated to your users to secure your product. If that requires money that you don't have, get funding.
Logic board failures are common and replacements cheap (the cost of a new HD of the same model); data is often highly recoverable from soft failures. Mechanical failure is the worst case, but as long as the platter(s) are intact, it's not insurmountable.
That sounds more like RAID than a backup HD.
You probably expected something like RAID 5; well, you're right, it's not.
You know how you messed up? By not using something like AWS EC2 snapshots, or even S3 or Glacier. What is this trend of devs doing operations? As a sysadmin with a CompSci/dev background, it blows my mind constantly.
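To be concrete, a nightly EBS snapshot is only a few lines with boto3 -- something like this (the region and volume ID are placeholders):

    #!/usr/bin/env python3
    """Sketch: nightly EBS snapshot via boto3, run from cron or a scheduled job.
    The region and volume ID are placeholders."""
    import datetime
    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")

    def snapshot_volume(volume_id: str) -> str:
        desc = f"nightly-{datetime.date.today().isoformat()}"
        resp = ec2.create_snapshot(VolumeId=volume_id, Description=desc)
        return resp["SnapshotId"]

    if __name__ == "__main__":
        print(snapshot_volume("vol-0123456789abcdef0"))  # hypothetical volume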
Great, you know how to move around the CLI, but are you versed in how to maintain a proper and robust system?
Also, why weren't you using something like SES for your email alerts?
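Pushing alerts through SES instead of relying on local mail delivery is just as small -- roughly this (the addresses are placeholders and would need to be verified in SES first):

    #!/usr/bin/env python3
    """Sketch: alert mail through SES with boto3 instead of relying on local
    delivery to root. Addresses are placeholders and must be verified in SES."""
    import boto3

    ses = boto3.client("ses", region_name="us-east-1")

    def send_alert(subject: str, body: str) -> None:
        ses.send_email(
            Source="alerts@example.com",
            Destination={"ToAddresses": ["ops@example.com"]},
            Message={
                "Subject": {"Data": subject},
                "Body": {"Text": {"Data": body}},
            },
        )

    if __name__ == "__main__":
        send_alert("backup failed", "nightly dump did not complete")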
b. It takes a particular kind of personality to be good at sysadmin work. (And a lot of trial-and-error -- I just recently had to do an emergency server build due to a Debian update whoops, and I've been doing this stuff for a while.)
c. I usually recommend BackupPC (http://backuppc.sourceforge.net/) for easy set-it-and-forget-it backup infrastructure. It's compatible with everything, it will notify you if there are problems, it does pooling and de-duplication and compression, it's fast and reliable, and you can usually store months of backups on a small offsite server. I store 12 months of all hosted and customer data with it, and we've used it to meet other clients' needs too.
d. If you need affordable help, let me know. I'm way too cheap, and I do this stuff all day, every day. I opened a business specifically to address this kind of problem: you need something, but money is a problem.
That goes for anybody else too. If your lack of backups is keeping you awake at night, or if you've suddenly outgrown your infrastructure, or if looking at config files gives you an ulcer, get in touch with me. I'll help you out.
Hire a proper system administration company early to work with you on these types of things. There are many companies out there that do this. I happen to run a company that does this, so I know that you can add an expert admin to your team for $100-200/mo.
You're absolutely right though, for a company like OP's, if they are so short on cash, it makes a lot of sense to get someone in even if just for the week to address these types of fundamental problems.
And ongoing:
- 24x7 monitoring and response to outages
- Server patch management
- Ad-hoc system admin time, available on demand
Many more details, and capabilities, but you get the idea ;)
Hahaha.
I wish I had said: if you ignore the advertising, it's a great resource. If you apply it to proposed backup solutions, it's an effective way to find out whether they're viable.
I tend to think they're safer due to no moving parts.