Right now all of our sites are failing with 503 errors. Our store is down and when one of our employees went to take a phone order they got a "Welcome to your new app" message.
I've been a big evangelist of Heroku since we migrated over last year, but I'm getting deeply concerned about the elevated error rate, because every minute of downtime is costing us money.
At some point they're exposing themselves to serious risk. Rackspace had to pay out ~$3MM (in free service credits) after an outage in 2009:
http://www.networkworld.com/news/2009/070609-rackspace-outag...
Disclaimer: I'm not affiliated with Heroku and I don't use their service.
That's silly, and also not how it works, at all. You're paying PaaS/IaaS companies so that it's their headache, not yours. Once it becomes your headache, they are no longer doing their job, and you are no longer receiving the value you are paying for. You don't debit the cost of their downtime from what it would've cost to build your own infrastructure; you debit the cost of their downtime from your business's revenue and reputation.
Whether or not you could do it better yourself does not excuse the downtime one bit.
So, "could we do better"? I'm not sure. I'm trying to figure that out. It certainly would not be as easy to use or as easy to deploy as Heroku. But at a minimum I need to get some other hosting option set up that we can switch over to.
Custom error pages for these kinds of errors would be very useful.
They did go a long way in a short period of time. Winter 2008 feels so close.
Heroku had a 1-2 hour outage the week after we switched an app there last year. My boss was freaking out, cursing about how they were unreliable, etc, neglecting the following:
1. The timing was unfortunate, but that was the first outage in months.
2. We had had multiple outages on our Rackspace box that were our own fault, due to bad server management.
In the long term you're likely better off on Heroku, for small companies at least.
Internal examples:
If Shadowcat's public-facing website is down for a day, a few people can't read blog posts and maybe we'll miss out on a potential customer - but our existing customers will be entirely unaffected.
If our ticket tracking system is down for a day, it'll annoy the hell out of the existing customers but we can still get the work done since they all have direct email and IM contact info for people.
On the other hand if our ircd is down for an hour, it's time to panic, because that massively interrupts our ability to co-ordinate our work.
External examples:
If LinkedIn is down for a day, I don't care - anything I do on that can wait until tomorrow.
If DuckDuckGo is down for a day, I am going to burst into tears because I use it all the time for information I want -now- and going via Google is substantially more annoying.
So "anything that matters" is really quite relative.
I just did the calculation. That's about a day of downtime. I'd say it's bad if:
- The downtime is scattered all over the year. 1 hour downtime here, 30 min downtime there.
But not if:
- This 1 day of downtime is scheduled, e.g. during the holidays. "Scheduled" and "planned" are the keywords. If the client is informed and aware of it, the client will also remain happy.
You'd be surprised how much downtime clients are willing to put up with, as long as they are informed well ahead of time.
Even in places like medicine or finance or security. Stuff breaks, things fail. It's sad, but the reality is there.
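For anyone who wants to redo that kind of calculation, here is a rough sketch in Python. It assumes a 365-day year, and the availability percentages in the loop are illustrative assumptions, not figures quoted in this thread. 99.7%, for example, works out to roughly 26 hours, which is about a day.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def downtime_hours_per_year(availability_percent: float) -> float:
    """Return the hours of downtime per year implied by an availability percentage."""
    return (1 - availability_percent / 100) * HOURS_PER_YEAR


# Illustrative availability levels (assumed, not taken from the thread).
for pct in (99.0, 99.7, 99.9, 99.99):
    hours = downtime_hours_per_year(pct)
    print(f"{pct}% uptime -> {hours:.1f} hours/year ({hours / 24:.2f} days)")
```

Note that this only gives the yearly total. It says nothing about whether the downtime comes as one planned window or as many scattered incidents, which is the distinction that matters above.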
Heroku, on the other hand, feels like it's up and down more than... something that goes up and down a lot. A friend of mine hosts his blog there. He launched a small product today and kept sending his customers to an error page, because Heroku was up, down, up, down, up, down.
If it's a misconfiguration of your own, you can get it fixed. But if your hosting provider has an unsound business, you can't fix that except by leaving.
Downtime always sucks, but you've gotta give them credit for the way they kept everyone in the loop and provided status updates along the way.
I have 5-minute watchdogs on all 3 of my sites in production with Heroku, and none of them pinged me. Given that I know the watchdogs work (regular testing and previous incidents), I would have to conclude that not everyone was affected.
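For readers who haven't set one up, a watchdog like this can be very small. What follows is a minimal sketch using only the Python standard library, not the setup described above: the function names, the `print`-based alert, and the example URL in the usage note are all assumptions for illustration.

```python
import time
import urllib.error
import urllib.request


def is_up(url, timeout=10, opener=urllib.request.urlopen):
    """Return True if the URL answers with a 2xx/3xx status.

    urlopen raises HTTPError (a URLError subclass) for 4xx/5xx responses,
    so a 503 ends up in the except branch and counts as down.
    """
    try:
        with opener(url, timeout=timeout) as resp:
            return 200 <= resp.status < 400
    except (urllib.error.URLError, OSError):
        return False


def watch(urls, interval_seconds=300, rounds=None,
          check=is_up, sleep=time.sleep, alert=print):
    """Check every URL once per interval (default 5 minutes).

    rounds=None loops forever; pass a number to stop after that many passes.
    alert is a hypothetical hook; swap print for email, SMS, and so on.
    """
    done = 0
    while rounds is None or done < rounds:
        for url in urls:
            if not check(url):
                alert(f"DOWN: {url}")
        done += 1
        if rounds is None or done < rounds:
            sleep(interval_seconds)
```

Calling `watch(["https://example.com/"])` would poll that placeholder URL every five minutes. A status check alone would not catch a default placeholder page served with a 200, so a real watchdog might also look for an expected string in the response body.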
Our app has a cyclic usage pattern and all is quiet right now. So rather than freaking out about it, I'll just let someone at Heroku figure it all out.
It would suck if it happened during our busy period, but then again I could say "We're working on it." and just assume the Heroku team will fix things faster than I ever could have with my limited *nix admin skills.
I get that you're saying your users don't care/didn't notice, but I'm clearly missing something because if I had an app on Heroku, I'd be a little nervous. When the cyclic nature of your app swings back around and it's in regular use again, this kind of outage might not be so magical.
Users surely noticed, but Heroku definitely noticed before my users did. They're quietly working on a solution and I can quietly go about my day. If my users start complaining, I'll have time to talk to them; time I wouldn't have if I was neck deep in log spew.
Having run apps on my own servers before, I know what a pain in the ass it is to deal with downtime yourself. I'm not particularly good at it, so I appreciate having experts take care of it for me.
Initial laziness now adds up.
Heroku so far has not had major outages.
And they will be learning from the current ones.