Maybe it's better to go with cheaper services that fail more often, thus keeping customers in practice at dealing with failures.
Especially with smaller applications, you might be able to beat the provider's time-to-fix, and you never know when being able to do that might be critical.
My book also contains a Bash script that configures a PostgreSQL cluster for you in a few minutes, with or without attached storage, with self-signed SSL, SELinux, and more. Great for simple apps and as a starting point for learning production PostgreSQL.
[0] https://deploymentfromscratch.com/ [1] https://gist.github.com/strzibny/4f38345317a4d0866a35ede5aba...
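As a taste of one step such a script automates: generating the self-signed certificate for the PostgreSQL server is a one-liner (the filenames and the CN "db.example.internal" here are placeholders, not from the actual script):

```shell
# Create a self-signed cert/key pair for the PostgreSQL server
openssl req -x509 -newkey rsa:2048 -nodes -days 365 \
  -keyout server.key -out server.crt -subj "/CN=db.example.internal"
chmod 600 server.key  # PostgreSQL refuses to start with a world-readable key
```

After that, postgresql.conf just needs `ssl = on`, `ssl_cert_file = 'server.crt'`, and `ssl_key_file = 'server.key'`.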
This is so true, you have no idea. Several years ago I was working at a Linode customer on the Christmas Eve when Linode started getting DDoSed, an attack that went on for several days.
We had been working for weeks before then to multi-host our applications just to be prepared for outages and suddenly all of that work paid off.
We already had all of our data ready at another provider and the infrastructure hot, so it was just a matter of flipping some configs and waiting for DNS propagation. I still ended up working 20 hours that day, just monitoring everything and calming people down, but the alternative would have been working straight through New Year's.
For example, software could assume that files get corrupted while sitting on disk, and work around that. But it turned out to be easier to build self-healing redundancy checks into the lowest possible layer, the hard drives themselves, and let everything above assume the data is clean.
Another thing I've heard of: when they make radiation-resistant CPUs for space, instead of making the CPU robust to miscalculations, it's easier to shield it as much as possible and use larger process nodes (110nm+). Of course, they also add all kinds of checks in the software as well, because they do real engineering.
We have the majority of our client apps hosted with them, but most don't require 24/7 availability. This is still concerning though, and we do have one high-availability app hosted on them now that we're trying to plan contingencies for.
Open to any suggestions for alternatives! Ideally I'd keep things on Heroku, but it would be nice to have failsafes that could be activated relatively quickly in the event of similar issues.
For managed databases with replication however, Dokku still leaves much to be desired...
AWS even have documents telling people how to achieve exactly this! https://docs.aws.amazon.com/wellarchitected/latest/reliabili...
Why don't "premium" service providers like Heroku, etc., do this?
I wonder how many services really had five-nines availability pre-cloud either. Somehow I feel your view of it as "industry standard" might be slightly rose-tinted.
The only people who suffer consequences are the staff forced to work overtime performing SEV0 RED ALERT theater. They will work through nights and weekends while the responsible parties tut-tut and "manage" by reading updates they can collate into the post-crisis report. After that, everyone participates in the joy of emergency meetings to discuss said report, which will be entirely worthless when a completely different part of the system fails the next time. A more reliable HA solution will be worked up by the engineers, finance will estimate implementation costs, and it will be turned down by an executive on the 8th hole green, because they don't care about anything except improving profitability so they can hand themselves a bonus.
Not that I'm bitter or anything.
AWS/Azure/whoever "promise" five nines of uptime. Something goes wrong, you don't get five nines, and what do you get?
A system that went down for 4 hours and a $50 rebate on your next bill!
Their "dynos" are ephemeral. They could literally deploy the images to a backup environment hosted elsewhere. Their data services could all be synchronously replicated to that backup environment. And thats it - they dont offer any other core services (and their other services run on the same platform.)
So for (at most) double their infrastructure cost they have another network they can immediately switch over to.
And Heroku is already so expensive. Even if you used a 1-to-1 mapping of EC2 instances to Heroku dynos (which they don't - it's multiple dynos per backing instance), you'd be looking at a 5-10x markup using on-demand instances! Reserved instances are cheaper still, and spot instances can be 5x cheaper again!
I think they could retain their current pricing model and still offer this kind of resiliency - at a minimum.
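For the data side, Postgres supports this kind of setup natively; a minimal sketch of the primary's config (the standby name "backup1" is made up):

```shell
# postgresql.conf on the primary: block commit acknowledgement until the
# standby named "backup1" has confirmed receipt of the WAL
synchronous_commit = on
synchronous_standby_names = 'backup1'
```

The standby then just streams WAL from the primary, so a failover is promoting it and repointing connections.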
Fly.io is making strides in this direction, distributing the VMs across multiple availability zones and routing traffic internally from their multiple geographically distributed POPs - but you need to roll your own DB VMs for multi-AZ synchronization.
EDIT: seems they do provide managed Postgres with synchronous replication now (in beta), neat!
E.g. my electricity provider doesn't.
They ONLY offer fully managed services, which could be backed by the multi-cloud, multi-AZ setup I refer to - but instead a single product outage from a single upstream provider in a single datacenter affects all their clients.
This is a regular occurrence for Heroku - and they charge a substantial premium for their "service".
It's S3 that is 5-nines availability.
AWS's published SLA for Compute (which includes EBS) is 4-nines.
One AZ going down is not covered by the 99.99% SLA. AFAIK there isn't any per-AZ SLA, only a single-instance SLA of 99.5%. The effective per-AZ SLA is going to be somewhere between the two.
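For scale, here's the downtime each of these figures allows per 30-day month:

```shell
# Convert an SLA percentage into allowed downtime per 30-day month
for sla in 99.999 99.99 99.5; do
  awk -v s="$sla" 'BEGIN { printf "%s%% -> %.1f min/month\n", s, (100 - s) / 100 * 30 * 24 * 60 }'
done
# 99.999% -> 0.4 min/month
# 99.99% -> 4.3 min/month
# 99.5% -> 216.0 min/month
```

So the effective per-AZ budget sits somewhere between about four minutes and about three and a half hours a month.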
Easy setup + MIT license and you get the same git push deploys, Heroku-compatible buildpacks or bring your own Dockerfile.
I would recommend running it on Linode / DigitalOcean / Vultr.
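For reference, the usual Dokku deploy flow looks roughly like this (the hostname and app name are made up):

```shell
# On the server (one-time): create the app
dokku apps:create myapp

# From your local checkout: add the server as a git remote, then push to deploy
git remote add dokku dokku@my-server.example.com:myapp
git push dokku main
```

The push triggers a buildpack (or Dockerfile) build on the server and swaps in the new container, much like `git push heroku main` does.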