Maybe it's better to go with cheaper services that fail more often, thus keeping customers in practice at dealing with failures.
Especially with smaller applications, you might be able to beat the provider's time-to-fix, and you never know when being able to do that might be critical.
My book also contains a Bash script that configures a PostgreSQL cluster for you in a few minutes, with or without attached storage, with self-signed SSL, SELinux, and more. Great for simple apps and as a starting point for learning production PostgreSQL.
[0] https://deploymentfromscratch.com/ [1] https://gist.github.com/strzibny/4f38345317a4d0866a35ede5aba...
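As a taste of one step such a script automates: generating the self-signed certificate for the PostgreSQL server is a one-liner (the filenames and the CN "db.example.internal" here are placeholders, not from the actual script):

```shell
# Create a self-signed cert/key pair for the PostgreSQL server
openssl req -x509 -newkey rsa:2048 -nodes -days 365 \
  -keyout server.key -out server.crt -subj "/CN=db.example.internal"
chmod 600 server.key  # PostgreSQL refuses to start with a world-readable key
```

After that, postgresql.conf just needs `ssl = on`, `ssl_cert_file = 'server.crt'`, and `ssl_key_file = 'server.key'`.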
This is so true, you have no idea. Several years ago I was working at a Linode customer on the Christmas Eve when Linode started getting DDoSed, an attack that went on for several days.
We had been working for weeks before then to multi-host our applications just to be prepared for outages and suddenly all of that work paid off.
We already had all of our data ready at another provider and the infrastructure hot, so it was just a matter of flipping some configs and waiting for DNS propagation. I still ended up working 20 hours that day, just monitoring everything and calming people down, but the alternative would have been working straight through New Year's.
For example, software could assume that files get corrupted while sitting on disk, and work around that. But it turned out to be easier to build self-healing redundancy checks into the lowest possible layer, the hard drives themselves, and let everything above assume the data is clean.
Another thing I've heard of: when they make radiation-resistant CPUs for space, instead of making the CPU robust to miscalculations, it's easier to shield it as much as possible and use larger process nodes (110nm+). Of course, they also add all kinds of checks in the software as well, because they do real engineering.
We have the majority of our client apps hosted with them, but most don't require 24/7 availability. This is still concerning though, and we do have one high-availability app hosted on them now that we're trying to plan contingencies for.
Open to any suggestions for alternatives! Ideally I'd keep things on Heroku, but it would be nice to have failsafes that could be activated relatively quickly in the event of similar issues.
For managed databases with replication however, Dokku still leaves much to be desired...
AWS even have documents telling people how to achieve exactly this! https://docs.aws.amazon.com/wellarchitected/latest/reliabili...
Why don't "premium" service providers like Heroku, etc., do this?
I wonder how many services really had five-nines availability pre-cloud either. Somehow I feel your view of it as "industry standard" might be slightly rose-tinted.
The only people who suffer consequences are the staff forced to work overtime performing SEV0 RED ALERT theater. They will work through nights and weekends while the responsible parties tut-tut and "manage" by reading updates they can collate into the post-crisis report. After that, everyone participates in the joy of emergency meetings to discuss said report, which will be entirely worthless when a completely different part of the system fails the next time. A more reliable HA solution will be worked up by the engineers, finance will estimate implementation costs, and it will be turned down by an executive on the 8th hole green, because they don't care about anything except improving profitability so they can hand themselves a bonus.
Not that I'm bitter or anything.
AWS/Azure/whoever "promise" five nines of uptime. Something goes wrong, you don't get five nines, and what do you get?
A system that went down for 4 hours and a $50 rebate on your next bill!
Their "dynos" are ephemeral. They could literally deploy the images to a backup environment hosted elsewhere. Their data services could all be synchronously replicated to that backup environment. And thats it - they dont offer any other core services (and their other services run on the same platform.)
So for (at most) double their infrastructure cost they have another network they can immediately switch over to.
And Heroku is already so expensive. Even if you used a 1-to-1 mapping of EC2 instances to Heroku dynos (which they don't - it's multiple dynos per backing instance), you'd be looking at a 5-10x markup using on-demand instances! Reserved instances are cheaper still, and spot instances can be 5x cheaper again!
I think they could retain their current pricing model and still offer this kind of resiliency - at a minimum.
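For the data side, Postgres supports this kind of setup natively; a minimal sketch of the primary's config (the standby name "backup1" is made up):

```shell
# postgresql.conf on the primary: block commit acknowledgement until the
# standby named "backup1" has confirmed receipt of the WAL
synchronous_commit = on
synchronous_standby_names = 'backup1'
```

The standby then just streams WAL from the primary, so a failover is promoting it and repointing connections.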
Fly.io is making strides in this direction, distributing the VMs across multiple availability zones and routing traffic internally from their multiple geographically distributed POPs - but you need to roll your own DB VMs for multi-AZ synchronization.
EDIT: seems they do provide managed Postgres with synchronous replication now (in beta), neat!
E.g. my electricity provider doesn't.
They ONLY offer fully managed services, which could be backed by the multi-cloud, multi-AZ setup I refer to - but instead a single product outage from a single upstream provider in a single datacenter affects all their clients.
This is a regular occurrence for Heroku - and they charge a substantial premium for their "service".
It's S3 that is 5-nines availability.
AWS's published SLA for Compute (which includes EBS) is 4-nines.
One AZ going down is not covered by the 99.99% SLA. AFAIK there isn't any per-AZ SLA, only a single-instance SLA of 99.5%. The effective per-AZ SLA is going to be somewhere between the two.
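For scale, here's the downtime each of these figures allows per 30-day month:

```shell
# Convert an SLA percentage into allowed downtime per 30-day month
for sla in 99.999 99.99 99.5; do
  awk -v s="$sla" 'BEGIN { printf "%s%% -> %.1f min/month\n", s, (100 - s) / 100 * 30 * 24 * 60 }'
done
# 99.999% -> 0.4 min/month
# 99.99% -> 4.3 min/month
# 99.5% -> 216.0 min/month
```

So the effective per-AZ budget sits somewhere between about four minutes and about three and a half hours a month.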
Easy setup + MIT license and you get the same git push deploys, Heroku-compatible buildpacks or bring your own Dockerfile.
I would recommend running it on Linode / DigitalOcean / Vultr.
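For reference, the usual Dokku deploy flow looks roughly like this (the hostname and app name are made up):

```shell
# On the server (one-time): create the app
dokku apps:create myapp

# From your local checkout: add the server as a git remote, then push to deploy
git remote add dokku dokku@my-server.example.com:myapp
git push dokku main
```

The push triggers a buildpack (or Dockerfile) build on the server and swaps in the new container, much like `git push heroku main` does.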