If you're down for 5 minutes a year because one of your employees broke something, that's your fault, and the blame passes down through the CTO.
If you're down for 5 hours a year, but the outage affected other companies too, it's not your fault.
From AWS to CrowdStrike, system resilience and uptime aren't the goal. Risk mitigation isn't the goal. Affordability isn't the goal.
When the CEO's buddies all suffer at the same time as he does, it's just an "act of god" and nothing can be done; the outcome is so complex that even the amazing boffins at AWS/Google/Microsoft/Cloudflare/etc. can't cope.
If the CEO is down at a different time than his buddies, then it's Dave/Charlie/Bertie/Alice who can't cope, and it's the CTO's fault for not outsourcing it.
As someone who likes to see things working, it pisses me off no end, but it's the way of the world, and likely has been wherever the owner and the CTO are separate people.
After that process comes the BS and PR step, where reality is spun into cotton candy that makes the leader look good no matter what.
Yes.
What is important is having a contractual SLA that is defensible. Acts of God are defensible, and now major cloud infrastructure outages are too.
* Give the computers a rest, they probably need it. Heck, maybe the Internet should just shut down in the evening so everyone can go to bed (ignoring those pesky timezone differences)
* Free chaos engineering at cloud-provider-region scale, except you didn't opt in to this one or know about it in advance, making it extra effective
* Quickly map out which of the things you use depend on a single AWS region with no capability to change or re-route (rough sketch after this list)
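Not a real tool, just a minimal sketch of that mapping step, assuming a hypothetical list of dependency hostnames: it resolves each one and matches the IPs against AWS's published ip-ranges.json to flag single-region front doors. Caveat: this only sees the public endpoint, not whichever region a vendor's backend actually lives in.

```python
import ipaddress
import json
import socket
import urllib.request

# Hypothetical list -- substitute the external services you actually depend on.
DEPENDENCIES = ["api.example-vendor.com", "webhooks.example-saas.io"]

# AWS publishes its IP ranges, tagged by region, at this URL.
with urllib.request.urlopen("https://ip-ranges.amazonaws.com/ip-ranges.json") as resp:
    PREFIXES = json.load(resp)["prefixes"]

def aws_regions_for(host: str) -> set[str]:
    """Resolve a hostname and return the AWS regions its IPv4 addresses fall into."""
    regions = set()
    for info in socket.getaddrinfo(host, 443, socket.AF_INET, socket.SOCK_STREAM):
        ip = ipaddress.ip_address(info[4][0])
        for p in PREFIXES:
            if ip in ipaddress.ip_network(p["ip_prefix"]):
                regions.add(p["region"])
    return regions

for host in DEPENDENCIES:
    regions = aws_regions_for(host)
    if len(regions) == 1:
        print(f"{host}: single-region dependency on {regions.pop()}")
    elif regions:
        print(f"{host}: resolves into {sorted(regions)}")
    else:
        print(f"{host}: not on AWS (or at least not in ip-ranges.json)")
```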
AWS outages: almost never happen, and you should have been more prepared for when they do.
If you say it's Microsoft, then it's just unavoidable.
But there are some people on Reddit who think we're all wrong, yet won't say anything more. So... whatever.
Nothing in the outage history really stands out as "this is the first time we tried this and oops" except for us-east-1.
It's always possible for things to succeed at a smaller scale and fail at full scale, but again none of them really stand out as that to me. Or at least, not any in the last ten years. I'm allowing that anything older than that is on the far side of substantial process changes and isn't representative anymore.
Still, it would make a bit of sense, if you can find a place in your code where crossing a region hurts less, to move some of your services to a different region.
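If you go that route, the failover itself can stay fairly mechanical. A minimal sketch, assuming the data is already replicated into a second region (e.g. via S3 cross-region replication; the bucket names and regions here are hypothetical):

```python
import boto3
from botocore.exceptions import BotoCoreError, ClientError

# Hypothetical setup: the same objects replicated into per-region buckets.
# Primary region first; the fallback only helps if replication already ran.
REPLICAS = [
    ("us-east-1", "myapp-data-use1"),
    ("eu-west-1", "myapp-data-euw1"),
]

def fetch_with_fallback(key: str) -> bytes:
    """Try each region in order, returning the first successful read."""
    last_err = None
    for region, bucket in REPLICAS:
        try:
            s3 = boto3.client("s3", region_name=region)
            return s3.get_object(Bucket=bucket, Key=key)["Body"].read()
        except (BotoCoreError, ClientError) as err:
            last_err = err  # region unreachable or request failed; try the next one
    raise RuntimeError(f"all regions failed for {key!r}") from last_err
```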
While your business partners will understand that you’re down while they’re down, will your customers? You called yesterday to say their order was ready, and now they can’t pick it up?
Turns out the default URL was hardcoded to use the us-east interface; just going to WorkSpaces and editing the URL to point at the local region got everyone working again.
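That's a whole class of bug worth grepping for. A hypothetical illustration (the hostnames are placeholders, not the real WorkSpaces URLs):

```python
import os

# Brittle: every deployment worldwide funnels through one region's interface.
HARDCODED_URL = "https://workspaces.us-east-1.example-console.com/login"

# Less brittle: derive the URL from the region the deployment actually runs in.
def console_url(region: str | None = None) -> str:
    region = region or os.environ.get("AWS_REGION", "us-east-1")
    return f"https://workspaces.{region}.example-console.com/login"
```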
Unless you mean nothing is working for you at the moment.