Better HN

0 points JCM9 7mo ago 0 comments

Have a meeting today with our AWS account team about how we’re no longer going to be “All in on AWS” as we diversify workloads away. Was mostly about the pace of innovation on core services slowing and AWS being too far behind on AI services so we’re buying those from elsewhere.

The AWS team keeps touting the rock solid reliability of AWS as a reason why we shouldn’t diversify our cloud. Should be a fun meeting!

radium3d 7mo ago

Say you've had an outage on AWS, Cloudflare, Google Cloud, or Akismet. What are you going to do? Host in house? None of them seems to be immune to an outage at some point. Get your refund and carry on. It's less work for the same outcome.

CobrastanJorji 7mo ago

Multi-cloud. It's fairly unlikely that AWS and Google Cloud are going to fail at the same time.
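To put rough numbers on that intuition, here is a minimal Python sketch. It assumes the two providers fail independently, which is optimistic given how many shared dependencies come up elsewhere in this thread. The availability figures are made-up placeholders, not measurements of any real provider, and the function names are hypothetical.

```python
HOURS_PER_YEAR = 24 * 365


def combined_availability(*availabilities: float) -> float:
    """Probability that at least one provider is up, assuming independent failures."""
    all_down = 1.0
    for availability in availabilities:
        if not 0.0 <= availability <= 1.0:
            raise ValueError("availability must be between 0 and 1")
        # Multiply the individual probabilities of being down.
        all_down *= 1.0 - availability
    return 1.0 - all_down


def expected_downtime_hours(availability: float) -> float:
    """Expected hours of downtime per year for a given availability."""
    return (1.0 - availability) * HOURS_PER_YEAR


# Hypothetical availabilities for two providers (placeholders, not real data).
single = 0.999
both = combined_availability(0.999, 0.999)

print(f"one provider:  {expected_downtime_hours(single):.2f} h/year down")
print(f"two providers: {expected_downtime_hours(both):.4f} h/year down")
```

The gap between the two printed figures is the best case. Any dependency that both clouds share (DNS, an identity provider, a deploy pipeline) breaks the independence assumption and narrows it.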

radium3d 7mo ago

Yeah, just double++ the cost to have a clone of all your systems. Worth it if you need to guarantee uptime. Although it also doubles your exposure to potential data breaches.

jancsika 7mo ago

> double++

I'd suggest to ++double the cost. Compare:

++double: spoken as "triple" -> team says that double++ was a joke, we can obviously only double the cost -> embarrassingly you quickly agree -> team laughs -> team approves doubling -> you double the cost -> team goes out for beers -> everyone is happy

double++: spoken as "double" -> team quickly agrees and signs off -> you consequently triple the cost per C precedence rules -> manager goes ballistic -> you blithely recount the history of C precedence in a long monotone style -> job returns EINVAL -> beers = 0

dexterdog 7mo ago

And likely far more than double the cost since you have to use the criminally-priced outbound bandwidth to keep everything in sync.
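A back-of-the-envelope sketch of that sync cost, in Python. The per-GB price and the data volumes below are hypothetical placeholders, so check your provider's current pricing for real rates. It also ignores tiered pricing, compression, and per-request charges.

```python
def initial_copy_cost(dataset_gb: float, price_per_gb: float) -> float:
    """One-off egress cost to seed the clone in the second cloud."""
    return dataset_gb * price_per_gb


def monthly_sync_cost(changed_gb_per_day: float, price_per_gb: float, days: int = 30) -> float:
    """Recurring egress cost to ship daily changes to the second cloud."""
    return changed_gb_per_day * price_per_gb * days


# Hypothetical inputs: a flat egress price, a 50 TB dataset, 500 GB of daily churn.
PRICE_PER_GB_USD = 0.09
DATASET_GB = 50_000
CHANGED_GB_PER_DAY = 500

print(f"initial copy: ${initial_copy_cost(DATASET_GB, PRICE_PER_GB_USD):,.2f}")
print(f"monthly sync: ${monthly_sync_cost(CHANGED_GB_PER_DAY, PRICE_PER_GB_USD):,.2f}")
```

Plugging in your own churn rate is the useful part: the recurring term scales with how much data changes, not with how much you store.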

unethical_ban 7mo ago

Shouldn't be double in the long term. Think of the second cloud as a cold standby. Depends on the system. Periodic replication of the data layer (object storage/database) and CI/CD configured to be able to build services and VMs on multiple clouds. Have automated tests weekly/monthly that represent end-to-end functionality, and scaled tests semi-annually.

This is all very, very hand-wavy. And if one says "golly gee, all our config is too cloud-specific to do multi-cloud" then you've figured out why cloud blows and that there is no inherent reason not to have API standards for certain mature cloud services like serverless functions, VMs and networks.

Edit to add: I know how grossly simplified this is, and that most places have massively complex systems.
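One small piece of a cold-standby setup like this that is easy to automate is checking that the periodic replication has actually run recently. Below is a minimal Python sketch under stated assumptions: the function name, the staleness budget, and the timestamps are all hypothetical, and in practice the last-replication time would come from whatever your replication job records.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def replica_is_fresh(
    last_replicated_at: datetime,
    max_staleness: timedelta,
    now: Optional[datetime] = None,
) -> bool:
    """Return True if the last successful replication is within the staleness budget."""
    if now is None:
        now = datetime.now(timezone.utc)
    return now - last_replicated_at <= max_staleness


# Fixed, hypothetical timestamps so the result does not depend on the wall clock.
checked_at = datetime(2025, 1, 8, 12, 0, tzinfo=timezone.utc)
last_sync = datetime(2025, 1, 8, 6, 0, tzinfo=timezone.utc)

print(replica_is_fresh(last_sync, timedelta(hours=24), now=checked_at))
print(replica_is_fresh(last_sync, timedelta(hours=1), now=checked_at))
```

A check like this could run alongside the weekly or monthly end-to-end tests, so a silently broken replication job is noticed before the standby is needed.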

jimbob45 7mo ago

And data egress fees just to get the clone set up, right? This doesn’t seem feasible as a macrostrategy. Maybe for a small number of critical services.

yeswecatan 7mo ago

How do you handle replication lag for databases?

zacmps 7mo ago

If you use something like CockroachDB you can have a multi-master cluster and use regional-by-row tables to locate data close to users. It'll fail over fine to other regions if needed.

Breza 7mo ago

Why not host in house? If you have an application with stable resource needs, it can often be the cheaper and more stable option. At a certain scale, you can buy the servers, hire a sysadmin, and still spend less money than relying on AWS.

If you have an app that experiences 1000x demand spikes at unpredictable times then sure, go with the cloud. But there are a lot of companies that would be better off if they seriously considered their options before choosing the cloud for everything.

grogers 7mo ago

Certainly if you aren't even multi-region, then multi-cloud is a pipe dream

bean469 7mo ago

> What are you going to do? Host in house?

Yep. Although it's just anecdata, it's what we do where I work - haven't had a slightest issue in years.

nxpnsv 7mo ago

Cheaper, faster, and in-house people understand what’s going on. It should be a given for many services, but somehow it’s not.

Breza 7mo ago

I totally agree with you. Where I work, we self-host almost everything. Exceptions are we use a CDN for one area where we want lower latency, and we use BigQuery when we need to parse a few billion datapoints into something usable.

It's amazing how few problems we have. Honestly, I don't think we have to worry about configuration issues as often as people who rely on the cloud.

erikpukinskis 7mo ago

On premise? Or do you build servers in a data center? Or do you lease dedicated servers?

bean469 6mo ago

We have our own data center with servers. The upfront costs are high, but it was worth it in our use-case

Breza 7mo ago

Not GP, but my company also self-hosts. We rent rackspace in a colo. We used to keep my team's research server in the back closet before we went full-remote.

cakeday 7mo ago

> Host in house?

Yes, mostly.

cmiles8 7mo ago

This. When Andy Jassy got challenged by analysts on the last earnings call on why AWS has fallen so far behind on innovation in some areas, his answer was a hand-wavy response that diverted attention by saying AWS is durable, stable, and reliable and customers care more about that. Oops.

judahmeek 7mo ago

behind on innovation how exactly?

sharpy 7mo ago

The culture changed. When I first worked there, I was encouraged to take calculated risks. When I did my second tour of duty, people were deathly afraid of bringing down services. It has been a while since my second tour of duty, but I don't think it's back to "Amazon is a place where builders can build".

everfrustrated 7mo ago

Somewhat inevitable for any company as it gets larger. Easy to move fast and break things when you have 1 user and no revenue. Very different story when much of US commerce runs on you.

AbstractH24 7mo ago

For folks who came of age in the late 00's, seeing companies once thought of as disruptors and innovators become the old stalwarts post-pandemic/ZIRP has been quite an experience.

Maybe those who have been around longer have seen this before, but it's the first time for me.

llmslave 7mo ago

If you bring something down in a real way, you can forget about someone trusting you with a big project in the future. You basically need to switch orgs

chaostheory 7mo ago

Curious. When did AWS hit “Day Two”, or what year was your 2nd tour of duty?

RedShift1 7mo ago

I've never heard "tour of duty" used outside of the military. Is it really so bad over at AWS that it has to be called that?

JCM9 OP 7mo ago

I listened to the earnings call. I believe the question was mostly focused on why AWS has been so behind on AI. Jassy did flub the question quite badly and rambled on for a while. The press has mentioned the botched answer in a few articles recently.

etothet 7mo ago

They have been pushing me and my company extremely hard to vet their various AI-related offerings. When we decide to look into whatever service it is, we come away underwhelmed. It seems like their biggest selling point so far is “we’ll give it to you free for several months”. Not great.

gregsadetsky 7mo ago

Fascinating, thanks for sharing this.

I found this summary:

https://fortune.com/2025/07/31/amazon-aws-ai-andy-jassy-earn...

And the transcript (there’s an annoying modal obscuring a bit of the page, but it’s still readable):

https://seekingalpha.com/article/4807281-amazon-com-inc-amzn...

(search for the word “tough”)

ifwinterco 7mo ago

Everything except us-east-1 is generally pretty reliable. At $work we have a lot of stuff that's only on eu-west-1 (yes not the best practice) and we haven't had any issues, touch wood

ttul 7mo ago

My impression is that `us-east-1` has the worst reliability track record of any region. We've always run our stuff in `us-west-2` and there has never been an outage that took us down in that region. By contrast, a few things that we had in `us-east-1` have gone down repeatedly.

hnfong 7mo ago

Just curious, what's special about us-east-1?

stego-tech 7mo ago

It’s the “original” AWS region. It has the most legacy baggage, the most customer demand (at least in the USA), and it’s also the region that hosts the management layer of most “global” services. Its availability has also been dogshit, but because companies only care about costs today and not harms tomorrow, they usually hire or contract out to talent that similarly only cares about the bottom line today and throws stuff into us-east-1 rather than figure out AZs and regions.

The best advice I can give to any org in AWS is to get out of us-east-1. If you use a service whose management layer is based there, make sure you have break-glass processes in place or, better yet, diversify to other services entirely to reduce/eliminate single points of failure.

dijit 7mo ago

I have a joke from 15 years ago, where I compared my friend who flaked out all the time as "having less availability than US-EAST-1".

This is not a new issue caused by improper investment, it's always been this way.

riknos314 7mo ago

Former AWS employee here. There's a number of reasons but it mostly boils down to:

It's both the oldest and largest (most ec2 hosts, most objects in s3, etc) AWS region, and due to those things it's the region most likely to encounter an edge case in prod.

mvkel 7mo ago

It's closest to "geographical center" so traffic from Europe feels faster than us-west

tete 7mo ago

> The AWS team keeps touting the rock solid reliability of AWS as a reason why we shouldn’t diversify our cloud. Should be a fun meeting!

This isn't true and never was. I've done setups in the past where monitoring happened "multi cloud", along with multiple dedicated servers. It was pretty broad, so you could actually see where things broke.

Was quite some time ago so I don't have the data, but AWS never came out on top.

It actually matched largely with what netcraft.com put out. Not sure if they still do that and release those things to the public.

testplzignore 7mo ago

Netcraft confirmed it? I haven't heard that name since the Slashdot era :)

dotancohen 6mo ago

Get off my lawn you insensitive clod!

chaostheory 7mo ago

This makes sense given all the open source projects coming out of Netflix like chaos monkey.

eric-hu 7mo ago

Which cloud provider came out on top?

llmslave 7mo ago

AWS has been in long-term decline; most of the platform is just in keeping-the-lights-on mode. It's also why they are behind on AI: a lot of would-be innovative employees get crushed under red tape and performance management.

nextworddev 7mo ago

Good thing they are the biggest investor into Anthropic

GoblinSlayer 7mo ago

But then you will be affected by outages of every dependency you use.

caymanjim 7mo ago

This is the real problem. Even if you don't run anything in AWS directly, something you integrate with will. And when us-east-1 is down, it doesn't matter if those services are in other availability zones. AWS's own internal services rely heavily on us-east-1, and most third-party services live in us-east-1.

It really is a single point of failure for the majority of the Internet.

dexterdog 7mo ago

This becomes the reason to run in us-east-1 if you're going to be single region. When it's down nobody is surprised that your service is affected. If you're all-in on some other region and it goes down you look like you don't know what you're doing.

kelseydh 7mo ago

This whole incident has been pretty uneventful down in Australia where everything AWS is on ap-southeast-2.

parliament32 7mo ago

> Even if you don't run anything in AWS directly, something you integrate with will.

Why would a third-party be in your product's critical path? It's like the old business school thing about "don't build your business on the back of another"

caymanjim 7mo ago

It's easy to say this, but in the real world, most of the critical path is heavily-dependent on third party integrations. User auth, storage, logging, etc. Even if you're somewhat-resilient against failures (i.e. you can live without logging and your app doesn't hard fail), it's still potentially going to cripple your service. And even if your entire app is resilient and doesn't fail, there are still bound to be tons of integrations that will limit functionality, or make the app appear broken in some way to users.

The reason third-party things are in the critical path is because most of the time, they are still more reliable than self-hosting everything; because they're cheaper than anything you can engineer in-house; because no app is an island.

It's been decades since I worked on something that was completely isolated from external integrations. We do the best we can with redundancy, fault tolerance, auto-recovery, and balance that with cost and engineering time.

If you think this is bad, take a look at the uptime of complicated systems that are 100% self-hosted. Without a Fortune 500 level IT staff, you can't beat AWS's uptime.

jen20 7mo ago

With the exception of Amazon, anyone in this situation already has a third-party product in their critical path - AWS itself.

chasd00 7mo ago

> Why would a third-party be in your product's critical path?

I bet only 1-2% of AI startups are running their own models and the rest are just bouncing off OpenAI, Azure, or some other API.

thinkindie 7mo ago

Not necessarily our critical path, but today CircleCI was greatly affected, which also affected our capacity to deploy. Luckily it was a Monday morning, so we didn’t even have to deploy a hotfix.

pcdevils 7mo ago

That's nearly every AI start-up done for

macintux 7mo ago

No man is an island, entire of itself

unethical_ban 7mo ago

* IAM / Okta
* Cloud VPN services
* Cloud Office (GSuite, Office365)

Good luck naming a large company, bank, even utility that doesn't have some kind of dependency like this somewhere, even if they have mostly on-prem services.

1-6 7mo ago

Glad that you're taking the first step toward resiliency. At times, big outages like these are necessary to give a good reason why the company should go multi-cloud. When things are working without problems, no one cares to listen to the squeaky wheel.

morshu9001 7mo ago

This was a single region outage, right? If you aren't cross-region, cross-cloud is the same but harder

jen20 7mo ago

I would be interested in a follow up in 2-3 years as to whether you've had fewer issues with a multi-cloud setup than just AWS. My suspicion is that will not be the case.

FlynnLivesMattr 7mo ago

How did the call go?

lootgraft 7mo ago

> The AWS team keeps touting the rock solid reliability of AWS as a reason why we shouldn’t diversify our cloud.

If it's an internal "AWS team", then this translates to "I am comfortable using this tool, and am uninterested in having to learn an entirely new stack."

If you have to diversify your cloud workloads, give your devops team more money to do so.

ej_campbell 7mo ago

Aren't you deployed in multiple regions?

BoredPositron 7mo ago

Still no serverless inference for models or inference pipes that are not available on Bedrock, still no auto-scaling GPU workers. We started bothering them in 2022... crickets.

wrasee 7mo ago

Please tell me there was a mixup and for some reason they didn’t show up.
