GCP should be designed in a way where "global outage" isn't even in their vocabulary.
At my company we are split into three regions (US, EU, APAC), and we have the same issue: global outages for things we could have managed regionally. Whatever the global architecture saves, it disappears each minute a client is down in a global outage because someone thousands of kilometres away messed up.
You don't have to unify, at all. You don't unify with your competitors, and the world has not exploded. Why not compete internally between regions?
As far as global services go though, it's easy enough to say "it should just not be possible", but how do you propose doing that in practice for a global service?
How is new config going to go out, globally, without being global? How do global services work if they're not global? How does DDoS protection work if you don't do it globally?
People make fun of "webscale" but operating Google is really difficult and complicated!
As I understand it, GCP is already designed to make global outages impossible. Obviously this outage shows that they messed up somehow and some global point of failure still remains. Looking forward to the post-mortem.
How much more work would Google create for themselves if they had not globalized their stack? Are we talking something like 5 subsets to manage instead of 1?
Ex-googler, no particular knowledge of this event, information might be out of date.
Of course, if you deploy a change to all of your separated stacks at once through some sort of automated pipeline, the separation doesn't buy you much. It's easy to break everything simultaneously that way if there's some difference between test and prod you didn't realize was there.
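A minimal sketch of the safer alternative: roll the change out one stack at a time with a bake period, and stop at the first unhealthy region. The `deploy` and `healthy` helpers here are hypothetical stand-ins for a real pipeline, not any actual deploy tooling:

```python
import time

REGIONS = ["us", "eu", "apac"]  # isolated stacks, rolled out in order

def healthy(region: str) -> bool:
    # Hypothetical health probe; in practice this would query the
    # region's monitoring for error-rate regressions after the change.
    return True

def deploy(region: str, change: str) -> None:
    # Hypothetical per-region deploy step; stands in for the real pipeline.
    print(f"deploying {change} to {region}")

def staged_rollout(change: str, bake_seconds: int = 0) -> list[str]:
    """Roll out one region at a time; abort at the first unhealthy region."""
    done = []
    for region in REGIONS:
        deploy(region, change)
        time.sleep(bake_seconds)  # let the change bake before continuing
        if not healthy(region):
            print(f"aborting rollout: {region} unhealthy after {change}")
            break
        done.append(region)
    return done
```

The point is that the blast radius of a bad change is one region plus however long the bake period is, rather than everything at once.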
I reckon the only way to achieve that would be to have the same level of interoperability between regions as you would get between two distinct cloud providers.
Of course, at Google scale 'partial' is still very big.
And why enterprises clamoring for AWS to feature-match Google's global stuff (theoretically making I.T. easier) instead of remaining regionally isolated (actually making I.T. more resilient, with no extra work if I.T. operators can figure out infra-as-code patterns) should STFU and learn themselves some Terraform, Pulumi, or the like.
Also, AWS, if you're in this thread: stop with the recent cross-region coupling features already. Google's doing it wrong. Explain that, be patient, and the market share will come back to you when they run out of the GCP subsidy dollars.
> gcp should be designed in a way where the term “global outage” isn’t a word in their vocabulary.
If that's what you really need, then distribute your assets across GCP, AWS, and DO. That likely means not using any cloud-specific features such as Lambda. AWS is actually really good in this regard, as SES and RDS are easily replicated with regular instances in other cloud providers, which may wrap some cloud-specific feature themselves.

Listening to the "All In Podcast" yesterday, even those guys were talking about revenue drops in the big cloud services and noting that we're currently in the midst of a swing back to self-hosting/co-location/whatever thinking and migrations out.
IMHO those building greenfield solutions today should take a hard look at whether the default approach of the last ~10 years ("of course you build in $BIGCLOUD") makes sense for the application; in many cases it does not.
It also has the added benefit of decentralizing the internet a bit (even if only a little).
I would be hesitant to attribute slowed growth to a return to self-hosting; it's much more likely caused by companies dialing back their cloud growth after spending a few years going ham digitizing everything during the pandemic.
From that POV I expect my platform to behave like a utility (never change, or change only with strict backward compatibility). That level of control is simply at odds with the business model of the cloud.
That said, I think the point generally stands: one could argue slower-than-expected growth in cloud services is a revenue drop (in a way) versus expectations. The market responded accordingly[0]: "However, Azure growth is decelerating." Note that this includes the explosion of "2023 AI hotness", which is almost certainly offsetting what would otherwise be larger losses from the shift I'm describing. As the All In guys noted, "you won't see a pitch deck without the letters AI in it", and a good chunk of that is still going to cloud providers, as (in my opinion) there are long tails to these changes and many existing solutions/applications getting "AI" slapped on them are effectively trapped in $BIGCLOUD.
Self-hosting AI is also significantly more difficult and more expensive up front once you start dealing with (typically) Nvidia hardware costs and software-stack complexity. For many of these "we need to throw AI in this" pivots, I can definitely see the better-understood and initially faster, "cheaper" route of cloud services continuing until the AI trend stabilizes.
From what I could hear (and process) over the screaming, the All In guys presented the argument I tend to agree with: a resurgence of self-hosted infrastructure.
Companies are also dialing back cloud spend because they're realizing that for many applications it's comparatively very expensive, and can actually be limiting compared to self-hosting[1]. As usual, when the cheap money and the economic boom retract, they start actually looking at costs they were once happy to just keep writing checks for.
I'd like to reiterate there's a lot of calculation and strategy when it comes down to selecting infrastructure hosting. Again, I think we're in a period where there's a bit of a sea change/wakeup from the past decade of "of course you always build and host everything in $BIGCLOUD" - without even remotely considering alternatives. It's been the default for a while and it isn't as much anymore - and I'd argue that trend is accelerating. There is no "one size fits all".
[0] - https://www.investors.com/news/technology/msft-stock-microso...
[1] - https://www.linkedin.com/pulse/snapchat-earnings-case-runawa...
You build greenfield in cloud precisely because it is greenfield and the utilization isn't well understood. Cloud options let you adjust and experiment quickly. Once a workload is well understood it's a good candidate for optimization, including a move to self managed hardware / on prem.
Buying hardware is a great option once you actually understand the utilization of your product. Just make sure you also have competent operators.
I would argue that as the AI trend (eventually) wanes, and many AI startups and in-house AI projects inevitably fail to materialize, the much longer and more general migration out of $BIGCLOUD will become more drastic and obvious.
I don't buy individual stocks but I would happily bet a dinner on big cloud growth showing substantial reductions/losses in coming years as the overall situation stabilizes.
When one buys a house, they should take a hard look at whether the default approach of paying for utilities makes sense, versus generating their own power.
While that's a bit snarky, the reasoning is similar. You can:
* Use "bigcloud"(TM) with the whole kit: VMs, their managed services, etc.
* Use bigcloud, but just VMs or storage
* Rent VMs from a smaller provider
* Rent actual servers
* Buy your servers and ship them to a colo
* Buy your servers and build a datacenter
Every level you drop, you need more work, and it grows (I suspect not linearly). Sure, if you have all the required experts (or you rent them) you can do everything yourself. If not, you'll have to defer to vendors. You will pay some premium for this, but it's either that or payroll.
What also needs to be factored in is how static your system is. If a single machine works for your use-case, great.
One of the systems I manage has hundreds of millions of dollars in contracts on the line, thousands of VMs. I do not care if any single VM goes down; the system will kill it and provision a new one. A big cloud provider availability zone often spans across multiple datacenters too, each datacenter with their own redundancies. Even if an entire AZ goes down, we can survive on the other two (with possibly some temporary degradation for a few minutes). If the whole region goes down, we fallback to another. We certainly don't have the time to discuss individual servers or rack and stack anything.
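The core of that "don't care about any single VM" posture is a reconciliation loop: unhealthy instances are replaced, not repaired. A minimal sketch, where `provision` is a hypothetical stand-in for the real autoscaling API:

```python
import itertools

_ids = itertools.count(1)

def provision() -> str:
    # Hypothetical provisioning call; stands in for an autoscaling API.
    return f"vm-{next(_ids)}"

def reconcile(desired: int, fleet: set[str], is_healthy) -> set[str]:
    """One pass of a self-healing loop: drop unhealthy VMs, top back up."""
    survivors = {vm for vm in fleet if is_healthy(vm)}
    while len(survivors) < desired:
        survivors.add(provision())  # replace rather than repair
    return survivors
```

Run periodically, this keeps the fleet at the desired size regardless of which individual machines (or whole AZs' worth of machines) disappear.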
It does not come cheap. AWS specifically has egregious networking fees and you end up paying multiple times (AZ to AZ traffic, NAT gateways, and a myriad services that also charge by GB, like GuardDuty). It adds up if you are not careful.
From time to time, management floats the idea of migrating to 'on-prem', because that's reportedly cheaper. Sure: ignoring the hundreds of engineers that would be involved in the migration, and also ignoring all the engineers that would be required to maintain things on-premises, it might be cheaper.
But that's also ignoring the main reason why cloud deployments tend to become so expensive: they are easy. Confronted with the option of spinning up more machines versus possibly missing a deadline, middle managers will ask for more resources. Maybe it's "just" 1k a month extra (those developers would cost more!). It gets approved. 50 other groups are doing the same. Now it's 50k. Rinse, repeat. If more emphasis were placed on optimization, most cloud deployments could be shrunk spectacularly. The microservices fad doesn't help (your architecture might genuinely require it, but often the real reason is that you want to ship your org chart, not anything technical).
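The arithmetic of that creep is worth spelling out, using the illustrative $1k/month-per-group figure from above:

```python
def monthly_creep(groups: int, per_group_usd: int) -> int:
    """Sum of each group's individually 'small' approved increase."""
    return groups * per_group_usd

monthly = monthly_creep(50, 1_000)  # 50 groups x "just" $1k/month each
annual = monthly * 12               # the number finance eventually sees

print(monthly, annual)  # 50000 600000
```

No single approval looked unreasonable, but the fleet-wide total is $600k/year of spend nobody decided on.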
Yes, people do. They install solar panels and use them to generate at least some of their own power. Near-future battery tech might allow them to generate all of it if they get enough sunlight, at which point this becomes a genuine question to answer: the cost of installing and maintaining the panels and batteries over their lifetime, versus the expected cost of purchasing power from utilities.
In a similar manner, cloud vs self-hosting is a valid consideration that changes over time. We now have Docker and similar tools, which make managing your own infrastructure much easier than it was ten years ago. I fully expect even better tools in the future, so the calculus will keep shifting. Maybe in another ten years there'll be almost no benefit to using the cloud (except perhaps as a CDN).
<3 To the engineers trying to fix it at the moment.
If one provider is down more than the others, the criticism is not only valid, it results in real business loss for the provider and its customers.
On multi-cloud: it's one way to reduce the amount of downtime you have, but it comes with a significant operational cost depending on how your application is architected and how your internal teams are organized. It is totally practical to bank on AWS's reliability until you have enough traction or revenue that the added uptime of going multi-cloud is worth the investment. I know you're not saying this isn't the case (I think you're saying "do that if you're going to complain about one provider's uptime"), but it seemed worth putting the context into the HN ether.
Multi-cloud is saying you think you can manage Kafka across two or three clouds better than GCP can manage Pub/Sub.
Er, we absolutely can and should compare rates of problems and overall reliability.
If you run your own hardware these events are inevitable too.
I know it's just a psychological thing about giving up "control", but I have to stifle a chuckle every time.
Is anyone tracking reliability for these public providers? I'd be curious how AWS compares to Azure and GCP. My experience is that it's better, but maybe we've just avoided Kinesis or whatever it is that keeps going down.
in multiple datacenters?
Not great, not terrible.
Let's take a gander at incident history: https://status.cloud.google.com/summary
Cloud Build looks bad... three multi-hour incidents this year, four in fall/winter last year.
Cloud Developer Tools have had four multi-hour incidents this year, many last fall/winter.
Cloud Firestore looks abysmal... Six multi-hour incidents this year, one of them 23 hours.
Cloud App Engine had three multi-hour incidents this year, many in fall/winter last year.
BigQuery had three multi-hour incidents this year, many in fall/winter last year.
Cloud Console had five multi-hour incidents this year, many in fall/winter last year. (And from my personal experience, their console blows pretty much all the time)
Cloud Networking has had nine incidents this year, one of them was eight days long. What the fuck.
Compute Engine has had five multi-hour incidents this year, many last fall/winter.
GKE had three incidents this year, and multiple over the past winter.
Can somebody do a comparison to AWS? This seems shitty but maybe it's par for the course?
This is a pretty reductionist summary, e.g. the 8-day Cloud Networking incident root cause:
> Description: Our engineering team continues to investigate this issue and is evaluating additional improvement opportunities to identify effective rerouting of traffic. They have narrowed down the issue to one regional telecom service provider and reported this to them for further investigation. The connectivity problems are still mostly resolved at this point although some customers may observe delayed round trip time or longer latency or sporadic packet loss until fully resolved.
Still a big problem product-wise, but you're looking at a global incident history view without any region/severity filters.
The corresponding AWS service health dashboard makes it much harder to view this level of detail, but it is also actually useful for someone asking "is product $xyz, which I depend on in region $abc, currently down or not?"
(full disclosure, work at Google but not on cloud stuff)
https://arstechnica.com/tech-policy/2023/02/us-says-google-r...
They claim the Gmail specific issues are resolved. We shall see...
Feb 27, 2023 2:03 PM UTC We experienced a brief network outage with packet loss, impacting a number of workspace services. The impact is over. We are investigating and monitoring.