Google Kubernetes Engine's third consecutive day of service disruption (opens in new tab)

(status.cloud.google.com)

779 pointsrlancer7y ago407 comments

407 comments

I am currently evaluating GCP for two separate projects. I want to see if I understand this correctly:

1) For three whole days, it was questionable whether or not a user would be able to launch a node pool (according to the official blog statement). It was also questionable whether a user would be able to launch a simple compute instance (according to statements here on HN).

2) This issue was global in scope, affecting all of Google's regions. Therefore, in consideration of item 1 above, it was questionable/unpredictable whether or not a user could launch a node pool or even a simple node anywhere in GCP at all.

3) The sum total of information about this incident can be found as a few one or two sentence blurbs on Google's blog. No explanation nor outline of scope for affected regions and services has been provided.

4) Some users here are reporting that other GCP services not mentioned by Google's blog are experiencing problems.

5) Some users here are reporting that they have received no response from GCP support, even over a time span of 40+ hours since the support request was submitted.

6) Google says they'll provide some information when the next business day rolls around, roughly 4 days after the start of the problem.

I really do want to make sure I'm understanding this situation. Please do correct me if I got something wrong in this summary.

manigandham7y ago

When everything works, GCP is the best. Stable, fast, simple, reliable.

When things stop working, GCP is the worst. Slow communications and they require way too much work before escalating issues or attempting to find a solution.

They already have the tools and access so most issues should take minutes for them to gather diagnostics, but instead they keep sending tickets back for "more info", inevitably followed by a hand-off to another team in a different time zone. We have spent days trying to convince them there was an issue before, which just seems unacceptable.

I can understand support costs but there should be a test (with all vendors) where I can officially certify that I know what I'm talking about and don't need to go through the "prove its actually a problem" phase every time.

laurencei7y ago

As someone who works for Government and Enterprise - all I care about sometimes is how a company behaves when everything goes wrong.

The issue with outages for the Government organizations I have dealt with is rarely the outage itself - but strong communication about what is occurring and realistic approximate ETAs, or options around mitigation.

Being able to tell the Directors/Senior managers that issues have been "escalated" and providing regular updates are critical.

If all I could say was a "support ticket" was logged, and we are waiting on a reply (hours later) - I guarantee the conversation after the outage is going to be about moving to another solution provider with strong SLAs.

4 more replies

Osiris7y ago

"Support costs" calculation often doesn't include the costs of not having support.

When I worked at GoDaddy, there were around 2/3 of the company was customer support.

At the current company I'm at, a cryptocurrency exchange, our support agents frequently hear they prefer our service over others because of our fast support response times (crypto exchanges are notorious for really poor support).

All of my interactions with Amazon support have been resolved to my satisfaction within 10 minutes or less.

Companies really ought to do the math on the value that comes from providing fast, timely, and easy (don't have to fight with them) customer support.

Google hasn't learned this lesson.

1 more reply

jgalentine0077y ago

With Dell you can certify with them so you can get replacement parts and such without the BS back and forth with some guy in india. Saves everyone time and money.

1 more reply

dvdgsng7y ago

> instead they keep sending tickets back for "more info"

Isn't that the case with basically every support request, no matter the company or severity? The first couple of emails from 1st & even 2nd level support are mostly about answering the same questions about the environment over and over again. We've had this ping-pong situation with production outages (which we eventually analysed and worked around by ourselves) and fairly small issues like requesting more information of an undocumented behavior which didn't even effect us much. No matter how important or urgent the initial issue was, eventually most requests end up being closed unresolved.

2 more replies

softawre7y ago

Heh, your "test" reminds me of an old Hanselman article:

https://www.hanselman.com/blog/FizzBinTheTechnicalSupportSec...

ElBarto7y ago

To say "when it works it's stable and reliable" implies that it is neither...

1 more reply

dilyevsky7y ago

We had an issue a few weeks back where all nodes in west1-a could not pull docker images. Google support was pinballing P1 issue around the globe and across multiple teams for a few days untill I root caused it for them - turned out to be gce service account issues affecting entire zone. 2 days to rollback (no status page update). I know nobody gives a fuck but can’t help but feel vindicated as an ex google sre.

icelancer7y ago

I think a lot of people give a fuck here; I do, at least. Thanks for outlining it, these things are fascinating (to me anyway, who has never worked in IT/ops).

navinsylvester7y ago

We are GCP customers for the last couple of years. We use other cloud platforms(AWS, IBM, Oracle, OrionVM) too. We don't use GKE but use rancher/kubernetes combo on their standard platform.

So far GCP is the best, hands down in terms of stability. We never had a single outage or maintenance downtime notification till now. We are power users but our monitoring didn't pick any anomaly so i don't think this issue had rampant impact on other services.

But i find it concerning that they provided very little update on what went wrong. I also think its better to expect nil support out of any big cloud provider if you don't have paid support. Funny how all these big cloud providers think you are not eligible for support de-facto. Sigh.

rogerkirkness7y ago

I agree with this. Compared to AWS, when Google says it's down, it's down, and that's rare. When they say it's up, it's up.

vira287y ago

I use AWS free tier and get customer support through email, but thats not the case with GCP. Do they provide free email support?

If you are an early stage startup can you afford their 200/Month support, when your entire GCP bill is under $1. However, that doesn't mean you don't have to support them.

2 more replies

ToFab1237y ago

I don't understand why someone would choose to deploy anything mission critical without having an support contract with the ISP, the manufacturer of the the software etc.

3 more replies

lilbobbytables7y ago

You're doing me a scare. I'm in the evaluation phase with them. Maybe I'm missing something here, but this is not at all what the linked post says.

"We are investigating an issue with Google Kubernetes Engine node pool creation through Cloud Console UI."

So, it's a UI console issue, it appears you can still manage

"Affected customers can use gcloud command [1] in order to create new Node Pools. [1]"

Similarly, it actually was resolved in Friday, but they forgot to mark it as so.

"The issue with Google Kubernetes Engine Node Pool creation through the Cloud Console UI had been resolved as of Friday, 2018-11-09 14:30 US/Pacific."

shareometry7y ago

You are right about the Google blog content itself not indicating three days of outage. Turns out they just forgot to mark that particular issue as resolved on Friday, as you point out. This is my mistake. I would update my comment to reflect this, but it doesn't seem to allow an edit at this point.

The items I put down in my comment are based largely on user reports, though (there isn't much else to go on). And I mean these items as questions (i.e. "is this accurate?"). Folks here on HN have definitely been reporting ongoing problems and seem to be suggesting that they are not resolved and are actually larger in scope than the Google blog post addressed.

Someone from Google commented here a few hours ago indicating Google was looking into it. And other folks here are reporting that they don't have the same problems. So it's kind of an open question what's going on.

I'm in the evaluation phase too. And I've found a lot to like about GCP. I'm hoping the problems are understandable.

haldora7y ago

I've been failing all weekend to create nodes in a GKE cluster through either the UI console or gcloud. Even right now I can't get any nodes to spin up.

Edit: I finally got my cluster up and running by removing all nodes, letting it process for a few minutes, then adding new nodes.

timdumol7y ago

We've had no issues deleting and creating node pools this weekend (on asia-east1-a). No other problems noticed either.

fizzledbits7y ago

As of this morning, I am still unable to reliably start my docker+machine autoscaling instances. In all cases the error is "Error: The zone <my project> does not have enough resources available to fulfill the request" An instance in us-central1-a has refused to start since last Thursday or Friday.

I created a new instance in us-west2-c, which worked briefly but began to fail midday Friday, and kept failing through the weekend.

On Saturday I created yet another clone in northamerica-northeast1-b. That worked Saturday and Sunday, but this morning, it is failing to start. Fortunately my us-west2-c instance has begun to work again, but I'm having doubts about continuing to use GCE as we scale up.

And yet, the status page says all services are available.

raincom7y ago

If you run your own k8s on GCP, you are not going to be affected by GKE.

aviv7y ago

I can't comment regarding GKE as we don't use that particular service, however we are very heavy users of many other GCP services, including Compute, Datastore, BigQuery, Pub/Sub, Storage, Functions, Speech, and others. Zero issues this weekend, everything is running 100% as any normal day.

tejohnso7y ago

> For three whole days, it was questionable whether or not a user would be able to launch a node pool (according to the official blog statement)

What blog statement are you referring to? I don't see any such statement. Can you provide a link?

The OP incident status issue says "We are investigating an issue with Google Kubernetes Engine node pool creation through Cloud Console UI". It also says "Affected customers can use gcloud command in order to create new Node Pools."

So it sounds like a web interface problem, not a severely limiting, backend systems problem with global scope.

Also, the report says "The issue with Google Kubernetes Engine Node Pool creation through the Cloud Console UI had been resolved as of Friday, 2018-11-09 14:30 US/Pacific". So the whole issue lasted about 10 hours, not three whole days.

> Some users here are reporting that other GCP services not mentioned by Google's blog are experiencing problems

I don't see much of that.

paulddraper7y ago

I believe the OP was referring to the very same blog (web log) you cited.

https://status.cloud.google.com/incident/container-engine/18...

"We are investigating an issue with Google Kubernetes Engine node pool creation through Cloud Console UI."

> So it sounds like a web interface problem, not a severely limiting

Depends who you as to whether this is "severely" limiting, but yes there is a workaround by using an alternate interface.

marcinzm7y ago

Right now we don't know. It's one of two possibilities from what I can tell:

a) Google had a global service disruption that impacted Kubernetes node pool creation and possible other services since Friday. They had a largely separate issue for a web UI disruption (what this thread links to) which they forgot to close on Friday. They still have not provided any issue tracker for the service distribution and it's possibly they only learned about it from this hacker news thread.

b) People are having various unrelated issues with services that they're mis-attributing to a global service disruption.

johnpython7y ago

This is why GCP has no hope of ever taking significant market share from AWS. Google thinks they can treat their cloud customers like they treat users of their free services. Customer support and communication are essential.

ernsheong7y ago

As if something like this has never happened to AWS?

2 more replies

halbritt7y ago

I'm not sure about the market share, but I agree with the last two sentences.

...and I'm a happy GCP customer.

rorykoehler7y ago

I recently removed my hosting from GCP. The pricing is confusing and unbelievable. Their customer service is a joke. I don't trust Google for longterm consistency due to the way they shut their own apps but I let that slide as I doubt they will do that on their cloud services. I have experience with AWS (rock solid, world class support but also costly), digital ocean (improving fast), heroku (good for beginners but also expensive and not as full featured as AWS) and finally Hetzner (too early to judge).

meow_mix7y ago

I think you're missing the portion about how it only appears to be the console ui, no?

ransom15387y ago

“2) This issue was global in scope, affecting all of Google's regions. Therefore, in consideration of item 1 above, it was questionable/unpredictable whether or not a user could launch a node pool or even a simple node anywhere in GCP at all.”

Ok. So on aws we were* paying for putting systems across regions, but, honestly I don’t get the point. When an entire region is down what I have noticed is that all things are fucked globally on aws. Feel free to pay double - but it seems* if you are paying that much just pay for an additional cloud provider. Looks like it’s the same deal on GCP.

human_error7y ago

> When an entire region is down what I have noticed is that all things are fucked globally on aws.

Do you have an example on this?

2 more replies

usmannk7y ago

We had an issue a few weeks ago where the google front-end servers were mangling responses from Pub/Sub and returning 502 responses, making the service completely unusable and knocking over a number of things we have running in production. Despite paying for enterprise support and having in a P1 ticket, we had to spend Friday to Sunday gathering evidence to prove to the support staff that there was indeed a problem, because their monitoring wasn't detecting it. Right now I'm doing something similar (and since Friday!) but for TLS issues they're having. Again, because their support reps don't believe there's a problem. There are so many more problems than they ever show on their status page...

Jedi727y ago

They work for Google so obviously they are much smarter than you. If theres a problem its probably the customers fault. /sarcasm

mbrumlow7y ago

I was so mad to read that until you said /sarcasm :p

That being said I really do think there is a difference between who is working at google today and the google we all fell in love with pre-2008.

I am sure there are a amazing people still working at google, but nowhere near like it was.

The way I like to think about google is that some amazing people mad ea awesome train that builds tracks in front of it -- you can call them gods maybe -- but those people are gone -- or a least the critical mass required to build such a train has dwindled to just dust. What we have left is a awesome train full of people pulling the many levers left behind.

To make things even worse my last interview as a SRE left me wondering if even the people who are there know this as well, and they are actually working hard to keep out those who might expose light on to this. I don't say that because I did not get the job -- I am actually happy I did not get extended a offer.

I say this with one exception, the old-timer who was my last interview. I could tell he was dripping in knowledge and eager to share it with any that would listen. I came out of his 45 min session learning many things -- I wold actually pay to work with a guy like that.

I would also like to point out that the work ethic was not what I expected. I was told that when on call, my duty was to figure out the root cause was in the segment I was responsible for. I don't know about you, but if my phone rings at night I am going to see through to a resolution and understand the problem in full -- even if it is not on the segment that I was assigned.

/end rant

3 more replies

jkaplowitz7y ago

Many of the tier 1 GCP support reps work for external vendors nowadays, which is probably part of the problem.

During my time on the GCE team (note I don't work at Google now) I knew multiple full-time Google employee support reps, including some still at the company. They have the good attitude and deep knowledge you'd hope for.

The problem is simply about how Google scales their GCP support org. To be completely clear, AWS support is by and large not great either.

If you're a big or strategically important customer, of course, you can get a good response from either company.

2 more replies

lstamour7y ago

I totally heard that when trying to get engineer attention to YouTube Premium’s frequent “download errors” due to their transcoding-on-the-fly (or something). I was telling a support rep I had evidence that if I switched the setting from standard to high def (or vice versa) that the error would go away, but I could reproduce it with the same video, and thought it was a CDN/transcode issue. They kept marking the ticket as “unable to reproduce” and I had to wonder — as a paying customer, don’t they have analytics on my phone that would show exactly the request I was making which was failing in their logs? And if they saw it succeed, why not tell me the problem was my ISP? I’d have been happy to follow up... but nothing’s ever wrong in Google-land. :/

reaperducer7y ago

Again, because their support reps don't believe there's a problem.

Perhaps if you explained it on a whiteboard...

navinsylvester7y ago

In general GCP has quota limits so its expected of customers to catch 5xx and do exponential back off. But this info is not explicitly stated.

From my personal experience - i think all big cloud providers first two level support staff is no good if it isn't an obvious dumb one on your part. I always prefer to forgo support and try to go through every bit of their documentation to figure it out on our own. This helps to save huge amount of time. But if you have developer support - it can help to expedite things little faster though.

halbritt7y ago

Did they ask you for a screenshot?

That's my favorite.

StreamBright7y ago

I just took a screenshot of this and any time somebody asks my why AWS i will forward them. Thanks!

Jedi727y ago

"The data says engagement is down 46%, I think its time we drop the product."

- Someone at Google right now, probably.

justinsb7y ago

I can assure you that's not the case! Also, while people like to repeat this meme, Google Cloud does have a formal deprecation policy (https://cloud.google.com/terms/), whose intent is to give you some assurances.

(I work at Google, on GKE, though I am not a lawyer and thus don't work on the deprecation policy)

chrisseaton7y ago

> Google may discontinue any Services or any portion or feature for any reason at any time without liability to Customer

for any reason

at any time

4 more replies

brian-armstrong7y ago

What happens when they suddenly deprecate the deprecation policy?

1 more reply

whydoineedthis7y ago

im pretty sure he just forget the /s (sarcasm) on his post, but this was pretty cool information anyway, so thanks!

davemp7y ago

I think it’s telling of Google’s culture that the corporate arm felt the need to formalize this in law. I won’t pretend to know what it’s telling. Just suggest that you listen for yourself. Look at rule of law versus the ideas of liberty if you’d like a stronger nudge.

1 more reply

justinsb7y ago

Hi - I work at Google on GKE - sorry about the problems you're experiencing. There's a lot of people inside Google looking into this right now!

It looks like the UI issue was actually fixed, and that we just didn't update the status dashboard correctly. But we're double checking that and looking into some of the additional things you all have reported here.

antpls7y ago

The status dashboard is inaccurate and/or a lie. It only tells about the GKE incident, while in fact the problem also impacts Google Compute Engine users. I was unable to create any google compute instance today, not even a basic 1vcpu, on NA and Europe-west.

As another comment pointed out, what's the point of having so many zones and redundancy around the globe if such global failure can still happen? I thought the "cloud" was supposed to make this kind of failure impossible

stevehawk7y ago

This is unfortunately the norm. Like when AWS S3 went down (but couldn't update its own status images because they're in S3 and we all laughed) and along with it went Alexa, lambda, and every other service dependent on S3.

1 more reply

carbocation7y ago

> I was unable to create any google compute instance today, not even a basic 1vcpu, on NA and Europe-west.

I've been creating GCP instances in us-central1-a and us-central1-c today without issue. Which zone were you using in NA?

I have been noticing unusual restarts, but I haven't been able to pin down the cause yet (may be my software and not GCP itself).

2 more replies

0xbadcafebee7y ago

> I thought the "cloud" was supposed to make this kind of failure impossible

You have to remember that you're trying to have access to backend platforms and infrastructure at all times, which almost no public utility does (assuming "the cloud" is "public utility computing"). Power plants go into partial shutdown, water treatment plants stop processing, etc. Utilities are only designed to provide constant reliability for the last mile.

If there's a problem with your power company, they can redirect power from another part of the grid to service customers. But some part of your power company is just... down. Luckily you have no need to operate on all parts of the grid at all times, so you don't notice it's down. But failure will still happen.

Your main concern should be the reliability of the last mile. Getting away from managing infrastructure yourself is the first step in that equation. AppEngine and FaaS should be the only computing resources you use, and only object storage and databases for managing data. This will get you closer to public utility-like computing.

But there's no way to get truly reliable computing today. We would all need to use edge computing, and that means leaning heavily on ISPs and content provider networks. Every cloud computing provider is looking into this right now, but considering who actually owns the last mile, I don't think we're going to see edge computing "take over" for at least a decade.

aiisjustanif7y ago

> I thought the "cloud" was supposed to make this kind of failure impossible

If set up properly to be utilized correctly, yeah. But, it's not a perfect world though.

davemp7y ago

I’ll suggest considering whether entities enamored with centralizing ideals are more likely to fail to properly realize the robustness of a distributed system.

aviv7y ago

We have created GCE instances in several US regions without any issue today. Last one was 10 minutes ago in west2.

marcinzm7y ago

I appreciate all the effort you're putting in and I understand such situations can be stressful but user's having to depend on someone responding on hacker news for status updates seems really amateur for an organization the size of google.

NicoJuicy7y ago

The default is : https://status.cloud.google.com/incident/container-engine/18...

People who respond here could be employees of Google, caring about it and respond here because they know it.

What he can mention ( a lot of people are working on it) is what you can suspect when something is going down. All other cloud providers do the same.

1 more reply

trhway7y ago

>really amateur for an organization the size of google.

There is a reason while Google have been having hard time making inroads in the enterprise cloud. Kind of impedance mismatch between enterprise and the Google style. That 2 stories like high "We heart API" sign on the Google Enterprise building facing 237 just screams about it :)

rdtsc7y ago

Strangely and sadly with gmail account blocking and other such issues HN and Twitter is often better way to get Google's support than to contact support.

ben_jones7y ago

As much as I love bashing big corps I see HN as a supplementary communication channel for products like GCP - its a luxury we get to access alongside normal customer support channels in the GCP console, twitter, etc.

2 more replies

tomcam7y ago

Thanks for jumping in here on your own time. The following question is not meant to be hostile, it is merely curiosity. Isn’t this supposed to be the kind of thing that monitoring and diagnostics software should find automatically? Serious question, not meant to embarrass you.

rlancerOP7y ago

Creating clusters via the UI is still not working for me.

rlancerOP7y ago

UPDATE: Created a Cluster successfully in Australia... Still not able to do so in the US.

zachberger7y ago

Have you tried via the gcloud command?

fizzledbits7y ago

I created a new instance in us-west2-c, which worked briefly but began to fail midday Friday, and kept failing through the weekend.

And yet, the status page says all services are available.

dilyevsky7y ago

So, given that i filed this months ago via official support and it’s still not fixed, can you look into misleading container memory reporting ui bug. It reports memory_total but should be working_set

hacknat7y ago

Question to Google employees:

Why do you guys suffer global outages? This is your 2nd major global outage in less than 5 years. I’m sorry to say this, but it is the equivalent of going bankrupt from a trust perspective. I need to see some blog posts about how you guys are rethinking whatever design can lead to this - twice - or you are never getting a cent of money under my control. You have the most feature rich cloud (particularly your networking products), but down time like this is unacceptable.

illumin87y ago

Google has a global SDN (software-defined network) that gives them some unique and beneficial capabilities, like being able to onboard traffic in the closest CDN POP and letting it ride over the Google backbone to the region your systems are running in.

The problem is that running a global SDN like this means if you do something wrong, you can have outages that impact multiple regions simultaneously.

This is why AWS has strict regional isolation and will never create cross-region dependencies (outside of some truly global services like IAM and Route 53 that have sufficient redundancy that they should (hopefully) never go down).

Disclaimer: I work for AWS, but my opinions are my own.

tehlike7y ago

2 outage in 5 years sounds pretty low, to be honest.

Disclaimer: google employee in ads, who worked on many many fires throughout the years, but talking from my personal perspective and not from my employer. I am sure we are striving to have 0, but realistically, i have seen many that says things happen. Learn, and improve.

noselasd7y ago

The issue people have with it is that it's global, not regional, indicating that there are dependencies in the entire architecture that people does not expect to be there.

3 more replies

yeukhon7y ago

5 years? I remember a major outage maybe in the past year.

1 more reply

origami7777y ago

Most feature rich cloud? I think that title belongs to AWS.

jkaplowitz7y ago

You're right in terms of breadth officially covered. But if you look at the features where they both officially have support, there are many examples where the GCP version is more reliable and usable than the AWS version. Even GKE is an example of this, despite the outage in node pool creation that we're discussing here. Way better than EKS.

(Disclosure: I worked for Google, including GCP, for a few years ending in 2015. I don't work or speak for them now and have no inside info on this outage.)

illumin87y ago

I think you're going to have to back up a claim like this with some facts.

GKE being the exception, since it was launched a couple years before EKS. AWS clearly has way more services, and the features are way deeper than GCP.

Just compare virtual machines and managed databases, AWS has about 2-3x more types of VMs (VMs with more than 4TB of RAM, FPGAs, AMD Epyc, etc.), and in databases, more than just MySQL and PostgreSQL. When you start looking at features you get features that you just can't get in GCP, like 16 read-replicas, point in time recovery, backtrack, etc.

Disclaimer: I work for AWS but my opinions are my own.

1 more reply

hacknat7y ago

Yeah. Perhaps feature rich was an overstatement. I meant that when GCP does do a product it works like I’d expect it to work and has the features I need. Not always the case with a AWS, particularly around ELBs and VPCs.

_wmd7y ago

It is a natural effect of building massive yet flat homogeneous systems, failures tend to be greatly amplified.

Most of what you can read of Google's approach will teach you their ideal computing environment is a single planetary resource, pushing any natural segmentation and partitioning out of view.

toomuchtodo7y ago

> I’m sorry to say this, but it is the equivalent of going bankrupt from a trust perspective.

It's the opposite really: the expectation that service providers have no unexpected downtime is unrealistic, and it's strange this idea persists.

Twirrim7y ago

(disclaimer: I work for another cloud provider)

I agree, in general, outages are almost inevitable, but global outages shouldn't occur. It suggests at least a couple of things:

1) Bad software deployments, without proper validation. A message elsewhere in this post on HN suggest that problems have been occurring for at least 5 days, which makes me think this is the most likely situation. If this is the case, presumably given this is multiple days in to the issue, rolling back isn't an option. That doesn't say good things about their testing or deployment stories, and possibly their monitoring of the product? Even if the deployment validation processes failed to catch it, you'd really hope alarming would have caught it.

or:

2) Regions aren't isolated from each other. Cross-region dependencies are bad, for all sorts of obvious reasons.

2 more replies

manigandham7y ago

The major issue is that outages are global instead of regional, effectively making it impossible to design around using the typical region/zone redundancy.

tw047y ago

Because they sell themselves as being far more reliable than internal IT. If they weren't selling on uptime, people probably wouldn't be quite so critical of downtime.

1 more reply

rodgerd7y ago

The pitch from cloud vendors always includes the idea that the cloud is more reliable than any in-house shop can achieve. So the expectation is set by the vendors.

chumboslice7y ago

2 outages in 5 years.

5. Years.

Nothing to see here, move along.

talonx7y ago

2 "global" outages. If it had been limited to a service, or a region, there would be nothing to see.

1 more reply

hartem_7y ago

I’d be curious to know what alternatives are you considering at this point?

hacknat7y ago

Azure and AWS.

tomcam7y ago

I believe this is a fair question. I’d really like to understand what Google thinks about this.

asdfasgasdgasdg7y ago

Not to minimize here (well, yes, a little), but this was a UI-only outage, from what I can tell. You could still create the pools from the command-line. It doesn't seem unreasonable to have a single, global UI server, as long as the API gateway is distributed and not subject to global outages.

rlancerOP7y ago

Was certainly not UI only

1 more reply

scarface747y ago

Say I were a CTO (I’m nowhere near it), why would I choose GCP over AWS or Azure? Even if after doing a technical assessment and I thought that GCP was technically slightly better, if something happened, the first question I would be asked is “why did you choose GCP over AWS?”

No one would ever ask why you chose AWS. The old “no one ever got fired for buying IBM”.

Even if you chose Azure because you’re a Microsoft shop, no one would question your choice of MS. Besides, MS is known for thier enterprise support.

From a developer/architect standpoint, I’ve been focused the last year on learning everything I could about AWS and chose a company that fully embraced it. AWS experience is much more marketable than GCP. It’s more popular than Azure too, but there are plenty of MS shops around that are using Azure.

013a7y ago

- Native integration with G-Suite as an identity provider. Unified permissions modeling from the IDP, to work apps like email/Drive, to cloud resources, all the way into Kubernetes IAM.

- Security posture. Project Zero is class leading, and there's absolutely a "fear-based" component there, with the open question of when Project Zero discovers a new exploit, who will they share it with before going public? The upcoming Security Command Center product looks miles ahead of the disparate and poorly integrated solutions AWS or Azure offers.

- Cost. Apples to apples, GCP is cheaper than any other cloud platform. Combine that with easy-to-use models like preemptible instances which can reduce costs further; deploying a similar strategy to AWS takes substantially more engineering effort.

- Class leading software talent. Google is proven to be on the forefront of new CS research, then pivoting that into products that software companies depend on; you can look all the way back to BigQuery, their AI work, or more recently in Spanner or Kubernetes.

- GKE. Its miles ahead of the competition. If you're on Kubernetes and its not on GKE, then you've got legacy reasons for being where you're at.

Plenty of great reasons. Reliability is just one factor in the equation, and GCP definitely isn't that far behind AWS. We have really short memories as humans, but too soon we seem to forget Azure's global outage just a couple months ago due to a weather issue at one datacenter, or AWS's massive us-east-1 S3 outage caused by a human incorrectly entering a command. Shit happens, and it's alright. As humans, we're all learning, and as long as we learn from this and we get better then that's what matters.

majewsky7y ago

> If you're on Kubernetes and its not on GKE, then you've got legacy reasons for being where you're at.

Or you have legitimate reasons for running on your own hardware, e.g. compliance or locality (I work at SAP's internal cloud and we have way more regions than the hyperscalers because our customers want to have their data stay in their own country).

1 more reply

scarface747y ago

Your response is from a geek’s viewpoint. No insult, intended, I’m first and foremost a 30 year computer geek myself - started programming in 65C02 assembly in 6th grade and still mostly hands on.

But, whether it is right or not, as an architect/manager, etc, you have to think about what’s not just best technically. You also have to manage your reputational risks if things go south and less selfishly, how quickly can you find someone with the relevant experience.

From a reputation standpoint, even if AWS and GCP have the same reliability, no one will blame you if AWS goes down if you followed best practices. If a global outage of an AWS resource went down, you’re in the same boat as a ton of other people. If everyone else was up and running fine but you weren’t because you were on the distant third cloud provider, you don’t have as much coverage.

I went out on a limb and chose Hashicorp’s Nomad as the basis of a make or break my job project I was the Dev lead/architect for hoping like hell things didn’t go south and the first thing people were going to ask me is why I chose it. No one had heard of Nomad but I needed a “distributed cron” type system that could run anything and it was on prem. It was the right decision but I took a chance.

From a staffing standpoint, you can throw a brick and hit someone who at least thinks they know something about AWS or Azure GCP, not so much.

It’s not about which company is technically better, but I didn’t want to ignore your technical arguments...

Native integration with G-Suite as an identity provider. Unified permissions modeling from the IDP, to work apps like email/Drive, to cloud resources, all the way into Kubernetes IAM.

You can also do this with AWS - use a third party identity provider and map them to native IAM user and roles.

https://docs.aws.amazon.com/IAM/latest/UserGuide/id_roles_cr...

Cost. Apples to apples, GCP is cheaper than any other cloud platform. Combine that with easy-to-use models like preemptible instances which can reduce costs further; deploying a similar strategy to AWS takes substantially more engineering effort.

The equivalent would be spot instances on AWS.

From what (little) I know about preemptible instances, it seems kind of random when they get reassigned but Google tries to be fair about it. The analagous thing on AWS would be spot instances where you set the amount you want to pay.

Class leading software talent. Google is proven to be on the forefront of new CS research, then pivoting that into products that software companies depend on; you can look all the way back to BigQuery, their AI work, or more recently in Spanner or Kubernetes.

All of the cloud providers have managed Kubernetes.

As far as BigQuery. The equivalent would be Redshift.

https://blog.panoply.io/a-full-comparison-of-redshift-and-bi...

Reliability is just one factor in the equation, and GCP definitely isn't that far behind AWS

Things happen. I never made an argument about reliability.

2 more replies

halbritt7y ago

GCP has a few features that set it apart from other cloud providers. GKE is head and shoulders above the other offerings from AWS and Azure.

GCP can be a fair bit cheaper than AWS and Azure for certain workloads. Raw compute/memory is about the same. Storage can make a big difference. GCP persistent SSD costs a bit more than AWS GP2 with much better performance and way cheaper than IO2. Local SSD is also way, way cheaper than I2 instances.

Most folks deploying distributed data stores that need guaranteed performance are using local disk, so this can be a really big deal.

scarface747y ago

I have a more detailed post above, but if you are large enough, you’re not paying the listed price for AWS. But even if you are, prices change all of the time. From a completely selfish standpoint, is the price difference worth the cost to bet your reputation on if you are the one that made the final decision? Even if statistically the same could happen with AWS, no one would blame you for choosing AWS.

However, I could see doing a multicloud solution where I took advantage of the price difference for one project.

2 more replies

chimen7y ago

I chose gcloud for the easier admin interface. I somehow manage to separate all my resources and look at them based on project groups without having to know cryptic instance ids of 32 chars. Oh, and they had Kubernetes first and I jumped into that boat early.

scarface747y ago

Were you making that decision for a personal project or for something work related?

The AWS console is wildly inconsistent. I’ll give you that. But, any projects I am doing are usually defined by a Cloud Formation Template abd I can see all of the related resources by looking at the stack that was run.

Theoretically, you could use the stack price estimator, I haven’t tried it though.

https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGui...

AlexB1387y ago

This has been going on longer than three days. We have been dealing with this exact issue since at least Monday (11/5) morning in us-central1.

splap7y ago

same here. using gcloud, not web console

fizzledbits7y ago

same here as well, my us-central1 instance still will not boot

marcinzm7y ago

>Nov 09, 2018 05:59

>We will provide more information by Monday, 2018-11-12 11:00 US/Pacific.

Wait, did the people tasked with fixing this just take the weekend off?

justinsb7y ago

The incident with the UI (where we suggested using gcloud temporarily) was opened in https://status.cloud.google.com/incident/container-engine/18..., but then what sure looks to me like the same incident was closed in https://status.cloud.google.com/incident/container-engine/18....

My working assumption is that 18006 should have closed out 18005. But now it sounds like there's a different issue, which we're working to get to the bottom of.

jasonlotito7y ago

The people tasked with fixing this aren't the ones providing the updates.

INTPenis7y ago

Understandable but in my experience the incident manager assigned is still supposed to keep track of progress during weekends when you have a major incident.

And this is likely a major incident with significant customer impact.

The way google is handling all this gives a pretty poor impression. Seems like this kubernetes is just a PoC.

1 more reply

marcinzm7y ago

Fair point but still seems odd that the people providing updates took the weekend off during a large scale customer impacting issue. I'm sure all the people spending the weekend trying to mitigate the impact of this on their infrastructure would love to have timely updates.

2 more replies

trhway7y ago

They have whole book describing who fixes, provides updates, etc. Fun meditative reading while waiting for the outage to get fixed.

https://landing.google.com/sre/sre-book/chapters/managing-in...

Looks like this time Mary took the whole week off without telling Josephine :)

rlancerOP7y ago

Status page is inaccurate as issues doesn't only affect the web UI, the same operations are not functioning via the CLI.

pm907y ago

Its kinda strange that HN seems to be the most effective way to give feedback to Google Cloud :/

Draiken7y ago

I also find it weird that on HN where normally people are very skeptical of any argument without data backing it, when it comes to this outage, people are assuming everything written here affects everyone.

Perhaps some of the issues are localized? Perhaps it's even user error (it happens, you know?). But because a small amount of HN users say "it's everywhere!" then suddenly people reach for their pitchforks.

Sometimes we just don't have all the information.

breakingcups7y ago

Most transparent, at least.

kenan_warren7y ago

Yeah almost all regions and zones for any compute instance have been exhausted since about 1pm PST on Friday. I finally got one up last night on us-east1, but my older cluster is basically SOL until it's fixed on us-west1. It went down for an upgrade and never came back up because of the same resource issue.

camhutch7y ago

I just tried turning up my 1-node test cluster via terraform, and it worked fine. I would have thought the gcloud CLI would be using the same API.

I did this in the australia-southeast1-a zone.

base6987y ago

What operations? Status just shows node pool creation.

rlancerOP7y ago

Can not create a new Clusters or Node Pool and can not resize exiting Node Pools, as far as users are reporting it's happening in all regions too.

Error message when creating a new Cluster:

Deploy error: Not all instances running in IGM after 35m7.509000994s. Expect 1. Current errors: [ZONE_RESOURCE_POOL_EXHAUSTED]: Instance 'gke-cluster-3-pool-1-41b0abf8-73d7' creation failed: The zone 'projects/url-shortner-218503/zones/us-west2-b' does not have enough resources available to fulfill the request. Try a different zone, or try again later. - ; .

1 more reply

scarface747y ago

A generic question: Our company is completely dependent on AWS. Sure we have taken all of the standard precautions for redundancy, but what happened here could just as easily happen with AWS - a needed resource is down globally.

What would a small business do as a contingency plan?

eknkc7y ago

This might be an unpopular opinion but,

Going multi region on AWS should be safe enough.

If a multi region, multi service meltdown happens on AWS, it will feel like most of the internet has gone down to a lot of users. Being such a catastrophic failure, I bet the service will be restored pretty fast, not in 3 days.

You could go multi cloud though. But when half of the internet struggles to work correctly, I’d not feel too bad about my small business’ downtime.

thaumasiotes7y ago

> If a multi region, multi service meltdown happens on AWS, it will feel like most of the internet has gone down to a lot of users. Being such a catastrophic failure, I bet the service will be restored pretty fast, not in 3 days.

Additionally, from a "nobody ever got fired for buying IBM" perspective, you're unlikely to catch much blame from your users for going down when everyone else was down too.

hacknat7y ago

Yeah, AWS has never had a global outage, Google has had 2 now.

1 more reply

owenmarshall7y ago

> You could go multi cloud though.

Multi cloud is almost always more pain than gain. You’d spend time and effort abstracting away the value that a cloud provider brings in canned services.

Hell, multi region is often more than many workloads need.

1 more reply

mey7y ago

First, determine your tolerance. How much does downtime per min cost you? How long can you be down? A lot of the time it may be cheaper to apologize to your customers than build a truly reliable system.

Then start looking at points of failure and sort them based on severity and probability. Is your own software deployment going to generate more downtime per year than a regional aws outage?

There are formal academic ways to determine what your overall availability is, but don't have those on hand. Suffice to say, it takes significant research, planning, execution, and testing to ensure a target availability. (See Netflix https://medium.com/netflix-techblog/the-netflix-simian-army-... ) if someone says they have 99.9% or better up time, they had better have proof in my mind (or a fat SLA violation payout)

People outsource to cloud providers not because they are cheap, but because managing infra in house is hard. Also move fast and break things.

Read AWS docs about availability, there are availability zones in a region, spread across those to minimize impact. Then test when something goes down. Fix/repeat.

devonkim7y ago

People outsource to cloud providers because building / hiring / maintaining a team of decent engineers that provide a baseline industry bar of SLAs, SLOs is much more expensive than the eye watering costs of most cloud providers at even a IaaS level. Opex is tough.

Most companies I’ve been at don’t offer multi region support for their services because it’s too expensive for the service provided even in so-called “price insensitive” enterprises (you can’t just make up a price that’s huge, they do have budgets still) and most of their customers are unwilling / unable to pay more for the extra availability. If your software is designed from the start better, multi region failovers should be fairly inexpensive though. But all the bolted on “multi region” software I’ve seen has been hideously expensive and oftentimes less reliable due to the design being soundly not able to tolerate failures well.

2 more replies

gaius7y ago

If all your competitors are down too, the equation changes, no?

1 more reply

yjftsjthsd-h7y ago

Basically, you have three options. You can go full milti cloud, with all the expenses and overhead that entails. You can have everything run in one place but have a plan to switch to a backup system. Or you can look at the overhead and cost associated with those, and decide that it's not worth it. If the business can handle the costs and risks, then any of them can be a valid option.

dgemm7y ago

Assuming you have architected multi-region I'm not sure how realistic that scenario is. AWS regions are mostly standalone, I have seen services go down in a region but never globally.

scarface747y ago

The most recent example that I can think of when something went down globally was Route 53 - the one service that AWS promises 100% up time for.

2 more replies

apapli7y ago

Very little, unless that small business has a big enough budget to design something that spans multiple clouds.

Ultimately it’s a risk/return decision.

“Is going exclusively with AWS/azure/GCP etc a better decision in reliability, financial and mantainability terms than complicating the design to improve resiliency? And will this more complex solution actually improve reliability?”

user59944617y ago

You need to have a secondary location for backups, not AWS. Stores copie of customer data, orders, accounts, balances and anything that is critical to the business.

If AWS ever screws up, you will be able to continue running the business even if it might take weeks to start over.

For live redundancy, you should have a secondary datacenter on another provider, but realistically it's hard to do and most business never achieve that. Instead, just stick with AWS and if there is a problem the strategy is to sip coffee while waiting for them to resolve it. Much better this way than you having to fix it yourself.

viraptor7y ago

> What would a small business do as a contingency plan?

Depends on your definition of small. If it's small enough not to have a dedicated infrastructure team designing multicloud solution, then the contingency plan may be: switch DNS to a static site saying "we're down until AWS fixes the issue, check back later".

Otherwise it depends on your specific scenario, your support contracts, and lots of other things. You need to decide what matters, how much the mitigation costs vs downtime, and go from there.

dabei7y ago

Using Kubernetes is a good start. It should be easy to migrate your server between EKS and GKE. However data is trickier to move around. So you won’t be immune from all global outages.

aviv7y ago

We invested early in being multi-region on GCP as well as multi-cloud with AWS as a fully redundant option if it ever became necessary to fail over to them.

adevx7y ago

Paying off big-time now :-)

1 more reply

xchaotic7y ago

Roll the dice? What are the consequences for you, especially if you can shift the blame? What are the odds of you having better uptime rolling your own tooling? Can you afford the complexity of multi cloud? Is the added complication worth it?

tanilama7y ago

Unpopular opinion: very little perhaps. You have to make sure all your your 3rd party dependencies have the same contingency plan as you do, but I guess it is going to be difficult to even figure that out...

Mtinie7y ago

Multi-Cloud-Service-as-a-Service application redundancy wrappers.

I wish I was only being tongue-in-cheek.

halbritt7y ago

My entire production infrastructure is in GCP. What happened here has caused approximately zero impact to the availability of my service.

geggam7y ago

Infrastructure as code.

Terraform using AMIs plus chef recipes that work in the cloud and bare metal. Dont use AWS specific services.

This would allow you to spin over to another cloud provider , vsphere or bare metal with minimal work

cloakandswagger7y ago

Totally impractical for any small business and of questionable usefulness for large ones. You'd be giving up the largest benefit of platforms like AWS--ready to use services for common tasks--to avoid the infinitesimally small chance of AWS having some doomsday global outage.

To answer the original question: It looks like this issue was just a UI bug that affected the console, the service itself wasn't impacted. Events that do impact the service will be contained to a region, meaning you can mitigate it with proper redundancy across regions, no zany multi-cloud solution required.

1 more reply

scarface747y ago

That’s not how Terraform works. Each provisioner has separate syntax depending on the cloud provider. The template for AWS wouldn’t work on GCP or Azure.

1 more reply

xchaotic7y ago

I think you are downplaying minimal

2 more replies

scarface747y ago

And this is the reason that people end up spending so much more in the cloud - treating AWS like a glorified colo.

rlancerOP7y ago

UPDATE: Got some clarity, these issues are caused by "resource exhaustion" meaning there are no resources left to be allocated.

halbritt7y ago

I'm curious to see if this is true.

I faced some pretty serious resource allocation issues earlier in the year. The us-west1-a region was oversubscribed. I was unable to get any real information from support with regard to capacity. Eventually my rep gave me some qualitative information that I was able to act on.

7ewis7y ago

I honestly don't mind if providers have outages - we can't expect 100.00% accuracy, I know the systems I manage certainly don't achieve that.

One thing I do care about though, is root cause analysis. I love reading a good RCA, it restores my faith in the company and makes me trust them more.

(I'm not affect by the GKE outage so opinions may differ right now!)

locusm7y ago

Do not use GCP without paying for support. We have had resource allocation errors for weeks, as have a lot of other people. Check out the posts in their forum where folk on basic support get zero love. https://groups.google.com/forum/?utm_medium=email&utm_source...

thwy123217y ago

Been trying to spin up vm instances all day, had to try every single zone just to get one up. Not only is this incredibly harmful to a technology business dependent on this infra, it wasnt obvious to me what the issue was until I tried creating instances. Nothing says, hey resources are constrained here, try this one. Just about ready to bite the bullet and move to aws.

pfd19867y ago

Same here. We have spent 2 days trying to create instances and migrate images just to figure out later they can't start.

Right when I convinced our project to get migrated from AWS...

Masiosare7y ago

Same question... why would you do that? AWS is super stable most of the time. I have been running k8s over EC2 (not eks) for a year and works like a charm. I've even run experiments using spot instances and it's pretty good (no guarantee there of course).

1 more reply

tigershark7y ago

Why on earth would you do that, unless you had huge problems on aws?

sladey7y ago

Seems to be some weird underlying issue going on at GCP at the moment. Had cloud build webhooks returning a 500 error. Noticed we were at 255 images and deleting some fixed the issue. Created a P2 ticket about the issue before we managed to solve it and haven't had a response in 40+ hours.

The timeline of this disruption matches when we started experiencing cloud build errors.

lstamour7y ago

Outsider here, but I believe Cloud Build runs on GKE Jobs, so if they’re having trouble, it does indeed sound related.

ernsheong7y ago

"third consecutive day of service disruption" is not an accurate statement? Latest update was Nov 11 saying things resolved on Nov 9.

https://status.cloud.google.com/incident/container-engine/18...

ernsheong7y ago

If all nodes in GKE clusters were down for 3 days, I would consider this newsworthy and shocking. This... is not. Come on, people.

013a7y ago

Cloud providers have all of the potential in the world to make each region truly isolated. I shouldn't have to architect my application to be multi-cloud, at least for stability reasons.

Yet, somehow every major cloud provider experiences global outages.

That old AWS S3 outage in us-east-1 was an interesting one; when it went down, many services which rely on S3 also went down, in other regions beside us-east-1 because they were using us-east-1 buckets. I have a feeling this is more common than you'd think; globally-redundant services which rely on some single point of geographical failure for some small part.

threeseed7y ago

AWS regions are very much isolated from each other.

We know because we are still waiting here in ap-southeast-2 for services such as EKS to be made available. Pretty sure that any reliance within their backend services on us-east-1 was just a temporary bug and nothing systemic.

spiderPig7y ago

Our company is dependent on this as well and the way customer service has been handling this has been abysmal thus far.

qaq7y ago

There is no magic public clouds have incredibly complex control planes and marketing fluff aside you would very likely experience much better uptime at singe top tier DC than @ a cloud provider.

arunoda7y ago

The is not only GKE. But for GCE as well. I cannot create instance is almost all zones. I tried both preemptible and normal as well.

Always saying resource not available. My account is a pretty new account.

In contrast, one of my friend is having a pretty old account which is very active. He has no such issue.

So I think due to this issue, Google has enabled some resource limitation for new accounts.

But they should properly communicate this issue.

gigatexal7y ago

Oh man must be a tough time to be an SRE at google cloud. But... they’re Google. They have been doing internal cloud for years and years. Borg — which K8s is a reimplementation if — has been the heart of Google for so long now you’d think they’d be able to architect their systems to have no outages whatsoever. I mean nobody is perfect but this looks bad.

Jedi727y ago

Goes to show outsourcing infrastructure is more about blame shifting so that when things go wrong its "not our fault" than reducing actual downtime.

closeparen7y ago

Doesn’t GKE “just” run an independent Kubernetes cluster on customer VMs? How is a widespread outage like this possible?

regnerba7y ago

GKE does the creation of the VMs and setup of them, joining them to the cluster and applying labels for example.

The specific issue appears to be about creating new "node pools". Creating standard VMs in GCP works fine however, so this is specific to GKE and their internal tooling that integrates with the rest of GCP.

GKE doesn't (at least to my knowledge) allow you to create VMs separately and join them to the cluster in any kind of easy fashion.

kenan_warren7y ago

It's actually not just GKE, there have been issues creating normal VMs since late Friday night. It seems anything that required creating VMs gave back resource exhaustion errors. I finally got a cluster for us-east1 setup last night so it looks like the resource issues are clearing up though.

raincom7y ago

Nope, GKE = master/control plane owned by Google. Customers are just tenants, who can schedule workloads.

rlancerOP7y ago

GKE gives you a fully managed Master Node.

fizzledbits7y ago

As of this morning, I am still unable to reliably start my docker+machine autoscaling instances. In all cases the error is "Error: The zone <my project> does not have enough resources available to fulfill the request"

An instance in us-central1-a has refused to start since last Thursday or Friday.

I created a new instance in us-west2-c, which worked briefly but began to fail midday Friday, and kept failing through the weekend.

And yet, the status page says all services are available.

Is the typical of others' experiences?

wijowa7y ago

Right now we're experiencing an issue where a small percentage of end users on our GKE site are getting super slow speeds. The issue is ISP related as they can switch to a 4G hot spot in the same location and get normal speeds... and inside our system the timing looks normal. So there's a slowdown either TO the load balancer or FROM the load balancer. Took a week to convince Google's support contractor to even believe it wasn't an issue with our site and their advice is generally along the lines of Turn it off and Turn it back on again (which might actually fix the problem) though that's easier said than done in GCP.

nielsole7y ago

I use preemptible machines in autodialing and for first time did not have any machines available for multiple hours yesterday. I am wondering whether this falls under the normal preemptible behaviour or this service degradation.

wb3tech7y ago

If anyone is interested, here is my documented experience with this issue. I freaking love GCP and GKE, although I have not production environment as it was a HA cluster in us-central1. Working federation now.

https://stackoverflow.com/questions/53244471/gke-cluster-won...

regnerba7y ago

Is this just about creating new pools? I haven't noticed an issue with our existing pools scaling.

rlancerOP7y ago

You were able to add more Nodes to you're pool? Are you using any auto scaling?

_wmd7y ago

When guerilla marketing backfires

bdibs7y ago

As someone currently trying to decide between GCP and AWS for a project, is this a regular occurrence?

And for those who have used both, which would you go with today?

franky_g7y ago

Had it affected all regions or just some?

Is there another status page Google? Coz the last update I'm looking at...is dated on the 9th..

justinsb7y ago

The general page is at https://status.cloud.google.com/; you can scroll down to see GKE, and my (unofficial) belief is that https://status.cloud.google.com/incident/container-engine/18... should have closed out https://status.cloud.google.com/incident/container-engine/18...

_If_ that's the case, something else is causing the error messages other people are seeing

fulafel7y ago

Offtopic but are there some documented exceptions to the "keep the original title" rule?

whatshisface7y ago

Why do cloud providers have more global outages than major flagship websites like google.com?

betaby7y ago

Whey don't run on the same infra. Amazon.com doesn't run on AWS.

talonx7y ago

On the contrary, it does. They made the transition gradually.

fergie7y ago

Things break after everybody has gone home on a Friday? 3 day disruption.

thomasfl7y ago

I'd like to upvote, but 666 points seemed relevant.

haosdent7y ago

Time to use Mesos.

shiftnight7y ago

I have a question. At what point does k8s make sense?

I have a feeling that a microservice architecture is overkill for 99% of businesses. You can serve a lot of customers on a single node with the hardware available today. Often times, sharding on customers is rather trivial as well.

Monolith for the win! Opinions?

_asummers7y ago

K8s is nice even without microservices. Yeah you don't get nearly the benefits you would in a microservice architecture, but I consider it a control plane for the infrastructure, with an active ecosystem and focus on ergonomics. If you have a really simple infrastructure, you will still need to script spinning up the VMs, setting up the load balancing, etc. but K8s gives you a homogenous layer upon which to put your containers. It's not too much of an overkill, especially with a hosted K8s from e.g. Google, AWS, and soon Digital Ocean and Scaleway.

Things like throwing another node into the cluster, or rolling updates are free, which you would otherwise need to develop yourself. All of that is totally doable, of course, but I like being able to lean on tooling that is not custom, when possible.

When your infrastructure does need to become more complicated, you're already ready for it. Even if I were only serving a single language, starting with a K8s stack makes a lot of sense, to me, from a tooling perspective. Yeah normal VMs might be simpler, conceptually, but I don't consider K8s terribly complicated from a user perspective, when you're staying around the lanes they intend you to stay in. Part of this may also be my having worked with pretty poor ops teams in the past, but I think K8s gives you a really good baseline that gives pretty good defaults about a lot of your infrastructure, without a lot of investment on your part.

That said, if you're managing it on a bare metal server, then VMs may be much easier for you. K8s The Hard Way and similar guides go into how that would work, but managing high availability etcd servers and the like is a bit outside my comfort zone. YMMV.

polemic7y ago

There's a huge range between monolith and microservice approach, and even a monolith will have dependent services. A simple web stack these days might include nginx, a database, a caching layer, some sort of task broker and then the 'monolith' web app itself. All of that can be sanely managed in k8s.

james-mcelwain7y ago

Right... IMO monolith is better understood as a reference to the data model than deployment topology. If you only have a single source of truth, then your application is naturally going to trend towards doing most of its business logic in one place. This still doesn't displace the need for other services like caching, async tasks, etc., that you identify.

scarface747y ago

I definitely wouldn’t be managing my own database or caching layer without a very good reason. I would use a managed service if I were using a cloud provider.

013a7y ago

I hate the word Microservice, so I'm just going to use the word Service.

Most monoliths software companies build aren't actually monoliths, conceptually. Let's say you integrate with the Facebook API to pull some user data. Facebook is, within the conceptual model of your application, a service. Hell, you even have to worry "a little bit" about maintaining it; provisioning and rotating API keys, possibly paying for it, keeping up to date on deprecations, writing code to wire it up, worrying about network faults and uptime... That sounds like a service to me; we're three steps short of a true in-house service, as you don't have to worry about writing its code and actually running it, but conceptually its strikingly similar.

Facebook is a bad example here. Let's talk Authentication. Its a natural "first demonolithized service" that many companies will reach to build. Auth0, Okta, etc will sell you a SaaS product, or you can build your own with many freely available libraries. Conceptually they fill the same role in your application.

Let's say you use Postgres. That's pretty much a service in your application. A-ha; that's a cool monolith you've got there, already communicating over a network ain't it. Got a redis cache? Elasticsearch? Nginx proxy? Load balancer? Central logging and monitoring? Uh oh, this isn't really looking like a monolith anymore is it? You wanted it to be a monolith, but you've already got a few networked services. Whoops.

"Service-oriented" isn't first-and-foremost a way of building your application. It's a way of thinking about your architecture. It means things like decoupling, gracefully handling network failures, scaling out instead of up, etc. All of these concepts apply whether you're building a dozen services or you're buying a dozen services.

Monolithic architectures are old news because of this recognition; no one builds monoliths anymore. It's arguable if anyone ever did, truly. We all depend on networked services, many that other people provide. The sooner you think in terms of networked services, the sooner your application will be more reliable and offer a superior experience to customers.

And then, it's a natural step to building some in-house. I am staunchly in the camp of "'monolith' first, with the intention of going into services" because it forces you to start thinking about these big networking problems early. You can't avoid it.

hacknat7y ago

This outage really doesn’t have much to do with K8s.

shiftnight7y ago

Maybe so, but you won't be affected by this outage if you never decided to deploy k8s in the first place.

Even if you deploy k8s privately, or over at Amazon, I think there's enough horror stories to make you think twice about the technology.

Then, if it isn't going to be k8s for microservices, what's a more reliable alternative?

nstart7y ago

As someone whose daily work happens on k8s, I'd say you better be paining a lot before you move to k8s. I take great care to avoid this, but if you aren't careful, you can end up "feeling" productive on k8s without actually being productive. K8s gives a lot of room for one to tweak workflows, discuss deployment strategies, security, "best practices", etc. And you can get things done reasonably fast. But that's like a developer working all day on fine tuning their editor and comparing and writing plugins and claiming that they are getting productive.

The key issue here is that k8s was written with very large goals in mind. That a small business can easily spin it up quickly and run a few microservices or even a monolith + some workers is just coincidental. It is NOT the design goal. And the result of that is that a lot of the tooling and writing around k8s reflects that. A lot of the advice around practices like observability and service meshes comes from people who've worked in the top 1% (or less) of companies in terms of computing complexity. What I'm personally seeing is that this advice is starting to trickle down into the mainstream as gospel. Which strangely makes sense. No one else has the ability to preach with such assurance because not many people in small companies have actually been in the scenarios of the big guns. The only problem is that it's gospel without considering context.

So at what point does k8s make sense? Only when you have answers to the following:

* Getting started is easy, maintaining and keeping up with the going ons is a full time job - Do you have at least 1 engineer at least that you can spare to work on maintaining k8s as their primary job? It doesn't mean full time. But if they have to drop everything else to go work on k8s and investigate strange I/O performance issues, are you ready to allow that?

* The k8s eco system is like the JS framework ecosystem right now - There are no set ways of doing anything. You want to do CICD? Should you use helm charts? Helm charts inherited from a chart folder? Or are you fine using the PATCH API/kubectl patch commands to upgrade deployments. Who's going to maintain the pipeline? Who's going to write the custom code for your github deployments or your brigade scripts or your custom in house tool? Who's going to think about securing this stuff and the UX around it. That's just CICD mind you. We aren't anywhere close to the weeds about deciding if you want to use ingresses vs Load balancers and how you are going to run into service provider limits on certain resources. Are you ready to have at minimum 1 developer working on this stuff and taking time to talk to the team about it?

* Speaking about the team, k8s and Docker in general is a shift in thinking - This might sound surprising but the fact that Jessie Frazelle (y'all should all follow her btw) is occasionally seen reiterating the point that containers are NOT VM's is a decent indicator that people don't understand k8s or Docker at a conceptual level. When you adopt k8s, you are going to pass that complexity to your developers at some point. Either that or your dev ops team takes on that full complexity and that's a fair amount to abstract away from the developers which will likely increase the work load of devops and/or their team size. Are you prepared for either path?

* Oh also, what do your development environments start to look like? This is partly related to microservices but are you dockerizing your applications to work on the local dev environment? Who's responsible for that transition? As much as one tries to resist it, once you are on k8s you'll want to take advantage of it. Someone will build a small thing as a microservice or a worker that the monolith or other services depend on. How are you going to set that up locally? And again, who's going to help the devs accumulate that knowledge while they are busy trying to build the product. (Please don't put your hopes on devs wanting to learn that after hours. That's just cruel).

I can't write everything else I have in mind on this topic. It'd go on for a long long time. But the common theme here is that the choice around adopting k8s is generally put on a table of technical pros and cons. I'd argue that there's a significant hidden cost of human impact as well. Not all these decisions are upfront but it is the pain that you will adopt and have to decide on at some point.

Again, at what point does k8s make sense? Like I said, you ideally should be paining before you start to consider k8s because for nearly every feature of k8s, there is a well documented, well established, well secured parallel that already exists in the myriad of service providers. It's a matter of taking careful stock of how much upfront pain you are trading away for pain that you WILL accumulate later.

PS - If anyone claims that adopting a newer technology is going to make things outright less painful , that's a good sign of immaturity. I've been there and I picture myself smashing my head into a table every now and then when I think of how immature I used to be. Apologies to people I've worked with at past jobs.

PPS - From the k8s site, "Designed on the same principles that allows Google to run billions of containers a week, Kubernetes can scale without increasing your ops team." <-- is the kind of claim that we need to take flamethrowers to. On paper, 1 dev with the kubectl+kops CLI can scale services to run with 1000's of nodes and millions of containers. But realistically, you don't get there without having incurred significantly more complex use cases. So no, nothing scales independently.

raarts7y ago

I fully agree with you, and personally have taken the path of using Docker Swarm as a step-up to k8s, as it was so much easier to get along with. I would certainly recommend this to smaller businesses.

vesak7y ago

>The k8s eco system is like the JS framework ecosystem right now - There are no set ways of doing anything.

Given how both the JS and devops worlds seems to be progressing, is there any reason to believe that this will change before the next thing comes and K8S becomes a ghost town?

bg247y ago

Very nicely written. While not a direct response to OP, you articulated some great points on k8s. k8s will naturally succeed as the future of data center orchestration as VM's give way to containers. But it is questionable if everyone needs it.

kazen447y ago

i agree with you on major points.

Also, migrating to microservices for existing services might not be worth it, especially if you don't operate at a massive scale.

Keep it simple stupid is still a solid design decision, despite all the microservice/container hype.

Most bussinesses only need a couple of servers that provide the service, spread redundantly with a HA capability.

aaaaaaaaaab7y ago

Daily reminder that there's no "cloud", just other people's computers. ( ͡° ͜ʖ ͡°)

spullara7y ago

If a hosting service is down and nobody uses it, is there really any disruption?

j / k navigate · click thread line to collapse

407 comments

shareometry7y ago

I am currently evaluating GCP for two separate projects. I want to see if I understand this correctly:

4) Some users here are reporting that other GCP services not mentioned by Google's blog are experiencing problems.

5) Some users here are reporting that they have received no response from GCP support, even over a time span of 40+ hours since the support request was submitted.

6) Google says they'll provide some information when the next business day rolls around, roughly 4 days after the start of the problem.

I really do want to make sure I'm understanding this situation. Please do correct me if I got something wrong in this summary.

manigandham7y ago

When everything works, GCP is the best. Stable, fast, simple, reliable.

When things stop working, GCP is the worst. Slow communications and they require way too much work before escalating issues or attempting to find a solution.

laurencei7y ago

As someone who works for Government and Enterprise - all I care about sometimes is how a company behaves when everything goes wrong.

Being able to tell the Directors/Senior managers that issues have been "escalated" and providing regular updates are critical.

4 more replies

Osiris7y ago

"Support costs" calculation often doesn't include the costs of not having support.

When I worked at GoDaddy, there were around 2/3 of the company was customer support.

All of my interactions with Amazon support have been resolved to my satisfaction within 10 minutes or less.

Companies really ought to do the math on the value that comes from providing fast, timely, and easy (don't have to fight with them) customer support.

Google hasn't learned this lesson.

1 more reply

jgalentine0077y ago

With Dell you can certify with them so you can get replacement parts and such without the BS back and forth with some guy in india. Saves everyone time and money.

1 more reply

dvdgsng7y ago

> instead they keep sending tickets back for "more info"

2 more replies

softawre7y ago

Heh, your "test" reminds me of an old Hanselman article:

https://www.hanselman.com/blog/FizzBinTheTechnicalSupportSec...

ElBarto7y ago

To say "when it works it's stable and reliable" implies that it is neither...

1 more reply

dilyevsky7y ago

icelancer7y ago

I think a lot of people give a fuck here; I do, at least. Thanks for outlining it, these things are fascinating (to me anyway, who has never worked in IT/ops).

navinsylvester7y ago

We are GCP customers for the last couple of years. We use other cloud platforms(AWS, IBM, Oracle, OrionVM) too. We don't use GKE but use rancher/kubernetes combo on their standard platform.

rogerkirkness7y ago

I agree with this. Compared to AWS, when Google says it's down, it's down, and that's rare. When they say it's up, it's up.

vira287y ago

I use AWS free tier and get customer support through email, but thats not the case with GCP. Do they provide free email support?

If you are an early stage startup can you afford their 200/Month support, when your entire GCP bill is under $1. However, that doesn't mean you don't have to support them.

2 more replies

ToFab1237y ago

I don't understand why someone would choose to deploy anything mission critical without having an support contract with the ISP, the manufacturer of the the software etc.

3 more replies

lilbobbytables7y ago

You're doing me a scare. I'm in the evaluation phase with them. Maybe I'm missing something here, but this is not at all what the linked post says.

"We are investigating an issue with Google Kubernetes Engine node pool creation through Cloud Console UI."

So, it's a UI console issue, it appears you can still manage

"Affected customers can use gcloud command [1] in order to create new Node Pools. [1]"

Similarly, it actually was resolved in Friday, but they forgot to mark it as so.

"The issue with Google Kubernetes Engine Node Pool creation through the Cloud Console UI had been resolved as of Friday, 2018-11-09 14:30 US/Pacific."

shareometry7y ago

I'm in the evaluation phase too. And I've found a lot to like about GCP. I'm hoping the problems are understandable.

haldora7y ago

I've been failing all weekend to create nodes in a GKE cluster through either the UI console or gcloud. Even right now I can't get any nodes to spin up.

Edit: I finally got my cluster up and running by removing all nodes, letting it process for a few minutes, then adding new nodes.

timdumol7y ago

We've had no issues deleting and creating node pools this weekend (on asia-east1-a). No other problems noticed either.

fizzledbits7y ago

I created a new instance in us-west2-c, which worked briefly but began to fail midday Friday, and kept failing through the weekend.

And yet, the status page says all services are available.

raincom7y ago

If you run your own k8s on GCP, you are not going to be affected by GKE.

aviv7y ago

tejohnso7y ago

> For three whole days, it was questionable whether or not a user would be able to launch a node pool (according to the official blog statement)

What blog statement are you referring to? I don't see any such statement. Can you provide a link?

So it sounds like a web interface problem, not a severely limiting, backend systems problem with global scope.

> Some users here are reporting that other GCP services not mentioned by Google's blog are experiencing problems

I don't see much of that.

paulddraper7y ago

I believe the OP was referring to the very same blog (web log) you cited.

https://status.cloud.google.com/incident/container-engine/18...

"We are investigating an issue with Google Kubernetes Engine node pool creation through Cloud Console UI."

> So it sounds like a web interface problem, not a severely limiting

Depends who you as to whether this is "severely" limiting, but yes there is a workaround by using an alternate interface.

marcinzm7y ago

Right now we don't know. It's one of two possibilities from what I can tell:

b) People are having various unrelated issues with services that they're mis-attributing to a global service disruption.

johnpython7y ago

ernsheong7y ago

As if something like this has never happened to AWS?

2 more replies

halbritt7y ago

I'm not sure about the market share, but I agree with the last two sentences.

...and I'm a happy GCP customer.

rorykoehler7y ago

meow_mix7y ago

I think you're missing the portion about how it only appears to be the console ui, no?

ransom15387y ago

human_error7y ago

> When an entire region is down what I have noticed is that all things are fucked globally on aws.

Do you have an example on this?

2 more replies

usmannk7y ago

Jedi727y ago

They work for Google so obviously they are much smarter than you. If theres a problem its probably the customers fault. /sarcasm

mbrumlow7y ago

I was so mad to read that until you said /sarcasm :p

That being said I really do think there is a difference between who is working at google today and the google we all fell in love with pre-2008.

I am sure there are a amazing people still working at google, but nowhere near like it was.

/end rant

3 more replies

jkaplowitz7y ago

Many of the tier 1 GCP support reps work for external vendors nowadays, which is probably part of the problem.

The problem is simply about how Google scales their GCP support org. To be completely clear, AWS support is by and large not great either.

If you're a big or strategically important customer, of course, you can get a good response from either company.

2 more replies

lstamour7y ago

reaperducer7y ago

Again, because their support reps don't believe there's a problem.

Perhaps if you explained it on a whiteboard...

navinsylvester7y ago

In general GCP has quota limits so its expected of customers to catch 5xx and do exponential back off. But this info is not explicitly stated.

halbritt7y ago

Did they ask you for a screenshot?

That's my favorite.

StreamBright7y ago

I just took a screenshot of this and any time somebody asks my why AWS i will forward them. Thanks!

Jedi727y ago

"The data says engagement is down 46%, I think its time we drop the product."

- Someone at Google right now, probably.

justinsb7y ago

(I work at Google, on GKE, though I am not a lawyer and thus don't work on the deprecation policy)

chrisseaton7y ago

> Google may discontinue any Services or any portion or feature for any reason at any time without liability to Customer

for any reason

at any time

4 more replies

brian-armstrong7y ago

What happens when they suddenly deprecate the deprecation policy?

1 more reply

whydoineedthis7y ago

im pretty sure he just forget the /s (sarcasm) on his post, but this was pretty cool information anyway, so thanks!

davemp7y ago

1 more reply

justinsb7y ago

Hi - I work at Google on GKE - sorry about the problems you're experiencing. There's a lot of people inside Google looking into this right now!

antpls7y ago

stevehawk7y ago

1 more reply

carbocation7y ago

> I was unable to create any google compute instance today, not even a basic 1vcpu, on NA and Europe-west.

I've been creating GCP instances in us-central1-a and us-central1-c today without issue. Which zone were you using in NA?

I have been noticing unusual restarts, but I haven't been able to pin down the cause yet (may be my software and not GCP itself).

2 more replies

0xbadcafebee7y ago

> I thought the "cloud" was supposed to make this kind of failure impossible

aiisjustanif7y ago

> I thought the "cloud" was supposed to make this kind of failure impossible

If set up properly to be utilized correctly, yeah. But, it's not a perfect world though.

davemp7y ago

I’ll suggest considering whether entities enamored with centralizing ideals are more likely to fail to properly realize the robustness of a distributed system.

aviv7y ago

We have created GCE instances in several US regions without any issue today. Last one was 10 minutes ago in west2.

marcinzm7y ago

NicoJuicy7y ago

The default is : https://status.cloud.google.com/incident/container-engine/18...

People who respond here could be employees of Google, caring about it and respond here because they know it.

What he can mention ( a lot of people are working on it) is what you can suspect when something is going down. All other cloud providers do the same.

1 more reply

trhway7y ago

>really amateur for an organization the size of google.

rdtsc7y ago

Strangely and sadly with gmail account blocking and other such issues HN and Twitter is often better way to get Google's support than to contact support.

ben_jones7y ago

2 more replies

tomcam7y ago

rlancerOP7y ago

Creating clusters via the UI is still not working for me.

rlancerOP7y ago

UPDATE: Created a Cluster successfully in Australia... Still not able to do so in the US.

zachberger7y ago

Have you tried via the gcloud command?

fizzledbits7y ago

I created a new instance in us-west2-c, which worked briefly but began to fail midday Friday, and kept failing through the weekend.

And yet, the status page says all services are available.

dilyevsky7y ago

hacknat7y ago

Question to Google employees:

illumin87y ago

The problem is that running a global SDN like this means if you do something wrong, you can have outages that impact multiple regions simultaneously.

Disclaimer: I work for AWS, but my opinions are my own.

tehlike7y ago

2 outage in 5 years sounds pretty low, to be honest.

noselasd7y ago

The issue people have with it is that it's global, not regional, indicating that there are dependencies in the entire architecture that people does not expect to be there.

3 more replies

yeukhon7y ago

5 years? I remember a major outage maybe in the past year.

1 more reply

origami7777y ago

Most feature rich cloud? I think that title belongs to AWS.

jkaplowitz7y ago

(Disclosure: I worked for Google, including GCP, for a few years ending in 2015. I don't work or speak for them now and have no inside info on this outage.)

illumin87y ago

I think you're going to have to back up a claim like this with some facts.

GKE being the exception, since it was launched a couple years before EKS. AWS clearly has way more services, and the features are way deeper than GCP.

Disclaimer: I work for AWS but my opinions are my own.

1 more reply

hacknat7y ago

_wmd7y ago

It is a natural effect of building massive yet flat homogeneous systems, failures tend to be greatly amplified.

Most of what you can read of Google's approach will teach you their ideal computing environment is a single planetary resource, pushing any natural segmentation and partitioning out of view.

toomuchtodo7y ago

> I’m sorry to say this, but it is the equivalent of going bankrupt from a trust perspective.

It's the opposite really: the expectation that service providers have no unexpected downtime is unrealistic, and it's strange this idea persists.

Twirrim7y ago

(disclaimer: I work for another cloud provider)

I agree, in general, outages are almost inevitable, but global outages shouldn't occur. It suggests at least a couple of things:

or:

2) Regions aren't isolated from each other. Cross-region dependencies are bad, for all sorts of obvious reasons.

2 more replies

manigandham7y ago

The major issue is that outages are global instead of regional, effectively making it impossible to design around using the typical region/zone redundancy.

tw047y ago

Because they sell themselves as being far more reliable than internal IT. If they weren't selling on uptime, people probably wouldn't be quite so critical of downtime.

1 more reply

rodgerd7y ago

The pitch from cloud vendors always includes the idea that the cloud is more reliable than any in-house shop can achieve. So the expectation is set by the vendors.

chumboslice7y ago

2 outages in 5 years.

5. Years.

Nothing to see here, move along.

talonx7y ago

2 "global" outages. If it had been limited to a service, or a region, there would be nothing to see.

1 more reply

hartem_7y ago

I’d be curious to know what alternatives are you considering at this point?

hacknat7y ago

Azure and AWS.

tomcam7y ago

I believe this is a fair question. I’d really like to understand what Google thinks about this.

asdfasgasdgasdg7y ago

rlancerOP7y ago

Was certainly not UI only

1 more reply

scarface747y ago

No one would ever ask why you chose AWS. The old “no one ever got fired for buying IBM”.

Even if you chose Azure because you’re a Microsoft shop, no one would question your choice of MS. Besides, MS is known for thier enterprise support.

013a7y ago

- Native integration with G-Suite as an identity provider. Unified permissions modeling from the IDP, to work apps like email/Drive, to cloud resources, all the way into Kubernetes IAM.

- GKE. Its miles ahead of the competition. If you're on Kubernetes and its not on GKE, then you've got legacy reasons for being where you're at.

majewsky7y ago

> If you're on Kubernetes and its not on GKE, then you've got legacy reasons for being where you're at.

1 more reply

scarface747y ago

From a staffing standpoint, you can throw a brick and hit someone who at least thinks they know something about AWS or Azure GCP, not so much.

It’s not about which company is technically better, but I didn’t want to ignore your technical arguments...

Native integration with G-Suite as an identity provider. Unified permissions modeling from the IDP, to work apps like email/Drive, to cloud resources, all the way into Kubernetes IAM.

You can also do this with AWS - use a third party identity provider and map them to native IAM user and roles.

https://docs.aws.amazon.com/IAM/latest/UserGuide/id_roles_cr...

The equivalent would be spot instances on AWS.

All of the cloud providers have managed Kubernetes.

As far as BigQuery. The equivalent would be Redshift.

https://blog.panoply.io/a-full-comparison-of-redshift-and-bi...

Reliability is just one factor in the equation, and GCP definitely isn't that far behind AWS

Things happen. I never made an argument about reliability.

2 more replies

halbritt7y ago

GCP has a few features that set it apart from other cloud providers. GKE is head and shoulders above the other offerings from AWS and Azure.

Most folks deploying distributed data stores that need guaranteed performance are using local disk, so this can be a really big deal.

scarface747y ago

However, I could see doing a multicloud solution where I took advantage of the price difference for one project.

2 more replies

chimen7y ago

scarface747y ago

Were you making that decision for a personal project or for something work related?

Theoretically, you could use the stack price estimator, I haven’t tried it though.

https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGui...

AlexB1387y ago

This has been going on longer than three days. We have been dealing with this exact issue since at least Monday (11/5) morning in us-central1.

splap7y ago

same here. using gcloud, not web console

fizzledbits7y ago

same here as well, my us-central1 instance still will not boot

marcinzm7y ago

>Nov 09, 2018 05:59

>We will provide more information by Monday, 2018-11-12 11:00 US/Pacific.

Wait, did the people tasked with fixing this just take the weekend off?

justinsb7y ago

My working assumption is that 18006 should have closed out 18005. But now it sounds like there's a different issue, which we're working to get to the bottom of.

jasonlotito7y ago

The people tasked with fixing this aren't the ones providing the updates.

INTPenis7y ago

Understandable but in my experience the incident manager assigned is still supposed to keep track of progress during weekends when you have a major incident.

And this is likely a major incident with significant customer impact.

The way google is handling all this gives a pretty poor impression. Seems like this kubernetes is just a PoC.

1 more reply

marcinzm7y ago

2 more replies

trhway7y ago

They have whole book describing who fixes, provides updates, etc. Fun meditative reading while waiting for the outage to get fixed.

https://landing.google.com/sre/sre-book/chapters/managing-in...

Looks like this time Mary took the whole week off without telling Josephine :)

rlancerOP7y ago

Status page is inaccurate as issues doesn't only affect the web UI, the same operations are not functioning via the CLI.

pm907y ago

Its kinda strange that HN seems to be the most effective way to give feedback to Google Cloud :/

Draiken7y ago

Sometimes we just don't have all the information.

breakingcups7y ago

Most transparent, at least.

kenan_warren7y ago

camhutch7y ago

I just tried turning up my 1-node test cluster via terraform, and it worked fine. I would have thought the gcloud CLI would be using the same API.

I did this in the australia-southeast1-a zone.

base6987y ago

What operations? Status just shows node pool creation.

rlancerOP7y ago

Can not create a new Clusters or Node Pool and can not resize exiting Node Pools, as far as users are reporting it's happening in all regions too.

Error message when creating a new Cluster:

1 more reply

scarface747y ago

What would a small business do as a contingency plan?

eknkc7y ago

This might be an unpopular opinion but,

Going multi region on AWS should be safe enough.

You could go multi cloud though. But when half of the internet struggles to work correctly, I’d not feel too bad about my small business’ downtime.

thaumasiotes7y ago

Additionally, from a "nobody ever got fired for buying IBM" perspective, you're unlikely to catch much blame from your users for going down when everyone else was down too.

hacknat7y ago

Yeah, AWS has never had a global outage, Google has had 2 now.

1 more reply

owenmarshall7y ago

> You could go multi cloud though.

Multi cloud is almost always more pain than gain. You’d spend time and effort abstracting away the value that a cloud provider brings in canned services.

Hell, multi region is often more than many workloads need.

1 more reply

mey7y ago

Then start looking at points of failure and sort them based on severity and probability. Is your own software deployment going to generate more downtime per year than a regional aws outage?

People outsource to cloud providers not because they are cheap, but because managing infra in house is hard. Also move fast and break things.

Read AWS docs about availability, there are availability zones in a region, spread across those to minimize impact. Then test when something goes down. Fix/repeat.

devonkim7y ago

2 more replies

gaius7y ago

If all your competitors are down too, the equation changes, no?

1 more reply

yjftsjthsd-h7y ago

dgemm7y ago

Assuming you have architected multi-region I'm not sure how realistic that scenario is. AWS regions are mostly standalone, I have seen services go down in a region but never globally.

scarface747y ago

The most recent example that I can think of when something went down globally was Route 53 - the one service that AWS promises 100% up time for.

2 more replies

apapli7y ago

Very little, unless that small business has a big enough budget to design something that spans multiple clouds.

Ultimately it’s a risk/return decision.

user59944617y ago

You need to have a secondary location for backups, not AWS. Stores copie of customer data, orders, accounts, balances and anything that is critical to the business.

If AWS ever screws up, you will be able to continue running the business even if it might take weeks to start over.

viraptor7y ago

> What would a small business do as a contingency plan?

Otherwise it depends on your specific scenario, your support contracts, and lots of other things. You need to decide what matters, how much the mitigation costs vs downtime, and go from there.

dabei7y ago

Using Kubernetes is a good start. It should be easy to migrate your server between EKS and GKE. However data is trickier to move around. So you won’t be immune from all global outages.

aviv7y ago

We invested early in being multi-region on GCP as well as multi-cloud with AWS as a fully redundant option if it ever became necessary to fail over to them.

adevx7y ago

Paying off big-time now :-)

1 more reply

xchaotic7y ago

tanilama7y ago

Mtinie7y ago

Multi-Cloud-Service-as-a-Service application redundancy wrappers.

I wish I was only being tongue-in-cheek.

halbritt7y ago

My entire production infrastructure is in GCP. What happened here has caused approximately zero impact to the availability of my service.

geggam7y ago

Infrastructure as code.

Terraform using AMIs plus chef recipes that work in the cloud and bare metal. Dont use AWS specific services.

This would allow you to spin over to another cloud provider , vsphere or bare metal with minimal work

cloakandswagger7y ago

1 more reply

scarface747y ago

That’s not how Terraform works. Each provisioner has separate syntax depending on the cloud provider. The template for AWS wouldn’t work on GCP or Azure.

1 more reply

xchaotic7y ago

I think you are downplaying minimal

2 more replies

scarface747y ago

And this is the reason that people end up spending so much more in the cloud - treating AWS like a glorified colo.

rlancerOP7y ago

UPDATE: Got some clarity, these issues are caused by "resource exhaustion" meaning there are no resources left to be allocated.

halbritt7y ago

I'm curious to see if this is true.

7ewis7y ago

I honestly don't mind if providers have outages - we can't expect 100.00% accuracy, I know the systems I manage certainly don't achieve that.

One thing I do care about though, is root cause analysis. I love reading a good RCA, it restores my faith in the company and makes me trust them more.

(I'm not affect by the GKE outage so opinions may differ right now!)

locusm7y ago

thwy123217y ago

pfd19867y ago

Same here. We have spent 2 days trying to create instances and migrate images just to figure out later they can't start.

Right when I convinced our project to get migrated from AWS...

Masiosare7y ago

1 more reply

tigershark7y ago

Why on earth would you do that, unless you had huge problems on aws?

sladey7y ago

The timeline of this disruption matches when we started experiencing cloud build errors.

lstamour7y ago

Outsider here, but I believe Cloud Build runs on GKE Jobs, so if they’re having trouble, it does indeed sound related.

ernsheong7y ago

"third consecutive day of service disruption" is not an accurate statement? Latest update was Nov 11 saying things resolved on Nov 9.

https://status.cloud.google.com/incident/container-engine/18...

ernsheong7y ago

If all nodes in GKE clusters were down for 3 days, I would consider this newsworthy and shocking. This... is not. Come on, people.

013a7y ago

Cloud providers have all of the potential in the world to make each region truly isolated. I shouldn't have to architect my application to be multi-cloud, at least for stability reasons.

Yet, somehow every major cloud provider experiences global outages.

threeseed7y ago

AWS regions are very much isolated from each other.

spiderPig7y ago

Our company is dependent on this as well and the way customer service has been handling this has been abysmal thus far.

qaq7y ago

There is no magic public clouds have incredibly complex control planes and marketing fluff aside you would very likely experience much better uptime at singe top tier DC than @ a cloud provider.

arunoda7y ago

The is not only GKE. But for GCE as well. I cannot create instance is almost all zones. I tried both preemptible and normal as well.

Always saying resource not available. My account is a pretty new account.

In contrast, one of my friend is having a pretty old account which is very active. He has no such issue.

So I think due to this issue, Google has enabled some resource limitation for new accounts.

But they should properly communicate this issue.

gigatexal7y ago

Jedi727y ago

Goes to show outsourcing infrastructure is more about blame shifting so that when things go wrong its "not our fault" than reducing actual downtime.

closeparen7y ago

Doesn’t GKE “just” run an independent Kubernetes cluster on customer VMs? How is a widespread outage like this possible?

regnerba7y ago

GKE does the creation of the VMs and setup of them, joining them to the cluster and applying labels for example.

GKE doesn't (at least to my knowledge) allow you to create VMs separately and join them to the cluster in any kind of easy fashion.

kenan_warren7y ago

raincom7y ago

Nope, GKE = master/control plane owned by Google. Customers are just tenants, who can schedule workloads.

rlancerOP7y ago

GKE gives you a fully managed Master Node.

fizzledbits7y ago

An instance in us-central1-a has refused to start since last Thursday or Friday.

I created a new instance in us-west2-c, which worked briefly but began to fail midday Friday, and kept failing through the weekend.

And yet, the status page says all services are available.

Is the typical of others' experiences?

wijowa7y ago

nielsole7y ago

wb3tech7y ago

https://stackoverflow.com/questions/53244471/gke-cluster-won...

regnerba7y ago

Is this just about creating new pools? I haven't noticed an issue with our existing pools scaling.

rlancerOP7y ago

You were able to add more Nodes to you're pool? Are you using any auto scaling?

_wmd7y ago

When guerilla marketing backfires

bdibs7y ago

As someone currently trying to decide between GCP and AWS for a project, is this a regular occurrence?

And for those who have used both, which would you go with today?

franky_g7y ago

Had it affected all regions or just some?

Is there another status page Google? Coz the last update I'm looking at...is dated on the 9th..

justinsb7y ago

_If_ that's the case, something else is causing the error messages other people are seeing

fulafel7y ago

Offtopic but are there some documented exceptions to the "keep the original title" rule?

whatshisface7y ago

Why do cloud providers have more global outages than major flagship websites like google.com?

betaby7y ago

Whey don't run on the same infra. Amazon.com doesn't run on AWS.

talonx7y ago

On the contrary, it does. They made the transition gradually.

fergie7y ago

Things break after everybody has gone home on a Friday? 3 day disruption.

thomasfl7y ago

I'd like to upvote, but 666 points seemed relevant.

haosdent7y ago

Time to use Mesos.

shiftnight7y ago

I have a question. At what point does k8s make sense?

Monolith for the win! Opinions?

_asummers7y ago

polemic7y ago

james-mcelwain7y ago

scarface747y ago

I definitely wouldn’t be managing my own database or caching layer without a very good reason. I would use a managed service if I were using a cloud provider.

013a7y ago

I hate the word Microservice, so I'm just going to use the word Service.

hacknat7y ago

This outage really doesn’t have much to do with K8s.

shiftnight7y ago

Maybe so, but you won't be affected by this outage if you never decided to deploy k8s in the first place.

Even if you deploy k8s privately, or over at Amazon, I think there's enough horror stories to make you think twice about the technology.

Then, if it isn't going to be k8s for microservices, what's a more reliable alternative?

nstart7y ago

So at what point does k8s make sense? Only when you have answers to the following:

raarts7y ago

vesak7y ago

>The k8s eco system is like the JS framework ecosystem right now - There are no set ways of doing anything.

Given how both the JS and devops worlds seems to be progressing, is there any reason to believe that this will change before the next thing comes and K8S becomes a ghost town?

bg247y ago

kazen447y ago

i agree with you on major points.

Also, migrating to microservices for existing services might not be worth it, especially if you don't operate at a massive scale.

Keep it simple stupid is still a solid design decision, despite all the microservice/container hype.

Most bussinesses only need a couple of servers that provide the service, spread redundantly with a HA capability.

aaaaaaaaaab7y ago

Daily reminder that there's no "cloud", just other people's computers. ( ͡° ͜ʖ ͡°)

spullara7y ago

If a hosting service is down and nobody uses it, is there really any disruption?

j / k navigate · click thread line to collapse