GCP should be designed in a way where "global outage" isn't even in their vocabulary.
At my company we are split into three regions (US, EU, APAC), and we have the same issue: global outages for things we could have managed regionally. Whatever the global architecture saves, it disappears each minute a client is down in a global outage because someone thousands of kilometres away messed up.
You don't have to unify, at all. You don't unify with your competitors, and the world has not exploded. Why not compete internally between regions?
As far as global services go though, it's easy enough to say "it should just not be possible", but how do you propose doing that in practice for a global service?
How is new config going to go out, globally, without being global? How do global services work if they're not global? How does DDoS protection work if you don't do it globally?
People make fun of "webscale" but operating Google is really difficult and complicated!
As I understand it, GCP is already designed to make global outages impossible. Obviously this outage shows that they messed up somehow and some global point of failure still remains. Looking forward to the post-mortem.
How much more work would Google create for themselves if they had not globalized their stack? Are we talking something like 5 subsets to manage instead of 1?
Ex-googler, no particular knowledge of this event, information might be out of date.
Of course, if you deploy a change to all of your separated stacks at once through some sort of automated pipeline, the separation doesn't buy you much. It's easy to break everything simultaneously that way if there's some difference between test and prod you didn't realize was there.
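A minimal sketch of the safer alternative: roll the change out one stack at a time with a bake period, and stop at the first unhealthy region. The `deploy` and `healthy` helpers here are hypothetical stand-ins for a real pipeline, not any actual deploy tooling:

```python
import time

REGIONS = ["us", "eu", "apac"]  # isolated stacks, rolled out in order

def healthy(region: str) -> bool:
    # Hypothetical health probe; in practice this would query the
    # region's monitoring for error-rate regressions after the change.
    return True

def deploy(region: str, change: str) -> None:
    # Hypothetical per-region deploy step; stands in for the real pipeline.
    print(f"deploying {change} to {region}")

def staged_rollout(change: str, bake_seconds: int = 0) -> list[str]:
    """Roll out one region at a time; abort at the first unhealthy region."""
    done = []
    for region in REGIONS:
        deploy(region, change)
        time.sleep(bake_seconds)  # let the change bake before continuing
        if not healthy(region):
            print(f"aborting rollout: {region} unhealthy after {change}")
            break
        done.append(region)
    return done
```

The point is that the blast radius of a bad change is one region plus however long the bake period is, rather than everything at once.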
I reckon the only way to achieve that would be to have the same level of interoperability between regions as you would get between two distinct cloud providers.
Of course, at Google scale 'partial' is still very big.
And why enterprises clamoring for AWS to feature-match Google's global stuff (theoretically making I.T. easier) instead of remaining regionally isolated (actually making I.T. more resilient, with no extra work if I.T. operators can figure out infra-as-code patterns) should STFU and learn themselves some Terraform, Pulumi, or the like.
Also, AWS, if you're in this thread: stop with the recent cross-region coupling features already. Google's doing it wrong. Explain that, be patient, and the market share will come back to you when they run out of the GCP subsidy dollars.
> gcp should be designed in a way where the term “global outage” isn’t a word in their vocabulary.
If that's what you really need, then distribute your assets across GCP, AWS, and DO. That likely means not using any cloud-specific features such as Lambda. AWS is actually really good in this regard, as SES and RDS are easily replicated with regular instances in other cloud providers, which may wrap some cloud-specific feature themselves.

Listening to the "All In Podcast" yesterday, even those guys were talking about revenue drops in the big cloud services and noting that we're currently in the midst of a swing back to self-hosting/co-location/whatever thinking and migrations out.
IMHO those building greenfield solutions today should take a hard look at whether the default approach of the last ~10 years ("of course you build in $BIGCLOUD") makes sense for the application; in many cases it does not.
It also has the added benefit of decentralizing the internet a bit (even if only a little).
I would be hesitant to attribute slowed growth to a return to self-hosting; it's much more likely caused by companies dialing back their cloud growth after spending a few years going ham digitizing everything during the pandemic.
From that POV I expect my platform to behave like a utility (never change, or change only with strict backward compatibility). That level of control is simply at odds with the business model of the cloud.
That said, I think the point generally stands: one could argue slower-than-expected growth in cloud services is a revenue drop (in a way) versus expectations. The market responded accordingly[0]: "However, Azure growth is decelerating." Note that this includes the explosion of "2023 AI hotness", which is almost certainly offsetting what would otherwise be larger losses from the shift I'm describing. As the All In guys noted, "you won't see a pitch deck without the letters AI in it", and a good chunk of that is still going to cloud providers, as (in my opinion) there are long tails to these changes and many existing solutions/applications getting "AI" slapped on them are effectively trapped in $BIGCLOUD.
Self-hosting AI is also significantly more difficult and more expensive up front once you start dealing with (typically) Nvidia hardware costs and software-stack complexity. For many of these "we need to throw AI in this" pivots, I can definitely see the better-understood and initially faster, "cheaper" route of cloud services continuing until the AI trend stabilizes.
From what I could hear (and process) over the screaming, the All In guys presented the argument I tend to agree with: a resurgence of self-hosted infrastructure.
Companies are also dialing back cloud spend because they're realizing that for many applications it's comparatively very expensive, and can actually be limiting compared to self-hosting[1]. As usual, when the cheap money and the economic boom retract, they start actually looking at costs they were once happy to just keep writing checks for.
I'd like to reiterate there's a lot of calculation and strategy when it comes down to selecting infrastructure hosting. Again, I think we're in a period where there's a bit of a sea change/wakeup from the past decade of "of course you always build and host everything in $BIGCLOUD" - without even remotely considering alternatives. It's been the default for a while and it isn't as much anymore - and I'd argue that trend is accelerating. There is no "one size fits all".
[0] - https://www.investors.com/news/technology/msft-stock-microso...
[1] - https://www.linkedin.com/pulse/snapchat-earnings-case-runawa...
You build greenfield in cloud precisely because it is greenfield and the utilization isn't well understood. Cloud options let you adjust and experiment quickly. Once a workload is well understood it's a good candidate for optimization, including a move to self managed hardware / on prem.
Buying hardware is a great option once you actually understand the utilization of your product. Just make sure you also have competent operators.
I would argue that as the AI trend (eventually) wanes, and many AI startups and in-house AI projects inevitably fail to materialize, the much longer and more general migration out of $BIGCLOUD will become more drastic and obvious.
I don't buy individual stocks but I would happily bet a dinner on big cloud growth showing substantial reductions/losses in coming years as the overall situation stabilizes.
When one buys a house, they should take a hard look at whether the default approach of paying for utilities makes sense, versus generating their own power.
While that's a bit snarky, the reasoning is similar. You can:
* Use "bigcloud"(TM) with the whole kit: VMs, their managed services, etc.
* Use bigcloud, but just VMs or storage
* Rent VMs from a smaller provider
* Rent actual servers
* Buy your servers and ship them to a colo
* Buy your servers and build a datacenter
Every level you drop, you need more work, and it grows (I suspect not linearly). Sure, if you have all the required experts (or you rent them) you can do everything yourself. If not, you'll have to defer to vendors. You will pay some premium for this, but it's either that or payroll.
What also needs to be factored in is how static your system is. If a single machine works for your use-case, great.
One of the systems I manage has hundreds of millions of dollars in contracts on the line, thousands of VMs. I do not care if any single VM goes down; the system will kill it and provision a new one. A big cloud provider availability zone often spans across multiple datacenters too, each datacenter with their own redundancies. Even if an entire AZ goes down, we can survive on the other two (with possibly some temporary degradation for a few minutes). If the whole region goes down, we fallback to another. We certainly don't have the time to discuss individual servers or rack and stack anything.
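The core of that "don't care about any single VM" posture is a reconciliation loop: unhealthy instances are replaced, not repaired. A minimal sketch, where `provision` is a hypothetical stand-in for the real autoscaling API:

```python
import itertools

_ids = itertools.count(1)

def provision() -> str:
    # Hypothetical provisioning call; stands in for an autoscaling API.
    return f"vm-{next(_ids)}"

def reconcile(desired: int, fleet: set[str], is_healthy) -> set[str]:
    """One pass of a self-healing loop: drop unhealthy VMs, top back up."""
    survivors = {vm for vm in fleet if is_healthy(vm)}
    while len(survivors) < desired:
        survivors.add(provision())  # replace rather than repair
    return survivors
```

Run periodically, this keeps the fleet at the desired size regardless of which individual machines (or whole AZs' worth of machines) disappear.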
It does not come cheap. AWS specifically has egregious networking fees and you end up paying multiple times (AZ to AZ traffic, NAT gateways, and a myriad services that also charge by GB, like GuardDuty). It adds up if you are not careful.
From time to time, management floats the idea of migrating to 'on-prem', because that's reportedly cheaper. Sure: ignoring the hundreds of engineers that would be involved in the migration, and also ignoring all the engineers that would be required to maintain things on-premises, it might be cheaper.
But that's also ignoring the main reason why cloud deployments tend to become so expensive: they are easy. Confronted with the option of spinning up more machines versus possibly missing a deadline, middle managers will ask for more resources. Maybe it's "just" 1k a month extra (those developers would cost more!). It gets approved. 50 other groups are doing the same. Now it's 50k. Rinse, repeat. If more emphasis were placed on optimization, most cloud deployments could be shrunk spectacularly. The microservices fad doesn't help (your architecture might genuinely require it, but often the real reason is that you want to ship your org chart, not anything technical).
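The arithmetic of that creep is worth spelling out, using the illustrative $1k/month-per-group figure from above:

```python
def monthly_creep(groups: int, per_group_usd: int) -> int:
    """Sum of each group's individually 'small' approved increase."""
    return groups * per_group_usd

monthly = monthly_creep(50, 1_000)  # 50 groups x "just" $1k/month each
annual = monthly * 12               # the number finance eventually sees

print(monthly, annual)  # 50000 600000
```

No single approval looked unreasonable, but the fleet-wide total is $600k/year of spend nobody decided on.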
Yes, people do. They install solar panels and use them to generate at least some of their own power. Near-future battery tech might allow them to generate all of it if they get enough sunlight, at which point this becomes a genuine question to answer: the cost of installing and maintaining the panels and batteries over their lifetime, versus the expected cost of purchasing power from utilities.
In a similar manner, cloud vs self-hosting is a valid consideration that changes over time. We now have Docker and similar tools, which make managing your own infrastructure much easier than it was ten years ago. I fully expect even better tools in the future, so the calculus will keep shifting. Maybe in another ten years there'll be almost no benefit to using the cloud (except perhaps as a CDN).
<3 To the engineers trying to fix it at the moment.
If one provider is down more than the others, the criticism is not only valid, it results in real business loss for the provider and its customers.
On multi-cloud: it's one way to reduce the amount of downtime you have, but it comes with a significant operational cost depending on how your application is architected and how your internal teams are organized. It is totally practical to bank on AWS's reliability until you have enough traction or revenue that the added uptime of going multi-cloud is worth the investment. I know you're not saying this isn't the case (I think you're saying "do that if you're going to complain about one provider's uptime"), but it seemed worth putting the context into the HN ether.
Multi-cloud is saying you think you can manage Kafka across two or three clouds better than GCP can manage Pub/Sub.
Er, we absolutely can and should compare rates of problems and overall reliability.
If you run your own hardware these events are inevitable too.
I know it's just a psychological thing about giving up "control", but I have to stifle a chuckle every time.
Is anyone tracking reliability for these public providers? I'd be curious how AWS compares to Azure and GCP. My experience is that it's better, but maybe we've just avoided Kinesis or whatever it is that keeps going down.
in multiple datacenters?
Not great, not terrible.
Let's take a gander at incident history: https://status.cloud.google.com/summary
Cloud Build looks bad... three multi-hour incidents this year, four in fall/winter last year.
Cloud Developer Tools have had four multi-hour incidents this year, many last fall/winter.
Cloud Firestore looks abysmal... Six multi-hour incidents this year, one of them 23 hours.
Cloud App Engine had three multi-hour incidents this year, many in fall/winter last year.
BigQuery had three multi-hour incidents this year, many in fall/winter last year.
Cloud Console had five multi-hour incidents this year, many in fall/winter last year. (And from my personal experience, their console blows pretty much all the time)
Cloud Networking has had nine incidents this year, one of them was eight days long. What the fuck.
Compute Engine has had five multi-hour incidents this year, many last fall/winter.
GKE had three incidents this year, and multiple over the past winter.
Can somebody do a comparison to AWS? This seems shitty but maybe it's par for the course?
This is a pretty reductionist summary, e.g. the 8-day Cloud Networking incident root cause:
> Description: Our engineering team continues to investigate this issue and is evaluating additional improvement opportunities to identify effective rerouting of traffic. They have narrowed down the issue to one regional telecom service provider and reported this to them for further investigation. The connectivity problems are still mostly resolved at this point although some customers may observe delayed round trip time or longer latency or sporadic packet loss until fully resolved.
Still a big problem product-wise, but you're looking at a global incident history view without any region/severity filters.
The corresponding AWS service health dashboard makes it much harder to view this level of detail, but it is also actually useful for someone asking "is product $xyz, which I depend on in region $abc, currently down or not?"
(full disclosure, work at Google but not on cloud stuff)
https://arstechnica.com/tech-policy/2023/02/us-says-google-r...
They claim the Gmail specific issues are resolved. We shall see...
Feb 27, 2023 2:03 PM UTC We experienced a brief network outage with packet loss, impacting a number of workspace services. The impact is over. We are investigating and monitoring.