Geico's IT will slow to a crawl in the coming years due to the immense madness of supporting Kubernetes on top of OpenStack on top of Kubernetes (yes, that's what they're doing).
OpenStack's services are running in Kube? And Kube itself is run as an OpenStack thing? Why? Why not use the same tooling used to deploy that initial Kube to deploy as many as needed? Still a massive maintenance burden, but you don't need to add OpenStack into the mix.
You can have a large Kubernetes cluster running OpenStack, because it's probably the easiest way to deploy and maintain OpenStack. You then build smaller, isolated Kubernetes clusters on top of OpenStack, using VMs.
It's not as crazy as it sounds, but it does feel a little unnecessarily complex.
1. we have a management k8s cluster where we deploy app blueprints
2. the app blueprints contain, among other things, specifications for VMs to allocate, which get allocated through an OpenStack CRD controller
3. and those VMs then get provisioned as k8s nodes, forming isolated k8s clusters (probably themselves exposed as resource manifests by the CRD controller on the management cluster);
4. where those k8s nodes can then have "namespaced" (in the Linux kernel namespaces sense) k8s resource manifests bound to them
5. which, through another CRD controller on the management cluster and a paired CRD agent controller in the isolated cluster, causes equivalent regular resource manifests to be created in the isolated cluster
6. ...which can then do whatever arbitrary things k8s resource manifests can do. (After all, these manifests might even include deployments of arbitrary other CRD controllers, for other manifests to rely upon.)
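The pipeline in steps 1–3 can be sketched as plain data flow. This is a toy illustration only — the kinds (`OpenStackVM`, `IsolatedCluster`), field names, and blueprint shape are invented, not Geico's actual CRDs:

```python
# Hypothetical sketch of the blueprint -> VM -> isolated-cluster pipeline.
# All resource kinds and field names here are invented for illustration.

def reconcile_blueprint(blueprint):
    """Step 2: expand an app blueprint into OpenStack VM allocation requests."""
    return [
        {"kind": "OpenStackVM",
         "name": f"{blueprint['name']}-node-{i}",
         "flavor": blueprint["vm_flavor"]}
        for i in range(blueprint["node_count"])
    ]

def provision_cluster(blueprint, vms):
    """Step 3: the allocated VMs become nodes of a new isolated cluster,
    itself represented as a manifest on the management cluster."""
    return {"kind": "IsolatedCluster",
            "name": blueprint["name"],
            "nodes": [vm["name"] for vm in vms]}

blueprint = {"name": "billing-app", "vm_flavor": "m1.large", "node_count": 3}
vms = reconcile_blueprint(blueprint)
cluster = provision_cluster(blueprint, vms)
print(cluster["nodes"])  # ['billing-app-node-0', 'billing-app-node-1', 'billing-app-node-2']
```

In a real setup each arrow here would be a controller watching one resource kind and emitting another, but the shape of the data is the point.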
All said, it's not actually that braindead of an architecture. You might better think of it as "k8s, with OpenStack serving as its 'Container Compute-Cluster Interface' driver for allocating new nodes/node pools for itself" (the same way that k8s has Container Storage Interface drivers). Except that
1. there isn't a "Container Compute-Cluster Interface" spec like the CSI spec, so this needs to be done ad-hoc right now; and
2. k8s doesn't have a good multi-tenant security story — so rather than the k8s nodes created in these VMs being part of the cluster that spawned them, their resources isolated from the management-layer resources at a policy level, instead, the created nodes are formed into their own isolated clusters, with an isolated resource-set, and some kind of out-of-band resource replication and rewriting to allow for "passive" resources in the management cluster that control "active" resources in the sandboxed clusters.
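That "out-of-band resource replication and rewriting" could look roughly like this — a toy sketch where the management-only fields (`targetCluster`, `managementNamespace`) and the rewrite rule are invented, not from any real controller:

```python
# Toy sketch of a "passive" management-cluster manifest being rewritten into
# the "active" manifest applied in an isolated tenant cluster.
# The management-only field names are invented for illustration.

def rewrite_for_tenant(passive):
    """Drop management-layer bookkeeping fields; return the manifest the
    paired agent controller would apply inside the isolated cluster."""
    management_only = {"targetCluster", "managementNamespace"}
    return {k: v for k, v in passive.items() if k not in management_only}

passive = {
    "kind": "Deployment",
    "metadata": {"name": "web", "namespace": "tenant-a"},
    "spec": {"replicas": 2},
    "targetCluster": "isolated-cluster-7",   # routing: which sandbox gets this
    "managementNamespace": "tenants",        # management-layer bookkeeping
}
active = rewrite_for_tenant(passive)
print(sorted(active))  # ['kind', 'metadata', 'spec']
```

The isolation guarantee comes from the fact that the tenant cluster's API server never sees the management cluster at all; it only sees what the agent controller chooses to replicate.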
Dios mio mayne
Any bets on what's going to happen next?
If you have ever seen a data center from Azure, GCP or AWS, you will realize how difficult it will be for any company to compete in the long run. Those companies develop new generations of data center infrastructure with power efficiency improvements every single year. They negotiate network and power contracts at a scale that exceeds any typical Fortune 500 company. I'm skeptical that running your own data center will end up a cost saver in the long run.
...and then mark it up. AWS overall has a 38% operating margin[0]. Depending on your application this can hit you really hard (cloud egress bandwidth being an especially obscene offender).
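To put that margin in concrete terms: a 38% operating margin means price minus cost is 38% of the price, so the markup over cost works out as follows (simple arithmetic on the cited figure, nothing more):

```python
# A 38% operating margin means (price - cost) / price = 0.38,
# so price = cost / (1 - margin): roughly a 1.61x markup on every
# dollar of AWS's own underlying cost.
margin = 0.38
markup = 1 / (1 - margin)
print(round(markup, 2))  # 1.61
```

That 1.61x is an average across all of AWS; per the egress point above, some individual services sit far above it.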
> I'm skeptical that running your own data center will end up a cost saver in the long run.
It's not cloud -or- your own Azure-scale datacenter. There are any number of approaches in between, including hybrid setups that offload stuff like CDN, storage, edge services, etc to cloud, but the fact remains that many companies can run the entire business from a few beefy machines in co-location facilities. Most companies, solutions, etc are not actually Google, Snapchat, Geico, etc scale and never will be.
Throw in some minor accounting tricks like leasing (with or without Section 179) and these kinds of "creative" approaches are often impossible to beat from a pricing/performance and even uptime standpoint. That's certainly been my experience.
[0] - https://www.theinformation.com/articles/why-aws-fat-margins-...
Someone in the c-suite gets a massive bonus before moving to a new company.
It is not more secure; I read about downtime events every quarter. And more importantly, you have zero control over your costs.
Your company is likely not Amazon; you will do fine with your own on-prem computers.
What you're referring to is mostly about elasticity, and it's true that if you don't need it, it doesn't make sense to pay for it.
But that doesn't mean that on-prem (which almost always turns into a virtual machine shitshow with crappy network design -- which will continue as long as nobody implements things like strong IAM and Security Groups in their on-prem setups) is 'the same' as cloud but just in a physical location you control.
The inverse is also true. If you just run some VMs 'in the cloud', you're doing it wrong. Playing datacenter is just as bad as not moving away from classic virtual machines, cloud or no cloud.
I don't see that much difference compared to doing actual admin tasks.
I don't know, I've seen the shittiest stuff built on-prem and in the cloud, and I've seen completely amazing on-prem infrastructure and cloud stuff that could not possibly be built outside AWS.
Of course, even in the cloud you still need to apply security patches to everything. However, it still avoids a lot of issues, and thus saves money, in all but the largest setups.
Many data centers offer remote hands services. And I don't believe this is at all true.
I worked at a place that managed thousands of boxes in dozens of PoPs with 1.5 full-time people. If you design for this from the beginning, with cattle not pets and netboot everywhere, this is very doable. And a large cost savings vs cloud.
The costs of getting things wrong with on-prem aren't high on average, but they sure are spiky if you get unlucky.
If a hardware failure causes downtime you're doing it wrong. Additionally, big cloud scaring people from hardware with marketing and FUD has been very effective. Modern hardware is insanely reliable and performant - I don't think I've seen a datacenter/enterprise NVMe drive fail yet. It's not 2005 with spinning disks and power supplies blowing up left and right anymore.
> With 1 person, that person will sometimes be on vacation when a zero day takes you down. With 2 people, 1 will be on vacation when the second gets sick. You end up needing at least 5 people before you have enough redundancy for human issues and the ability to train people in whatever is the latest needed.
Hardware vendors (Dell, etc) have highly-discounted warranty services. In the event of a hardware failure you open a ticket and they dispatch someone directly to the facility (often within hours by SLA) and it gets handled.
Same thing for shipping HW directly to co-lo and they rack/cable/bootstrap for a nominal fee, remote hands for weird edge-cases, etc.
A lot of takes here and elsewhere seem to be either big-cloud or Meta-level datacenter. I have operated PoPs in a dozen co-location ("datacenter") facilities (a cabinet or two each) that no one on staff ever set foot in, with hardware we owned (and/or financed) that no one ever saw or touched. We operated this with two people looking after it as part of their broader roles and responsibilities, and frankly they didn't have much to do.
There is an entire industry that provides any number of highly flexible and cost-effective approaches for everything in between.
That server is the main database. And yes, there is a backup server, but for reasons, the backup server isn't working as expected. So if that main server's RAM failed for good, there goes our product, for god knows how long, considering how long it's taken so far to get a second one set up.
You don't have to deal with any of that shit in the cloud. None. You just spin up a new server in 2 seconds. You don't deal with shitty hardware, or the differences between old and new hardware (besides cpu arch, and some special classes), or incompatibilities, or running out of space, or getting smart hands in your rack, or a million other things.
And that's just the hardware side. The software side of the cloud is the one million unique hosted services they offer that you can just start using immediately. No server set-up, no configuration management, it already has security baked in, it's already integrated with the other million services, etc. You just start using it, immediately, and it just works. It saves you time, complexity, maintenance, and it gives you reliability, compatibility, flexibility, and allows you to ship something earlier.
I have managed servers on-prem for years, for tiny startups and huge companies. Both two decades ago and two years ago. Without a doubt, I would always suggest any kind of hosted, cloud-style vendor over on-prem. Only if somebody needs to be on-prem, or they are literally a teenager with no money at all and all the time in the world to waste DIYing, would I tell them to go on-prem.
Professional developers these days are primarily concerned with 1) getting their service running 2) as quickly as possible 3) someplace where they have instant access and control of it. Clicking around a cloud console accomplishes all three of these and allows you to write "Delivered the ____ service in 3 months that generates $XX M/year" on a performance review in short order. Having to build, rack, and configure a physical server or deal with "IT" (which has somehow become something separate from software engineering) does not. Because the developers are the ones delivering value they get to decide how it's done. AWS gets it done. A server in a datacenter in Texas that requires an SSH keypair to reach doesn't.
Your average SDE L4 does not know or care about init systems or SANs or colos or 802.1q or any of the myriad things required to run on-prem infra. They write software. Software makes money and so the business makes money - wash, rinse, repeat. Why would you have people on the front lines of your revenue stream worrying about these things when you can have a hyperscaler with a control plane do it for a nominal fee?
Lots of data processing workloads don't need to be run constantly, but do need to be run in a shorter amount of time. Cloud is pretty good for that sort of thing.
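The economics of that burstiness are easy to see with toy numbers (all rates below are hypothetical, purely for illustration): a weekly batch job that needs 400 machine-hours, compressed into 4 hours, forces you to own peak capacity on-prem but only rent it in the cloud.

```python
# Hypothetical rates illustrating why bursty batch workloads favor cloud.
# The job needs 400 machine-hours once a week, finished within 4 hours,
# so it needs 100 machines in parallel for those 4 hours.
cloud_rate = 0.50        # $/machine-hour, on-demand (made-up number)
machines_needed = 100
cloud_weekly = 400 * cloud_rate  # pay only for the hours actually used

onprem_monthly_per_machine = 120.0   # amortized $/machine-month (made-up)
weeks_per_month = 4.33
onprem_weekly = machines_needed * onprem_monthly_per_machine / weeks_per_month

print(cloud_weekly, round(onprem_weekly))  # 200.0 2771
```

The comparison flips, of course, if those 100 machines have other useful work to do the rest of the week — this only captures the peak-capacity argument.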
Because you're not a startup, there is a very good chance that you have a very process-driven (cover-your-ass), slow-moving culture. This very often translates to an IT department where getting even basic things done (like reserving extra compute, changing a network setting, or starting to use third-party software) takes months of waiting or pleading. Maybe you have never encountered this kind of pathological IT department, but they're very common, and they're a major reason executives bought into cloud to begin with. Of course, many companies like Geico seem to have merely replicated their IT pathologies in the cloud, but at least in the cloud you have fewer sources of problems in areas like physical space management, buying/integrating hardware to grow or change your footprint (and dealing with all the SKUs and supply chain problems therein), or negotiating on-prem licenses.
There are many more moving pieces when operating on-prem: more operations staff across more kinds of roles (yes, you still have eg devops people when using the cloud, but you don't need as many building operations staff (where managing a datacenter is its own speciality), people managing hardware/software vendors and related supply chain issues, people skilled in physical networking, people to plug things in/out and physically operate the machines), managing and acquiring the physical space where your on-prem setup is, buying/accounting for all the different kinds of hardware you need, licensing/using more software with more difficult integration to achieve equivalent functionality to eg EC2, licensing all your 3P software to run on-prem... even if nominally less expensive than the cloud in some cases, there are many more places where things can go wrong. That's not as easy to account for in a direct TCO comparison because it manifests as slowing things down - which does introduce very substantial costs - and distracting management away from other opportunities to grow revenue or improve costs.
Also, cloud downtime is really overstated as a problem in 2024. It makes the news because it has a high blast radius and involves high profile companies, not because it's more common than on-prem. With the exception of AWS us-east1 issues (which can break many AWS products at once across the world), most cloud reliability issues these days are isolated to only a few products and only a few regions. I think a lot of small on-prem companies don't realize that they are not actually more reliable, but just operate at a smaller scale where the probability of downtime causes "lucky streaks" to be more common (ie if you play roulette for three rounds, you're much more likely to have an abnormally high win rate than someone who plays it for three hundred rounds, even though you both have the same odds). Most companies don't have as mature security/risk operations as cloud providers and so face an existential risk/the possibility of huge (months) of downtime in the event of a fire/natural disaster at their dc, cryptolocker attack, janitor unplugging the server that says "do not unplug" - this isn't something people have to worry about with cloud providers to nearly the same extent.
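The roulette point is just the math of exposure: with an identical per-month outage probability, a small shop observed for one quarter is far more likely to look flawless than a hyperscaler observed across hundreds of region-months. A quick check with a hypothetical 2% monthly outage chance:

```python
# Same per-month outage probability, very different odds of a "lucky streak".
# p is a hypothetical figure, chosen only to illustrate the exposure effect.
p = 0.02                        # chance of an outage in any given month
flawless_3 = (1 - p) ** 3       # one small cluster, one quarter
flawless_300 = (1 - p) ** 300   # 100x the exposure (regions x products x months)
print(round(flawless_3, 3))     # 0.941
print(round(flawless_300, 4))
```

So ~94% of small shops sail through a quarter spotless, while a flawless record across 300 exposure-months is a rounding error — despite identical underlying reliability.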
Staff costs too high? Outsource. Opex too high? Insource.
You can spend a career jumping among companies swinging the pendulum back and forth.
I must admit, the computer was never the part of software that interested me.
This has been the story for 20 years now. Not even exaggerating. We all knew it was expensive from the get-go because we all did things on prem.