What I can tell you is that the unbelievable bloat in the complexity of our systems is going to bite us in the ass. I'll never forget when I joined a hip fintech company, and the director of eng told us in orientation that we should think of their cloud of services as a thousand points of light, out in space. I knew my days were numbered at exactly that moment. This company had 200k unique users, and they were spending a million dollars a month on CRUD. Granted, banking is its own beast, but I had just come from a company of 10 people serving 3 million daily users at 10k requests a second for images drawn on the fly by GPUs. Our hosting costs never exceeded 20k per month, and the vast majority of that was Cloudflare.
Deploying meant compiling a static binary and copying it to the 4-6 hardware servers we ran in a couple racks, one rack on each side of the continent. We were drunk by 11am most of the time.
Today, it's apparently much more impressive if you need a team of earnest, bright-eyed Stanford grads constantly tweaking and fiddling with 100 knobs in order to keep systems running. Enter Kubernetes.
My favorite example of this right now is Vitess. Sure, it's a beautiful piece of technology. But for a use case my company is looking at, we'll be replacing one (exceptionally large) DB with in excess of 80 MySQL pods, managed by another opaque-through-complexity system running on top of Kubernetes (which already bites us regularly even though it's "managed").
The complexity and failure scenarios make my head ache, even though I should never have to interact with it myself.
Oh, and my current favorite PITA - having to change the API version of deployment objects from 'v1beta1' to 'v1' in over 160 microservice charts as part of a Kubernetes version upgrade. Helm 2 doesn't recognize the deployments as being identical, so we also have to do a Helm 3 migration on top of that, just to avoid taking down our entire ecosystem to do the API version upgrade. Wheeee!
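For anyone who hasn't done this dance: the edit itself is trivial, roughly one line per chart. A hedged illustration (made-up names; also note that apps/v1 made spec.selector mandatory, which the beta APIs would happily default for you):

```yaml
apiVersion: apps/v1        # was: apps/v1beta1, removed in Kubernetes 1.16
kind: Deployment
metadata:
  name: some-microservice
spec:
  selector:                # required in apps/v1, optional in the beta APIs
    matchLabels:
      app: some-microservice
  template:
    metadata:
      labels:
        app: some-microservice
    spec:
      containers:
        - name: some-microservice
          image: example.com/some-microservice:1.2.3
```

The pain isn't the edit, it's repeating it 160+ times while Helm treats the result as a brand new resource.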
How is this a problem unique to Kubernetes? Don't you have to make similar changes when upgrading a library or dependency that was in beta?
But that's a moot point anyways, since Vitess doesn't use persistent volumes - it reloads the individual DBs from backups and binlogs when a pod is moved or restarted.
That said, a couple thoughts that came to mind:
1. having only 4 servers in 2 locations serving 3m customers a day seems crazy to me, at least in the context of current practices regarding highly available systems.
2. not sure your cost comparisons are fair: in the first case you're talking about cloud costs (so including hardware, 3rd-party service/API fees, etc.), but in the second you're just talking hosting fees.
If your first company had a relatively static, hardware-heavy (GPUs doing most of the work) workload, easily handled by a few servers -- then it would be crazy to pay for a cloud provider. And it wouldn't make much sense to bother with k8s or containers either (imo).
On the other hand, if the more recent company has a dynamic/spiky, software-heavy workload, with a ton of different services, orders of magnitude more infrastructure, and (being fintech) much more demanding SLAs... then it might make a lot of sense to use a cloud provider and take advantage of k8s. Especially if you're a startup that doesn't have the time/expertise to deal with datacenter design.
I agree that there's a lot of unnecessary fixation on the latest and greatest these days, but there are definitely situations where Kubernetes can be very valuable.
This was all for a weather radar app, and you are correct, there really weren't any SLAs, but we had to handle very high loads. We did make use of cloud services for some pieces of the system (there was a database and a small API for some minor bookkeeping, mostly around users). I included those costs in my estimate of monthly expenses. We had lots of caches, for all our JSON and for things like user authentication, which saved us from having to really figure out the database side. The caches were typically push-based, so we didn't let user requests get to the disk if we could help it.
The vast majority of requests were for those images though, which required moving lots of clumsy geographic data into the GPUs to render map tiles (at high-def and high zooms as well), so the requests were still somewhat costly to serve, even if they didn't hit a database. We were able to get away with a small footprint in the datacenter by making heavy use of CDN caching. Cache lifetimes for the latest weather images were often measured in seconds, and getting those timings right was crucial. Screwing up cache lifetimes would rapidly swamp the system with requests, but the software was good at continuing to keep latency low under heavy load, and degrading gracefully. In fact, the vast majority of bandwidth usage in the datacenters was actually not requests, but streaming geographic data from various government sources. We regularly had 50-100MB/s coming in, and we stored all of it in memory. The GPU machines had 100-200GB of memory, and we used all of it. We had to cycle through that memory pretty rapidly as well, so making sure allocations were low and memory was freed up on time was important.
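To put some flesh on the cache-lifetime point: most of the magic is just getting the response headers right so the CDN absorbs nearly everything. A minimal sketch in Go, assuming a hypothetical tile endpoint (the directive values are my guesses, not their actual numbers):

```go
package main

import "net/http"

// renderedTileBytes stands in for the GPU render pipeline (hypothetical).
func renderedTileBytes() []byte { return []byte{} }

// latestTile sets a cache lifetime of a few seconds: enough for the CDN
// to collapse thousands of identical tile requests into a single origin
// hit, while fresh radar imagery still shows up quickly.
func latestTile(w http.ResponseWriter, r *http.Request) {
	// s-maxage governs shared caches like the CDN; stale-while-revalidate
	// smooths over the moment the cached copy expires.
	w.Header().Set("Cache-Control", "public, max-age=5, s-maxage=15, stale-while-revalidate=30")
	w.Header().Set("Content-Type", "image/png")
	w.Write(renderedTileBytes())
}

func main() {
	http.HandleFunc("/tile", latestTile)
	http.ListenAndServe(":8080", nil)
}
```

Get those numbers wrong in either direction and you either swamp the origin or serve stale weather, which is exactly the tuning problem described above.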
It may not sound like we had much redundancy, but with all the caches, and each machine being quite powerful, we were in better shape than it sounds. We often took machines in and out of nginx. The way the graceful degradation worked, we would prioritize imagery from the lower zoom levels (more zoomed out), so the worst that would happen on a typical day is that some very zoomed-in images, in places few people were looking, might be slow or time out.
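The zoom-prioritized degradation is simpler than it might sound. A hedged sketch of the idea in Go; the handler, thresholds, and URL scheme are all invented for illustration:

```go
package main

import (
	"fmt"
	"net/http"
	"strconv"
	"sync/atomic"
)

var inFlight atomic.Int64 // requests currently being served

const (
	busyThreshold    = 512 // concurrent requests before shedding kicks in (invented)
	maxZoomUnderLoad = 8   // deepest zoom level still served while busy (invented)
)

// tileHandler degrades gracefully: when the box is saturated it keeps
// serving the zoomed-out tiles most users are looking at, and sheds the
// deep-zoom requests that only affect a handful of viewers.
func tileHandler(w http.ResponseWriter, r *http.Request) {
	n := inFlight.Add(1)
	defer inFlight.Add(-1)

	zoom, err := strconv.Atoi(r.URL.Query().Get("z"))
	if err != nil {
		http.Error(w, "bad zoom level", http.StatusBadRequest)
		return
	}

	if n > busyThreshold && zoom > maxZoomUnderLoad {
		w.Header().Set("Retry-After", "2")
		http.Error(w, "shedding load", http.StatusServiceUnavailable)
		return
	}

	fmt.Fprintf(w, "tile at zoom %d\n", zoom) // stand-in for the GPU render
}

func main() {
	http.HandleFunc("/tile", tileHandler)
	http.ListenAndServe(":8080", nil)
}
```

The nice property is that the popular zoomed-out tiles, which are also the most cacheable, never stop being served.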
So, in the end, you are correct, the situations are different. The bank had to store things for a lot longer, and had to uphold more stringent SLAs and the like. That said, I still think they were flushing a lot of cash down the toilet and overcomplicating things :).
We've had to work very hard to let developers/SRE/ops folks provision VMs and bare-metal machines in our datacenters the same way they would in the cloud provider that we use. Obviously it's not as fast, seamless, or feature-rich as it is with AWS/GCP/Azure et al., but I'm proud of the progress we've made.
What really kills me, though, is that a huge chunk of our engineers seem to think our work is a complete waste of time in the first place. We have several physical DCs and tens of thousands of machines... but since most engineers don't have to think about costs, or about workloads other than their own, they think of us as out of touch and clinging to the past.
Nothing worse than getting snark about our platform from an SRE who spends their days in a web app gluing together the ready-made services of Google and Amazon while acting as if they're building the world of tomorrow :)
If such a tool does not exist, do any of you feel that creating one is within the realm of possibility?
I would imagine all these knobs could have default configurations that 99% of all users would be okay with, and that a knob should only be exposed in a small number of cases.
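For what it's worth, that's roughly the shape such a tool would have to take. A toy sketch in Go of the "sane defaults, override only what you must" idea (all names invented):

```go
package main

import "fmt"

// Config holds the handful of knobs a user might plausibly touch;
// everything else stays a constant until proven otherwise.
type Config struct {
	Replicas       int
	MaxConnections int
}

// DefaultConfig is what the hypothetical 99% of users run with, untouched.
func DefaultConfig() Config {
	return Config{Replicas: 3, MaxConnections: 100}
}

// Apply lets the remaining 1% override only the knobs they care about.
func (c Config) Apply(overrides ...func(*Config)) Config {
	for _, o := range overrides {
		o(&c)
	}
	return c
}

func main() {
	cfg := DefaultConfig() // the common case: no knobs at all

	// The rare case: expose one knob, leave the rest alone.
	tuned := DefaultConfig().Apply(func(c *Config) { c.Replicas = 5 })

	fmt.Println(cfg, tuned)
}
```

The hard part isn't the mechanism, it's deciding which knobs make the 1% list.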
Don't get me wrong. I'd still probably build that as a monolith in Java instead of a thousand NodeJS services, but I can see how you end up with Kubernetes.
Let's be real, if you are old enough to get that reference without Googling, you probably would not have lasted that long at a hip fintech company anyways :-P