I do tech DD work for investment funds and the like, and one thing I often see is slow, complex, and expensive AWS-heavy architectures that optimize for problems the company doesn’t have and often never will have, in theory to ensure stability and scalability. The result is usually expensive and has nightmarish configuration complexity.
In practice, that complexity tends to lead to more outages and performance issues than a much simpler (rented) bare-metal setup with some spare capacity and better architecture design would. More than half of the serious outages I have seen documented in these reviews came from configuration mistakes or bugs in the software that is supposed to manage your resources.
Never mind that companies invest serious amounts of time in trying to manage complexity rather than remove it.
A few years ago I worked for a company that had two competing systems. One used AWS sparingly: just EC2, S3, RDS and load balancers. The other went berserk in the AWS candy shop and was this monstrosity that used 20-something different AWS services glued together by lambdas. This was touted as “the future”, and everyone who didn’t think it was a good idea was an idiot.
The simple solution cost about the same to run for a few thousand business customers as the complex one cost for ONE customer. The simple solution cost about 1/20 as much to develop. It also had about 1/2500 the latency on average, because it wasn’t constantly enqueuing and dequeuing data through a slow maze of SQS queues.
And best of all: you could move the simpler solution to bare metal servers. In fact, we ran all the testing on clusters of 6 RPIs. The complex solution was stuck in AWS forever.
Heck, their support is shit too. I talked to them to figure out an issue in their own in-house software, and they couldn’t help. My colleague happened to know what was wrong and fixed the issue by flipping a single checkbox.
But if you have bare metal with fast disk drives, everything changes. You can get decent performance at a lower price in exchange for taking on a bit more responsibility. So then the question becomes how much of a burden it is to manage what is essentially just another application.
AWS doesn’t just rent computers, it rents relief from responsibility, and prices raw performance to make that trade feel inevitable.
Most people do not operate services that cannot bear very occasional downtime. But they have been conditioned to think they do, or to not consider the other factors that influence their actual downtime.
For instance: we ran a service that in itself achieved 99.99% uptime (allowing about 52 minutes of downtime per year). We even survived a big AWS outage that took out almost everyone else, because we had as much redundancy as we could afford. However, the service depended completely on a system totally outside our control that would, on average, have multiple outages every day (usually at night, but not always), ranging from 30-second blips to an hour. Meaning that the customers would have to deal with this anyway, no matter how stable our systems were.
And yet, for years we obsessed about uptime needlessly. Our customers didn’t care. They had to deal with the unreliability of the upstream system anyway. It didn’t cost us that much money, but it did make everything more complex.
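For reference, an uptime SLA maps to a concrete downtime budget, and the arithmetic is worth doing once. A quick Go sketch (the 99.99% row matches the roughly 52 minutes mentioned above):

```go
package main

import "fmt"

// downtimeBudget returns the minutes of downtime per year that an
// uptime SLA (given as a percentage) still allows.
func downtimeBudget(slaPercent float64) float64 {
	const minutesPerYear = 365.25 * 24 * 60
	return minutesPerYear * (1 - slaPercent/100)
}

func main() {
	for _, sla := range []float64{99.9, 99.99, 99.999} {
		fmt.Printf("%.3f%% uptime -> %.1f min of downtime per year\n",
			sla, downtimeBudget(sla))
	}
}
```

99.99% works out to about 52.6 minutes a year, while each extra nine cuts the budget by a factor of ten; a dependency with multiple daily outages blows through all of these on its own.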
Now, back to the question: do you need RDS? When was the last time you set up and ran Postgres? When was the last time you set up replication and live backups? How hard was it the first time? How hard was it to repeat after doing it once?
If you are already on bare metal servers you may want to at least try to set up Postgres a few times and track cost in terms of time, money and complexity. Because if you use RDS, chances are it isn’t the only thing you are managing in the cloud.
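As a rough starting point: on a modern Postgres (12+), streaming replication is mostly a handful of settings. The hostnames, the `replicator` user, and the data directory below are placeholders:

```
# primary: postgresql.conf (close to the defaults on PG 12+)
wal_level = replica
max_wal_senders = 10

# primary: pg_hba.conf -- let the standby connect for replication
host  replication  replicator  10.0.0.2/32  scram-sha-256

# standby: clone the primary; -R writes standby.signal and primary_conninfo
#   pg_basebackup -h 10.0.0.1 -U replicator -D /var/lib/postgresql/data -R -X stream
```

Live backups typically build on the same WAL machinery (WAL archiving, or tools such as pgBackRest on top of it), which is why the second setup tends to go much faster than the first.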
I use Coolify for side projects; I haven’t investigated whether I’d want to use it for bigger or more important stuff.
But if you do need an orchestrator, Kubernetes is perhaps the safe bet. Not so much because I think it is better or worse than anything else, but because you can easily find people who know it and it has a big ecosystem around it. So if I were forced to make a general recommendation, it would probably be Kubernetes.
That being said, this is something I've been playing with over the years, exploring both ends of the spectrum. What I realized is that we tend to waste a lot of time on this with very little to show for it in terms of improved service reliability.
On one extreme we built a system that has most of the control plane as a layer in the server application. Then external to that we monitored performance and essentially had one lever: add or remove capacity. The coordination layer in the service figured out what to do with additional resources. Or how to deal with resources disappearing. There was only one binary and the service would configure itself to take on one of several roles as needed. All the way down to all of the roles if you are the last process running. (Almost nobody cares about the ability to scale all the way down, but it is nice when you can demo your entire system on a portable rack of RPis - and then just turn them off one by one without the service going down)
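A minimal sketch of that kind of self-configuring role assignment. The role names and the round-robin split here are hypothetical stand-ins; the real system's coordination layer was certainly richer, but the scale-down-to-one property looks like this:

```go
package main

import "fmt"

// Illustrative role names; the actual roles are not specified above.
var allRoles = []string{"api", "worker", "storage"}

// assignRoles decides which roles a process takes, given its rank among
// the live peers. With several peers each process specializes; as peers
// disappear the survivors absorb more roles, and the last process
// standing runs everything.
func assignRoles(rank, live int) []string {
	if live <= 1 {
		return allRoles // last one running: take on every role
	}
	var mine []string
	for i, r := range allRoles {
		if i%live == rank%live {
			mine = append(mine, r)
		}
	}
	return mine
}

func main() {
	for live := 3; live >= 1; live-- {
		fmt.Printf("%d peers: process 0 runs %v\n", live, assignRoles(0, live))
	}
}
```

The external control plane then only needs its one lever, add or remove capacity, and the processes re-divide the roles among themselves.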
On the other extreme is taking a critical look at what you really need and realizing that if the worst case means a couple of hours of downtime a couple of times per year, you can make do with very little. Just systemd, deb packages, and SSH access is sufficient for an awful lot of the more forgiving cases.
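Concretely, a "systemd + deb + SSH" deployment is often just one unit file per service, shipped inside the package. Names and paths here are illustrative:

```
# /etc/systemd/system/myapp.service -- installed by the .deb
[Unit]
Description=myapp
After=network-online.target
Wants=network-online.target

[Service]
ExecStart=/usr/bin/myapp --config /etc/myapp/config.toml
Restart=on-failure
RestartSec=2
User=myapp

[Install]
WantedBy=multi-user.target
```

A deploy is then roughly `ssh host 'apt install ./myapp.deb && systemctl restart myapp'`, which is easy to script and easy to reason about when it breaks.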
I also dabbled in running systems by having a smallish piece of Go code remote-manage a bunch of servers running Docker. People tend to laugh at this, but it was easy to set up, it is easy to understand, and it took care of everything the service needed. The Kubernetes setup that replaced it has had 4-5 times as much downtime. But to be fair, the person who took over the project went a bit overboard and probably wasn't the best qualified to manage Kubernetes to begin with.
It seems silly to not take advantage of Docker having an API that works perfectly well. (I'd research Podman if I were to do this again).
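The Docker Engine API really is just JSON over HTTP on a unix socket. A sketch of the listing side, not the author's actual tool: the socket path is the usual default, and `main` parses a canned sample response so it runs without a daemon present:

```go
package main

import (
	"context"
	"encoding/json"
	"fmt"
	"net"
	"net/http"
)

// container mirrors the fields we care about from the Docker Engine
// API's GET /containers/json response.
type container struct {
	ID    string `json:"Id"`
	Image string `json:"Image"`
	State string `json:"State"`
}

// parseContainers decodes a /containers/json response body.
func parseContainers(body []byte) ([]container, error) {
	var cs []container
	err := json.Unmarshal(body, &cs)
	return cs, err
}

// dockerClient returns an *http.Client that talks to the local Docker
// daemon over its unix socket; a remote manager would instead reach the
// socket over an SSH tunnel or a TLS-protected TCP endpoint.
func dockerClient() *http.Client {
	tr := &http.Transport{
		DialContext: func(ctx context.Context, _, _ string) (net.Conn, error) {
			return (&net.Dialer{}).DialContext(ctx, "unix", "/var/run/docker.sock")
		},
	}
	return &http.Client{Transport: tr}
}

func main() {
	// In a real manager this body would come from
	// dockerClient().Get("http://docker/containers/json").
	sample := []byte(`[{"Id":"abc123","Image":"nginx:1.25","State":"running"}]`)
	cs, err := parseContainers(sample)
	if err != nil {
		panic(err)
	}
	for _, c := range cs {
		fmt.Printf("%s %s %s\n", c.ID, c.Image, c.State)
	}
}
```

Creating and starting containers is the same pattern with POST requests, which is why a few hundred lines of Go can plausibly cover a small fleet's needs.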
I don't understand why more people don't try the simple stuff first when the demands they have to meet easily allow for it.
I want a 1985 Mercedes that is built like a tank and outlives me.
And in computing, having a bit of downtime 1-2 times per year is often a price worth paying if avoiding it requires 90% more cost and effort. (Of course, people end up having downtime anyway because they have something so complex that they have 100x the number of ways something can fail).
If you are a very big SaaS company that is not Google or Apple, you are probably serving hundreds of thousands, maybe millions, of unique users. AWS may be convenient, but you don't /need/ it; you can build an infrastructure that will handle such a workload with any of the big European providers.
You'll just lose in comfort what you'll gain in data sovereignty and infrastructure costs.
I worked for a €7M MRR company that had maybe a million users who used the software every day. The whole thing ran on a dozen OVH servers, including multi-site redundancy.
In times when one physical server can have 32, 64, or even 96 cores, you can pack your own little datacenter right there. It's pretty cheap to simply over-provision, keep one or two extra servers for redundancy, and be done.
So many businesses will happily run from a 4-core, $10 VPS (which would have been a beefy server 20 years ago).
Your point's a little moot.
The basic services are more or less the same, but the hyperscalers provide hundreds of services where smaller providers have only ten.
This is just my opinion, but some services just package software as a VM and let you spawn it with a fancy button, leaving you with a largely unmanaged instance.
Other services, like S3, BigQuery, or SQS, feel like magic.
It is easy to argue that it is expensive and complex. Since it is. And lots of people have made that argument. I don’t think I’ve seen anyone argue in favor of AWS while skimming the threads here.
So this is your opportunity to make the case for AWS.
It _used_ to be great, and the free tier made it easy enough to migrate most personal use cases to their infrastructure. But they have enshittified the free tier to the point where it’s unusable without forking over obscene amounts of money.
Plus their support is non-existent unless you are one of those big corps.
Plus, for a $1T+ company, you would think their infrastructure would be top tier: never go down, follow best practices?
Nope. us-east-1 continues to be dogshit, and their typical response is that you should fork over more money for multi-region and multi-AZ support.
And yes, the scale at which AWS advertises is largely overkill for many companies, even some Fortune 500s.
But technology is driven by clueless C-level executives who get easily impressed by deck presentations from AWS marketing.
Instead of investing in the workforce, they invest in cLoUd.
It’s a huge joke.
Computing at this scale is not marketed to flashy fanbois.
Every vain CxO is a flashy fanboi at heart