Rob (the author of the linked article) joined a few months later, and when we got too big for our Rackspace server, we looked at the cost of buying something and doing colo instead. The biggest challenge was trying to convince a vendor to let me use my Australian credit card but ship the server to a US address (we decided to use NYI for colo, based in NY). It turned out that IBM were able to do that, so they got our business. Both IBM and NYI were great for handling remote hands and hardware issues, which obviously we couldn't do from Australia.
A little bit later Bron joined us, and he automated absolutely everything, so that we were able to just have NYI plug in a new machine and it would set itself up from scratch. This all just used regular Linux capabilities and simple open source tools, plus of course a whole lot of Perl.
As the fortunes of AWS et al rose and rose and rose, I kept looking at their pricing and features and kept wondering what I was missing. They seemed orders of magnitude more expensive for something that was more complex to manage and would have locked us into a specific vendor's tooling. But everyone seemed to be flocking to them.
To this day I still use bare metal servers for pretty much everything, and still love having the ability to use simple universally-applicable tools like plain Linux, Bash, Perl, Python, and SSH, to handle everything cheaply and reliably.
I've been doing some planning over the last couple of years on teaching a course on how to do all this, although I was worried that folks are too locked in to SaaS stuff -- but perhaps things are changing and there might be interest in that after all?...
In 2006, when the first AWS instances showed up, it would take you two years of on-demand bills to match the cost of buying the hardware from a retail store and using it continuously.
Today it's anywhere from two weeks for ML workloads to three months for mid-sized instances.
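The break-even arithmetic behind those figures is simple to sketch. Here is a rough model, where every price is an illustrative assumption rather than a real quote:

```python
# Rough break-even: months of 24/7 on-demand billing that equal the purchase
# price of the hardware. All prices below are illustrative assumptions.

def breakeven_months(hardware_cost: float, hourly_rate: float) -> float:
    """Hardware price divided by a month of round-the-clock on-demand billing."""
    monthly_bill = hourly_rate * 24 * 30
    return hardware_cost / monthly_bill

# Hypothetical GPU box vs. hypothetical on-demand GPU instance rate:
print(f"ML: {breakeven_months(10_000, 30.0):.2f} months")   # ~0.46 months, about 2 weeks
# Hypothetical mid-sized server vs. hypothetical mid-sized instance rate:
print(f"mid: {breakeven_months(2_000, 0.90):.2f} months")   # ~3.1 months
```

The exact crossover obviously depends on utilization; the closer to 24/7 your workload runs, the faster owned hardware wins.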
AWS made sense in big corps, where it would take you six months to get approval to buy the hardware and another six for the software. Today I'd only use it for a prototype, and I'd move it on-prem the second it looked like it would make it past one quarter.
You are not the only one. There are several factors at play, but I believe one of the strongest today is the generational divide: people have lost the ability to manage their own infra, or don't know it well enough to do it well, so it's true when they say "It's too much hassle". I say this as an AWS guy who occasionally works on on-prem infra.[0]
[0] As a side note, I don't believe the lack of skills is the main reason organizations have problems - skills can be learned, but if you mess up the initial architecture design, fixing that can easily take years.
IDK. More and more I see the argument of “I don’t know, and we are not experts in xxx” winning as a reason why we should just spend money on 3rd-party services and products.
I have seen people getting paid 700k-plus a year spend their entire stay at companies writing papers about how they can’t do something, arguing that the obvious solution is to spend 400k-plus to have some 3rd party handle it - and getting the budget.
Let’s not get into what the conversation looks like when somebody points out that we might have an issue if we are paying somebody 700k to hire somebody else temporarily for 400k each year, and that we should find those folks who can do it for 400k and just hire them.
All this to say that being a SWE in many companies today requires no ability to create software that solves business problems, but rather to act as a sort of quasi-sysadmin-manager who will maybe write a handful of DSL scripts over the course of their career.
Your FastMail use case of (relatively) predictable server workload and product roadmap combined with agile Linux admins who are motivated to use close-to-bare-metal tools isn't an optimal cost fit for AWS. You're not missing anything and FastMail would have been overpaying for cloud.
Where AWS/GCP/Azure shine is organizations that need higher-level PaaS like managed DynamoDB, RedShift, SQS, etc that run on top of bare metal. Most non-tech companies with internal IT departments cannot create/operate "internal cloud services" that are on par with AWS.[1] Some companies like Facebook and Walmart can run internal IT departments with advanced capabilities like AWS, but most non-tech companies can't. This means paying AWS' fat profit margins can actually be cheaper than paying internal IT salaries to "reinvent AWS badly" by installing MySQL, Kafka, etc on bare-metal Linux. E.g. Netflix had their own datacenters in 2008, but a 3-day database outage that stopped them from shipping DVDs was one of the reasons they quit running their datacenters and migrated to AWS.[2] Their complex workload isn't a good fit for bare-metal Linux and bash scripts; Netflix uses a ton of high-level managed PaaS services from AWS.
If bare metal is the layer of abstraction the IT & dev departments are comfortable working at, then self-host on-premise, or co-lo, or Hetzner are all cheaper than AWS.
[1] https://web.archive.org/web/20160319022029/https://www.compu...
[2] https://media.netflix.com/en/company-blog/completing-the-net...
That said, most organizations are not nearly so agile as they'd like to believe and would probably be better off paying for something inflexible and cheap.
For some people the cloud is straight magic, but for many of us, it just represents work we don't have to do. Let "the cloud" manage the hardware and you can deliver a SaaS product with all the nines you could ask for...
> teaching a course on how to do all this ... there might be interest in that after all?
Idk about a course, but I'd be interested in a blog post or something that addresses the pain points that I conveniently outsource to AWS. We have to maintain SOC 2 compliance, and there's a good chunk of stuff in those compliance requirements around physical security and datacenter hygiene that I get to just point at AWS for.
I've run physical servers for production resources in the past, but they weren't exactly locked up in Fort Knox.
I would find some in-depth details on these aspects interesting, but from a less-clinical viewpoint than the ones presented in the cloud vendors' SOC reports.
Of course, their SOC 2 compliance doesn't mean we are absolved of securing our databases and services.
There's a big gap between throwing some compute in a closet and having someone “run the closet” for you.
There is a significantly larger gap, though, between having someone “run the closet” and building your own datacenter from scratch.
We had some old Compaq (?) servers, most of the newer stuff was Dell. Mix of windows and Linux servers.
Even with the Dell boxes, things weren't really standard across different server generations, and every upgrade was bespoke, except when we bought multiple boxes for redundancy/scaling of a particular service.
What I'd like to see is something like Oxide Computer servers that scale way down, at least to a quarter rack. Like Supermicro meets the Backblaze storage pod, but riffing on Joyent's idea of colocating storage and compute. A sort of composable mainframe for small businesses in the 2020s.
I guess maybe that is part of what Triton is all about.
But anyway - somewhere to start, and grow into the future with sensible redundancies and open source bios/firmware/etc.
Not the typical situation today, where you buy two (for redundancy) "big enough" boxes - and then need to reinvent your setup/deployment when you need two bigger boxes in three years.
Flexibility.
When Netflix wanted to start operating in Europe, we didn't have to negotiate datacenter space, order a bunch of servers, wait for racking and stacking, and all those other things. We just made an API call and had an entire stack built in Europe.
Same thing when we expanded to Asia.
It also saved us a ton of money, because our workload was about 3x peak to trough each day. We would scale up for peak, and scale down for trough.
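That 3x peak-to-trough cycle is exactly where elastic scaling earns its keep: static provisioning has to pay for peak capacity around the clock. A toy model (the diurnal load curve here is invented purely for illustration):

```python
import math

# Toy diurnal load: oscillates between a trough of 1 unit and a peak of 3 units,
# mirroring a ~3x peak-to-trough daily cycle. Purely illustrative numbers.
def load(hour: float) -> float:
    return 2.0 + math.cos(2 * math.pi * hour / 24)  # trough 1.0, peak 3.0

hours = range(24)
static_capacity = max(load(h) for h in hours) * 24   # provision for peak, all day
elastic_capacity = sum(load(h) for h in hours)       # pay only for what you use

print(f"static:  {static_capacity:.1f} unit-hours")   # 72.0
print(f"elastic: {elastic_capacity:.1f} unit-hours")  # 48.0
print(f"savings: {1 - elastic_capacity / static_capacity:.0%}")  # 33%
```

The sharper the peaks and the deeper the troughs, the bigger that savings figure gets; flat 24/7 workloads get nothing from elasticity.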
We used on-prem for the parts where that made sense -- serving the actual video bits. Those were done on custom servers with a very stripped down FreeBSD optimized just for serving video (so optimized that we still used Akamai for images). But the parts of the business that needed flexibility (control plane and interface) were all in AWS.
Why would a startup use the cloud? Both flexibility and ease. There aren't a lot of experts around who can configure a Linux box from scratch anymore. And even if you can, you can't go from coded-up idea to production in five minutes like you can with the cloud. It would take you at least a few hours to set up the bare metal the first time.
Like OVH, Hetzner or Hivelocity?
Because you can get some insane servers for like $300/month (e.g. a brand new 5th-gen Epyc 48-core, 0.5TB RAM, lots of NVMe) that are globally available.
For businesses with <10 servers and half an IT person, the cost difference is practically irrelevant. EC2+EBS+snapshots is a magic bullet abstraction for most scenarios. Bare metal is nice until parts of it start to fail on you.
I can teach someone from accounting how to restore the entire VM farm in an afternoon using the AWS web console. I've never seen an on prem setup where a similar feat is possible. There's always some weird arcane exceptions due to economic compromises that Amazon was not forced to make. When you can afford to build a fleet of data centers, you can provide a degree of standardization in product offering that is extraordinarily hard to beat. If your main goal is to chase customers and build products for them, this kind of stuff goes a long way.
Long term you should always seek total autonomy over your information technology, but you should be careful to not let that goal ruin the principal business that underlies everything.
If your infrastructure consists of ten t2.micro instances vs ten Raspberry Pis, then sure. In any other case, migrating VM or bare metal workloads from your own hardware straight onto EC2 is one of the most effective ways in the world to incinerate money.
You can do well if you've got a workload well suited to 'native' PaaS services like S3 and Lambda, but EC2 costs a fortune.
My impression is the standard compute (as in CPUs+RAM) isn't expensive, it's the storage (1 PB is less than half a rack physically now, comparing with the yearly prices listed), and so if you don't have much data, the value of on-prem isn't there.
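The half-a-rack figure for a petabyte is easy to sanity-check with current drive densities. The drive size, chassis layout, and replication factor below are assumptions, not anyone's actual build:

```python
# Sanity check: rack units needed for 1 PB usable, assuming (illustratively)
# 20 TB drives in 24-bay 2U chassis, with 3x replication for durability.
usable_tb = 1000          # 1 PB
drive_tb = 20
bays_per_2u = 24
replication = 3

raw_needed_tb = usable_tb * replication
drives = -(-raw_needed_tb // drive_tb)    # ceiling division
chassis = -(-drives // bays_per_2u)
rack_units = chassis * 2
print(drives, chassis, rack_units)        # 150 7 14
```

14U is well under half of a standard 42U rack, even before erasure coding (which would need far less than 3x raw capacity).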
To the point that we have young devs today who don't know what VPS and colo (colocation) mean.
Back to the article: I am surprised it was only "a few years ago" that Fastmail adopted SSDs, which certainly seems late in the cycle given the benefits SSDs offer.
Price for colo is on the order of $3000/2U/year. That is $125/U/month.
90% of emails are never read, and 9% are read once. What could SSDs offer for this use case except at least 2x the cost?
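Access patterns like that argue for tiering rather than going all-SSD: hot mail on flash, cold bulk on disk. A back-of-the-envelope cost model, where the per-TB prices and the 10% hot fraction are assumptions for illustration:

```python
# Tiered email storage cost sketch. Illustrative prices, not quotes:
# SSD at $80/TB, HDD at $15/TB, a 100 TB corpus, ~10% of mail ever read.
total_tb = 100
ssd_per_tb, hdd_per_tb = 80, 15
hot_fraction = 0.10   # the ~10% of mail that is actually read

all_hdd = total_tb * hdd_per_tb
all_ssd = total_tb * ssd_per_tb
tiered = (total_tb * hot_fraction * ssd_per_tb
          + total_tb * (1 - hot_fraction) * hdd_per_tb)

print(all_hdd, all_ssd, tiered)   # 1500 8000 2150.0
```

Under these assumed prices, tiering costs ~1.4x all-HDD instead of ~5.3x, while still serving reads from flash.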
I can get an entire rack at Equinix for ~$1200/mo with an unlimited 10G internet connection.
Yes! It's surprisingly common to hear it can't work, or can't scale or run reliably, when all that is done. Talking about how you've done it is great from that perspective.
Also, it's worth talking about what you gain, qualitatively! As this post mentions, your high-performance storage options are far better outside the cloud. People often mention egress, too. The appealing idea to me is using your extra flexibility to deploy better stuff, not saving a bit of cost.
Compare that to AWS, where there are 6 different kinds of remote hands, that work on all hardware and OSes, with no need for expertise, no time taken. No planning, no purchases, no shipment time, no waiting for remote hands to set it up, no diagnosing failures, etc, etc, etc...
That's just one thing. There's a thousand more things, just for a plain old VM. And the cloud provides way more than VMs.
The number of failures you can have on-prem is insane. Hardware can fail for all kinds of reasons (you must know this), and you have to have hot backup/spares, because otherwise you'll find out your spares don't work. Getting new gear in can take weeks (it "shouldn't" take that long, but there's little things like pandemics and global shortages on chips and disks that you can't predict). Power and cooling can go out. There's so many things that can (and eventually will) go wrong.
Why expose your business to that much risk, and have to build that much expertise? To save a few bucks on a server?
Prioritise simplicity.
For remote hands, 2 kinds is sufficient: IP KVM, and an actual person walking over to your machine. Can't say I've had an AWS person talk to me on a cell phone whilst standing at my server to help me sort out an issue.
It's actually really fun, and saving 90% of what can be your largest cost can actually be a fundamental driver of startup success. You can undercut the competition on price and offer stuff that's just not available otherwise.
Every time this conversation has come up online over the last few decades, there are always a few people who parrot this claim that it's all too hard. I can't imagine these comments come from people who have actually gone and done it.
Complex cloud infra can also fail for all kinds of reasons, and they are often harder to troubleshoot than a hardware failure. My experience with server grade hardware in a reliable colo with a good uplink is it's generally an extremely reliable combination.
Cloud vendors are not immune from hardware failure. What do you think their underlying infrastructure runs on, some magical contraption made from Lego bricks, Swiss chocolate, and positive vibes?
It's the same hardware, prone to the same failures. You've just outsourced worrying about it.
But, it comes at a cost. And that cost is significant. Like magnitudes significant.
At what point does it become cheaper to hire an infra engineer? Let's see.
In the US a good infra engineer might cost you $150K/yr all in. That's not taking into account freelancers/contractors who can do it for less.
That's ~$12K/mo.
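To sanity-check that number, and to see where hiring starts to pay for itself: a quick model, assuming (purely for illustration) that the cloud bill runs about 3x the equivalent colo cost:

```python
# When does an infra engineer's salary pay for itself in avoided cloud markup?
salary_yearly = 150_000
monthly_salary = salary_yearly / 12          # 12500.0, i.e. ~$12.5K/mo

# Assume, for illustration only, the cloud bill is 3x the equivalent colo cost.
markup = 3.0

def breakeven_cloud_bill(monthly_salary: float, markup: float) -> float:
    """Monthly cloud spend above which hiring + colo beats staying on cloud."""
    # cheaper when: cloud_bill > colo_cost + salary = cloud_bill/markup + salary
    return monthly_salary * markup / (markup - 1)

print(round(breakeven_cloud_bill(monthly_salary, markup)))   # 18750
```

Under that assumed markup, any shop spending more than ~$19K/mo on cloud would come out ahead hiring the engineer and colocating. Real markups and salaries vary, but the structure of the calculation holds.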
That's a lot of compute on AWS...but that's not the end of the story. Ever try getting data OUT of AWS? Yeah, those egress costs are not chump change. But that's not even the end of it.
The more important question is, what's the ratio of hosting/cloud costs to overall revenue? If colo/owned DC will yield better financials over ~few quarters, you'd be bananas as a CTO to recommend the cloud.
How do the availability/fault tolerance compare? If one of your geographical locations gets knocked out (fire, flood, network cutoff, war, whatever) what will the user experience look like, vs. what can cloud providers provide?
Now I'm wondering how much you'd look like tiangolo if you wore a moustache.
All the pro-cloud talking points are just that - talking points that don't persuade anyone with any real technical understanding, but serve to introduce doubt to non-technical people and to trick people who don't examine what they're told.
What's particularly fascinating to me, though, is how some people are so pro-cloud that they'd argue with a writeup like this with silly cloud talking points. They don't seem to care much about data or facts, just that they love cloud and want everyone else to be in cloud, too. This happens much more often on sites like Reddit (r/sysadmin, even), but I wouldn't be surprised to see a little of it here.
It makes me wonder: how do people get so sold on a thing that they'll go online and fight about it, even when they lack facts or often even basic understanding?
I can clearly state why I advocate for avoiding cloud: cost, privacy, security, a desire to not centralize the Internet. The reason people advocate for cloud for others? It puzzles me. "You'll save money," "you can't secure your own machines," "it's simpler" all have worlds of assumptions that those people can't possibly know are correct.
So when I read something like this from Fastmail which was written without taking an emotional stance, I respect it. If I didn't already self-host email, I'd consider using Fastmail.
There used to be so much push for cloud everything that an article like this would get fanatical responses. I hope that it's a sign of progress that that fanaticism is waning and people aren't afraid to openly discuss how cloud isn't right for many things.
This is false. AWS infrastructure is vastly more secure than almost all company data centers. AWS has a rule that the same person cannot have logical access and physical access to the same storage device. Very few companies have enough IT people to have this rule. The AWS KMS is vastly more secure than what almost all companies are doing. The AWS network is vastly better designed and operated than almost all corporate networks. AWS S3 is more reliable and scalable than anything almost any company could create on their own. To create something even close to it you would need to implement something like MinIO using 3 separate data centers.
Secure in what terms? Security is always about a threat model and trade-offs. There's no absolute, objective term of "security".
> AWS has a rule that the same person cannot have logical access and physical access to the same storage device.
Any promises they make aren't worth anything unless there's contractually-stipulated damages that AWS should pay in case of breach, those damages actually corresponding to the costs of said breach for the customer, and a history of actually paying out said damages without shenanigans. They've already got a track record of lying on their status pages, so it doesn't bode well.
But I'm actually wondering what this specific rule even tries to defend against? You presumably care about data protection, so logical access is what matters. Physical access seems completely irrelevant no?
> Very few companies have enough IT people to have this rule
Maybe, but that doesn't actually mitigate anything from the company's perspective? The company itself would still be in the same position, aka not enough people to reliably separate responsibilities. Just that instead of those responsibilities being physical, they now happen inside the AWS console.
> The AWS KMS is vastly more secure than what almost all companies are doing.
See first point about security. Secure against what - what's the threat model you're trying to protect against by using KMS?
But I'm not necessarily denying that (at least some) AWS services are very good. Question is, is that "goodness" required for your use-case, is it enough to overcome its associated downsides, and is the overall cost worth it?
A pragmatic approach would be to evaluate every component on its merits and fitness to the problem at hand instead of going all in, one way or another.
1. big clouds are very lucrative targets for spooks, your data seem pretty likely to be hoovered up as "bycatch" (or maybe main catch depending on your luck) by various agencies and then traded around as currency
2. you never hear about security problems (incidents or exposure) in the platforms, there's no transparency
3. better than most coporate stuff is a low bar
so let's not fight the battle that will never be won. there is no point in convincing pro-cloud people that cloud isn't the right choice and vice-versa. let people share stories where it made sense and where it didn't.
as someone who has lived in the cloud security space since 2009 (and was a founder of redlock - one of the first CSPMs), in my opinion there is no doubt that AWS is better designed than most corp networks - but is that what you really need? if you run your entire corp and LOB apps on aws but have poor security practices, will it be the right decision? what if you have the best security engineers in the world, but they are best at Cisco-type security - configuring VLANs and managing endpoints - and not good at detecting someone using IMDSv1 in ec2 exposed to the internet and running an app vulnerable to csrf?
when the scope of discussion is as vast as cloud vs on-prem, imo, it is a bad idea to make absolute statements.
having the most secure data center doesn't matter if you load your secrets as env vars in a system that can be easily compromised by a motivated attacker
so i don't buy this argument as a general reason pro-cloud
It’s like putting something in someone’s desk drawer under the guise of convenience at the expense of security.
Why?
Too often, someone other than the data owner has or can get access to the drawer directly or indirectly.
Also, Cloud vs self hosted to me is a pendulum that has swung back and forth for a number of reasons.
The benefits of the cloud outlined here are often a lot of open source tech packaged up and sold as manageable from a web browser, or a command line.
One of the major reasons the cloud became popular was networking issues in Linux when managing volume at scale. At the time, the cloud became very attractive for that reason, plus being able to virtualize bare-metal servers into any combination of local and cloud hosting.
Self-hosting has become easier by an order of magnitude or two for anyone who knew how to do it, though it's not something people who haven't done both self-hosting and cloud can really weigh in on.
Cloud has abstracted away the cost of horsepower, and converted it to transactions. People are discovering a fraction of the horsepower is needed to service their workloads than they thought.
At some point the horsepower got way beyond what they needed and it wasn’t noticed. But paying for a cloud is convenient and standardized.
Company data centres can be reasonably secured using a number of PaaS or IaaS solutions readily available off the shelf. Tools from VMware, Proxmox and others are tremendous.
It may seem like there’s a lot to learn, but most problems that are new to someone have already been thought through extensively by people whose experience goes beyond cloud-only.
The biggest problem the cloud solves is hardware supply chain management. To realize the full benefits of doing your own build at any kind of non-trivial scale you will need to become an expert in designing, sourcing, and assembling your hardware. Getting hardware delivered when and where you need it is not entirely trivial -- components are delayed, bigger customers are given priority allocation, etc. The technical parts are relatively straightforward; managing hardware vendors, logistics, and delivery dates on an ongoing basis is a giant time suck. When you use the cloud, you are outsourcing this part of the work.
If you do this well and correctly then yes, you will reduce costs several-fold. But most people that build their own data infrastructure do a half-ass job of it because they (understandably) don't want to be bothered with any of these details and much of the nominal cost savings evaporate.
Very few companies do security as well as the major cloud vendors. This isn't even arguable.
On the other hand, you will need roughly the same number of people for operations support whether it is private data infrastructure or the cloud, there is little or no savings to be had here. The fixed operations people overhead scales to such a huge number of servers that it is inconsequential as a practical matter.
It also depends on your workload. The types of workloads that benefit most from private data infrastructure are large-scale data-intensive workloads. If your day-to-day is slinging tens or hundreds of PB of data for analytics, the economics of private data infrastructure are extremely compelling.
You can rent servers and it's still not cloud.
I'm pretty neutral and definitely see the value of cloud. But a lot of cloud proponents seem to lack what, to me, seems like basic knowledge.
Isn't the job to be bothered with the details? 90% of employment for most people is doing shit you don't really want to be doing, but that's the job.
And IAM and other cloud security and management considerations are where the opex/capex and capability argument can start to break down. It turns out the "cloud" savings come from not having the capabilities in-house to manage hardware. For most businesses, though, you sometimes want some of that lovely reliability.
(In short, I agree with you, substantially).
Like code. It is easy to get something basic up, but substantially more resources are needed for non-trivial things.
I self-host a lot of things, but boy oh boy if I were running a company it would be a helluvalotta work to get IAM properly set up.
Something people neglect to mention when they tout their home grown cloud is that AWS spends significant cycles constantly eliminating technical debt that would absolutely destroy most companies - even ones with billion dollar services of their own. The things you rely on are constantly evolving and changing. It’s hard enough to keep up at the high level of a SaaS built on top of someone else’s bulletproof cloud. But imagine also having to keep up with the low level stuff like networking and storage tech?
No thanks.
With the cloud you have IT/DevOps deal only with scaling the software components of the infra. When doing on-prem they take on the physical layer as well. Do you have enough trust in them to scale the physical part where needed?
This is a very engineer-centric take. The cloud has some big advantages that are entirely non-technical:
- You don't need to pay for hardware upfront. This is critical for many early-stage startups, who have no real ability to predict CapEx until they find product/market fit.
- You have someone else to point the SOC2/HIPAA/etc auditors at. For anyone launching a company in a regulated space, being able to checkbox your entire infrastructure based on AWS/Azure/etc existing certifications is huge.
I would assume you still need to point auditors to your software in any case
This is worth astronomical amounts of money in big corps.
But once those are set up, how is it different? AWS is quite clear with their responsibility model that you still have to tune your DB, for example. And for the setup, just as there are Terraform modules to do everything under the sun, there are Ansible (or Chef, or Salt…) playbooks to do the same. For both, you _should_ know what all of the options are doing.
The only way I see this sentiment being true is that a dev team, with no infrastructure experience, can more easily spin up a lot of infra – likely in a sub-optimal fashion – to run their application. When it inevitably breaks, they can then throw money at the problem via vertical scaling, rather than addressing the root cause.
(To be fair, I can see why they did it - a lot of deployments were an absolute mess before.)
What do you mean, I can't scale up because I've used my hardware capex budget for the year?
Number one is company bureaucracy and politics. No one wants to beg another person or department, go on endless meetings just to have extra hardware provisioned. For engineers that alone is worth perhaps 99% of all current cloud margins.
Number two is also company bureaucracy and politics. CFOs don't like CapEx. Turning it into OpEx makes things easier for them, along with end-of-year company budget turning into cloud credits for different departments. Especially for companies with government funding.
Number three is really company bureaucracy and politics. Dealing with Google, AWS, or Microsoft means you no longer have to deal with dozens of different vendors for servers, networking hardware, software licenses, etc. Instead it is all pre-approved into AWS, GCP, or Azure. This is especially useful for things that involve government contracts or funding.
There are also things like instant worldwide deployment. You can have things up and running in any region within seconds. It is also useful when you have a site that gets 10 to 1000x its normal traffic from time to time.
But then a lot of small businesses don't have these sorts of issues. Especially non-consumer-facing services: business or SaaS products are highly unlikely to gain 10x more customers within a short period of time.
I continue to wish there were a middle ground somewhere: you rent a dedicated server for cheap as base load and use cloud for everything else.
The discussion matters when we are talking about building things: whether you self-host or use managed services is a set of interesting trade-offs.
To be fair, you did say “my tune might change past a certain size.” At small scale, nothing you do within reason really matters. World’s worst schema, but your DB is only seeing 100 QPS? Yeah, it doesn’t care.
PaaS is probably the way to go for small apps.
AWS, on the other hand, seems about as time consuming and hard as using root servers. You're at a higher level of abstraction, but the complexity is about the same I'd say. At least that's my experience.
You're not monitoring your deployments because "cloud"?
And moreover most of the actual interesting things, like having VM templates and stateless containers, orchestration, etc. is very easy to run yourself and gets you 99.9% of the benefits of the cloud.
Just about any service is available as a container file already written for you. And if one doesn't exist, it's not hard to plumb up.
A friend of mine runs more than 700 containers (yup, seven hundred), split between his own rack at home (half of them) and dedicated servers (he runs stuff like FlightRadar, AI models, etc.). He'll soon get his own IP address space. It's a complete "chaos monkey"-ready infra where you can cut any cable and the thing keeps working: everything is duplicated, can be spun up on demand, etc. Someone could steal his entire rack and all his dedicated servers, and he'd still be back operational in no time.
If an individual can do that, a company, no matter its size, can do it too. And arguably 99.9% of companies out there don't need an infra as powerful as the one most homelab enthusiasts have.
And another thing: there are even two in-betweens between "cloud" and "our own hardware located at our company". The first is colocating your own hardware in a datacenter. The second is renting dedicated servers from a datacenter.
They're often ready to accept cloud-init directly.
And it's not hard. I'd say learning to configure hypervisors on bare metal, then spin up VMs from templates, then run containers inside the VMs is actually much easier than learning all the idiosyncrasies of all the different cloud vendors' APIs and whatnot.
Funnily enough when the pendulum swung way too far on the "cloud all the things" side, those saying at some point we'd read story about repatriation were being made fun of.
Fully agreed. I don't have physical HA – if someone stole my rack, I would be SOL – but I can easily ride out a power outage for as long as I want to be hauling cans of gasoline to my house. The rack's UPS can keep it up at full load for at least 30 minutes, and I can get my generator running and hooked up in under 10. I've done it multiple times. I can lose a single server without issue. My only SPOF is internet, and that's only by choice, since I can get both AT&T and Spectrum here, and my router supports dual-WAN with auto-failover.
> And arguably 99.9% of all the companies out there don't have the need for an infra as powerful as the one most homelab enthusiast have.
THIS. So many people have no idea how tremendously fast computers are, and how much of an impact latency has on speed. I've benchmarked my 12-year old Dells against the newest and shiniest RDS and Aurora instances on both MySQL and Postgres, and the only ones that kept up were the ones with local NVMe disks. Mine don't even technically have _local_ disks; they're NVMe via Ceph over Infiniband.
Does that scale? Of course not; as soon as you want geo-redundant, consistent writes, you _will_ have additional latency. But most smaller and medium companies don't _need_ that.
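The latency point compounds quickly because applications tend to issue queries sequentially. A quick model of how per-query round-trip time multiplies over a request (the RTTs and query count below are assumed for illustration, not measured):

```python
# How per-query round-trip latency compounds over a single request.
# RTT figures are illustrative assumptions, not benchmarks.
queries_per_request = 50        # sequential queries a chatty ORM might issue
local_rtt_ms = 0.25             # same-rack, NVMe-backed database
cloud_rtt_ms = 1.5              # managed database a few network hops away

local_total = queries_per_request * local_rtt_ms
cloud_total = queries_per_request * cloud_rtt_ms
print(f"local: {local_total} ms, cloud: {cloud_total} ms")
# local: 12.5 ms, cloud: 75.0 ms
```

Same hardware-class database, but the extra network hops alone turn a barely-noticeable page into a sluggish one, before any query even does real work.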
In particular, there is a limit to paying for competence, and paying more money doesn't automatically get you more competence, which is especially perilous if your organization lacks the competence to judge competence. In the limit case, this gets you the Big N consultancies like PWC or EY. It's entirely reasonable to hire PWC or EY to run your accounting or compliance. Hiring PWC or EY to run your software development lifecycle is almost guaranteed doom, and there is no shortage of stories on this site to support that.
In comparison, if you're one of these organizations that doesn't yet have baseline competence in technology, then what the public cloud is selling is nothing short of magical: you pay money and, in return, you receive a baseline set of tools, which all do more or less what they say they will do. If no amount of money would let you bootstrap this competence internally, you'd be much more willing to pay a premium for it.
As an anecdote, my much younger self worked in mid-sized tech team in a large household brand in a legacy industry. We were building out a web product that, for product reasons, had surprisingly high uptime and scalability requirements, relative to legacy industry standards. We leaned heavily on public cloud and CDNs. We used a lot of S3 and SQS, which allowed us to build systems with strong reliability characteristics, despite none of us having that background at the time.
I think there are accounting reasons for companies to prefer paying opex to run things on the cloud instead of more capex-intensive self-hosting, but I don’t understand the dynamics well.
It’s certainly the case that clouds tend to be more expensive than self-hosting, even when taking account of the discounts that moderately sized customers can get, and some of the promises around elastic scaling don’t really apply when you are bigger.
To some of your other points: the main customers of companies like AWS are businesses. Businesses generally don’t care about the centralisation of the internet. Businesses are capable of reading the contracts they are signing and not signing them if privacy (or, typically more relevant to businesses, their IP) cannot be sufficiently protected. It’s not really clear to me that using a cloud is going to be less secure than doing things on-prem.
This is where you lose all credibility.
I'm going to focus on a single aspect: performance. If you're serving a global user base and your business, like practically all online businesses, is greatly impacted by performance problems, the only solution to a physics problem is to deploy your application closer to your users.
With any cloud provider that's done with a few clicks and an invoice of a few hundred bucks a month. If you're running your own hardware... what solution do you have to show for it? Do you hope to create a corporate structure to rent a place to host your hardware, manned by a dedicated team? What options do you have?
I ping HN, it's 150ms away, it still renders in the same time that the Google frontpage does and that one has a 130ms advantage.
Getting the hardware closer to the users has always been trivial - call up any of the many hosting providers out there and get a dedicated server, or a colo and ship them some hardware (directly from the vendor if needed).
People who write that, well...
If you're greatly impacted by performance problems, how does that become a physics problem whose only solution is being closer to your users?
I think you're mixing up your sales points. One, how do you scale hardware? Simple: you buy some more, and/or you plan for more from the beginning.
How do you deal with network latency for users on the other side of the planet? Either you plan for and design for long tail networking, and/or you colocate in multiple places, and/or you host in multiple places. Being aware of cloud costs, problems and limitations doesn't mean you can't or shouldn't use cloud at all - it just means to do it where it makes sense.
You're making my point for me - you've got emotional generalizations ("you lose all credibility"), you're using examples that people use often but that don't even go together, plus you seem to forget that hardly anyone advocates for all one or all the other, without some kind of sensible mix. Thank you for making a good example of exactly what I'm talking about.
I love building the cool edge network stuff with expensive bleeding-edge hardware, SmartNICs, NVMe-oF, etc., but it's infinitely more complicated and stressful than terraforming an AWS infra. Every cluster I set up, I had to interact with multiple teams: networking, security, storage, sometimes maintenance/electrical, etc. You've got some random tech you have to rely on across the country in one of your POPs with a blown server. Every single hardware infra person has had a NOC tech kick/unplug a server at least once if they've been in long enough.
And then when I get the hardware, sometimes you have different people doing different parts of setup: the NOC does the boot, maybe bootstraps the hardware with something that works over SSH before an agent is installed (Ansible, etc.), then your Linux eng invokes their magic with a ton of Bash or Perl, then your k8s person sets up the k8s clusters, usually with something like Terraform/Puppet/Chef/Salt, probably calling Helm charts. Then your monitoring person gets it into OTEL/Grafana, etc. This all organically becomes more automated as time goes on, but I've seen it from a brand-new infra with no automation many times.
Now you're automating 90% of this via scripts and IaC, etc., but you're still doing a lot of tedious work.
You also have a much more difficult time hiring good engineers. The market's gone so heavily AWS (I'm no help) that it's rare I come across an ops resume that's ever touched hardware, especially not at the CDN distributed-systems level.
So... AWS is the chill infra that stays online and that you can basically rely on 99.99-something% of the time. Get some Terraform blueprints going and your own developers can self-serve. No need to get hardware or ops involved.
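For illustration, a self-serve blueprint can be as small as this (a hypothetical sketch; the AMI id, region, instance size, and tags are placeholders, not a recommendation):

```hcl
# main.tf - minimal blueprint a product team could copy and adapt
terraform {
  required_providers {
    aws = { source = "hashicorp/aws" }
  }
}

provider "aws" {
  region = "us-east-1" # placeholder region
}

resource "aws_instance" "app" {
  ami           = "ami-0123456789abcdef0" # placeholder AMI id
  instance_type = "t3.small"              # placeholder size
  tags = {
    Team = "product-x" # placeholder tag so ops can attribute the bill
  }
}
```

A `terraform plan` / `terraform apply` cycle from such a blueprint is the self-serve loop being described, with no hardware or ops hand-off in the middle.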
And none of this is even getting into supporting the clusters. Failing clusters. Dealing with maintenance, zero downtime kernel upgrades, rollbacks, yaddayadda.
I’ve worked at tech companies with hundreds of developers and single digit ops staff. Those people will struggle to build and maintain mature infra. By going cloud, you get access to mature infra just by including it in build scripts. Devops is an effective way to move infra back to project teams and cut out infra orgs (this isn’t great but I see it happen everywhere). Companies will pay cloud bills but not staffing salaries.
Computation has become a utility these days - this includes the fat ISP lines and connectivity etc, not just the CPU and harddrives. These things have economies of scale that smaller companies cannot truly reach, and will pay a huge fixed cost if they want state of the art management, monitoring and redundancy. So unless you are a massive consumer, just like power stations, you really don't need nor want to build your own.
The irony is absolutely dripping off this comment, wow.
The commenter makes an emotionally charged comment with no data or facts, then decries anyone who disagrees with them as repeating "silly talking points" and not caring about data and facts.
Your comment is entirely talking about itself.
DevOps and kubernetes come to mind. A lot of people using kubernetes don't know what they're getting into, and k0s or another single machine solution would have been enough for 99% of SMEs.
In terms of cyber security (my field), everything got so ridiculously complex that even the folks using 3 different dashboards in parallel will guess the answers as to whether or not they're affected by a bug/RCE/security flaw/weakness, because all of the data sources (even the expensively paid-for ones) are human-edited text databases. They're so buggy that they even have Chinese ideograms instead of a dot character in the version fields, without anyone ever fixing it upstream in the NVD/CVE process.
I started to build my EDR agent for POSIX systems specifically because I hope that at some point it can help companies ditch the cloud and self-host again, which in turn would indirectly prevent 13-year-old kids like those from LAPSUS$ from pwning major infrastructure via simple tech-support hotline calls.
When I think of it in terms of hosting, the vertical scalability of EPYC machines is so high that most of the time when you need its resources you are either doing something completely wrong and you should refactor your code or you are a video streaming service.
I'd expect that there are people who moved to the cloud then, and over time started using services offered by their cloud provider (e.g., load balancers, secret management, databases, storage, backup) instead of running those services themselves on virtual machines, and now even if it would be cheaper to run everything on owned servers they find it would be too much effort to add all those services back to their own servers.
Elasticity is a component, but has always been from a batch job bin packing scheduling perspective, not much new there. Before k8s and Nomad, there was Globus.org.
(Infra/DevOps in a previous life at a unicorn, large worker cluster for a physics experiment prior, etc; what is old is a new again, you’re just riding hype cycle waves from junior to retirement [mainframe->COTS on prem->cloud->on prem cloud, and so on])
2. People therefore repeat talking points which seem in their interest
3. With enough repetition these become their beliefs
4. People will defend their beliefs as theirs against attack
5. Goto 1
That came from technical people who I didn't perceive as being dogmatically pro-cloud.
As an industry, we are largely trading correctness and performance for convenience, and this is not seen as a negative by most. What kills me is that at every cloud-native place I've worked at, the infra teams were both responsible for maintaining and fixing the infra that product teams demanded, but were not empowered to push back on unreasonable requests or usage patterns. It's usually not until either the limits of vertical scaling are reached, or a SEV0 occurs where these decisions were the root cause does leadership even begin to consider changes.
If you enable Multi-AZ for RDS, your bill doubles until you cancel. If you set up two servers in two DCs, your initial bill doubles from the CapEx, and then a very small percentage of your OpEx goes up every month for the hosting. You very, very quickly make this back compared to cloud.
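As a back-of-the-envelope sketch (all the numbers below are made up for illustration, not actual AWS or colo prices), the break-even point lands within a couple of years:

```shell
# Hedged sketch with invented numbers: months until two self-hosted DB
# servers (one-off capex plus colo opex) become cheaper than a doubled
# managed-database bill.
CLOUD_MONTHLY=580    # hypothetical Multi-AZ managed-DB bill per month
CAPEX=8000           # hypothetical one-off: two servers, one per DC
SELF_MONTHLY=300     # hypothetical colo + power + bandwidth for both

month=1
# keep counting while cumulative self-hosted cost exceeds cumulative cloud cost
while [ $((CAPEX + SELF_MONTHLY * month)) -gt $((CLOUD_MONTHLY * month)) ]; do
  month=$((month + 1))
done
echo "break-even at month $month"
# prints: break-even at month 29
```

After the break-even month, the OpEx gap compounds in the self-hosted setup's favor every month, which is the "very quickly make this back" effect.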
After that, yeah we’ll let AWS do the hard work of enabling redundancy for us.
I feel like this can be applied to anything.
I had a manager take one SAFe for Leaders class and then come back wanting to implement it. They had no previous Agile classes or experience. And the Enterprise Agile Office was saying DON'T USE SAFe!!
But they had one class and that was the only way they would agree to structure their group.
I once worked for several years at a publicly traded firm well-known for their return-to-on-prem stance, and honestly it was a complete disaster. The first-party hardware designs didn't work right because they didn't have the hardware-design staffing levels to have de-risked the possibility that AMD would fumble the performance of Zen 1, leaving them with a generation of useless hardware they nonetheless paid for. The OEM hardware didn't work right because they didn't have the chops to qualify it either, leaving them scratching their heads for months over a cohort of servers they eventually discovered were contaminated with metal chips. And, most crucially, for all the years I worked there, the only thing they wanted to accomplish was failover from West Coast to East Coast, which never worked, not even once. When I left that company they were negotiating with the data center owner, who wanted to triple the rent.
These experiences tell me that cloud skeptics are sometimes missing a few terms in their equations.
It's been my experience that those who can build good, reliable, high-quality systems, can do so either in the cloud or on-prem, generally with equal ability. It's just another platform to such people, and they will use it appropriately and as needed.
Those who can only make it work in the cloud are either building very simple systems (which is one place where the cloud can be appropriate), or are building a house of cards that will eventually collapse (or just cost them obscene amounts of money to keep on life support).
Engineering is engineering. Not everyone in the business does it, unfortunately.
Like everything, the cloud has its place -- but don't underestimate the number of decisions that get taken out of the hands of technical people by the business people who went golfing with their buddy yesterday. He just switched to Azure, and it made his accountants really happy!
The whole CapEx vs. OpEx issue drives me batty; it's the number one cause of cloud migrations in my career. For someone who feels like spent money should count as spent money regardless of the bucket it comes out of, this twists my brain in knots.
I'm clearly not a finance guy...
Yes. Mass psychosis explains an incredible number of different and apparently unrelated problems with the industry.
Those providers take on the liability of sourcing, managing, and maintaining the hardware for a flat monthly fee, and the risk that comes with it. If they make a bad bet purchasing hardware, you won't be on the hook for it.
This seems like a point many pro-cloud people (intentionally?) overlook.
What's the market share of Windows again? ;)
> If I didn't already self-host email
this really says all that needs to be said about your perspective. you have an engineer and OSS advocate's mindset. which is fine, but most business leaders (including technical leaders like CTOs) have a business mindset, and their goal is to build a business that makes money, not avoid contributing to the centralization of the internet
From a cost PoV, sure, but when you're taking money out of capex it represents a big hit to the cash flow, while taking out twice that amount from opex has a lower impact on the company finances.
I use AWS cloud a lot, and almost never use any VMs or instances. Most instances I use are along the lines of a simple anemic box for a bastion host or some such.
I use higher level abstractions (services) to simplify solutions and outsource maintenance of these services to AWS.
You can't even blame them too much, the amount of cash poured into cloud marketing is astonishing.
Cloud has definite advantages in some circumstances, but so does self-hosting; moreover, understanding the latter makes the former much, much easier to reason about. It’s silly to limit your career options.
It seems like they all abandoned their VMware farms or physical server farms for Azure (they love Microsoft).
Are they actually saving money? Are things faster? How's performance? What was the re-training/hiring like?
In one case I know we got rid of our old database greybeards and replaced them with "DevOps" people that knew nothing about performance etc
And the developers (and many of the admins) we had knew nothing about hardware or anything so keeping the physical hardware around probably wouldn't have made sense anyways
That is, even if things became cheaper/faster, they might have been even better without cloud infrastructure.
It seems a lot of those DevOps people just see Azure's recommendations for adding indexes and either allow auto-applying them or add them without actually reviewing or understanding which workloads require them and why. Some of this also lands on developers/product who don't think critically about, or communicate, which queries are common, and who put no forethought into which indexes would be beneficial. (Yes, follow-up monitoring of actual index usage and possible missing indexes is still needed.) Too many times I've seen dozens of indexes on tables in the cloud where one could have covered all of them. There might still be worthwhile reasons to keep some narrower/smaller indexes, but DBA work and critical query analysis seem to be forgotten and neglected skills. No one owns monitoring and analysing DB queries, and it only comes up after a fire has already broken out.
There is a size where self-hosting makes sense, but it's much larger than you think.
> a desire to not centralize the Internet
This is an ideological stance! I happen to share this desire. But you should be aware of your own non-technical - "emotional" - biases when dismissing the arguments of others on the grounds that they are "emotional" and "fanatical".
I do think it's more than just emotional, though, but most people, even technical people, haven't taken the time to truly consider the problems that will likely come with centralization. That's a whole separate discussion, though.
There's not nearly enough in here to make a judgment about things like security or privacy. They have the bare minimum encryption enabled. That's better than nothing. But how is key access handled? Can they recover your email if the entire cluster goes down? If so, then someone has access to the encryption keys. If not, then how do they meet reliability guarantees?
Three letter agencies and cyber spies like to own switches and firewalls with zero days. What hardware are they using, and how do they mitigate against backdoors? If you really cared about this you would have to roll your own networking hardware down to the chips. Some companies do this, but you need to have a whole lot of servers to make it economical.
It's really about trade-offs. I think the big trade-offs favoring staying off cloud are cost (in some applications), distrust of the cloud providers, and avoiding the US Government.
The last two are arguably judgment calls that have some inherent emotional content. The first is calculable in principle, but people may not be using the same metrics. For example if you don't care that much about security breaches or you don't have to provide top tier reliability, then you can save a ton of money. But if you do have to provide those guarantees, it would be hard to beat Cloud prices.
I’m sure I’ll be downvoted to hell for this, but I’m convinced that it’s largely their insecurities being projected.
Running your own hardware isn’t tremendously difficult, as anyone who’s done it can attest, but it does require a much deeper understanding of Linux (and of course, any services which previously would have been XaaS), and that’s a vanishing trait these days. So for someone who may well be quite skilled at K8s administration, serverless (lol) architectures, etc. it probably is seen as an affront to suggest that their skill set is lacking something fundamental.
And running your own hardware is not incompatible with Kubernetes: on the contrary. You can fully well have your infra spin up VMs and then do container orchestration if that's your thing.
And part of your hardware monitoring and reporting tooling can work perfectly fine from containers.
Bare metal -> Hypervisor -> VM -> container orchestration -> a container running a "stateless" hardware monitoring service. And VMs themselves are "orchestrated" too. Everything can be automated.
Anyway, say a hard disk begins to show errors: notifications get sent (email/SMS/Telegram/whatever) by another service in another container, and the dashboard shows it too (dashboards are cool).
Go to the machine once the spare disk has already been resilvered, move it to where the failed disk was, and plug in a new disk that becomes the new spare.
Boom, done.
I'm not saying all self-hosted hardware should do container orchestration: there are valid use cases for bare metal too.
But something has to be said for controlling everything on your own infra: from the bare metal to the VMs to container orchestration, even, potentially, your own IP address space.
This is all within reach of an individual, both skill-wise and price-wise (including obtaining your own IP address space). People who drank the cloud kool-aid should ponder this and wonder how good their skills truly are if they cannot get this up and working.
Add in compliance, auditing, etc. - all things that you can set up out of the box (PCI, HIPAA, lawsuit retention). It gets even cheaper.
Same sentiment all of what you said.
Are you new to the internet?
This feels like "no true scotsman" to me. I've been building software for close to two decades, but I guess I don't have "any real technical understanding" because I think there's a compelling case for using "cloud" services for many (honestly I would say most) businesses.
Nobody is "afraid to openly discuss how cloud isn't right for many things". This is extremely commonly discussed. We're discussing it right now! I truly cannot stand this modern innovation in discourse of yelling "nobody can talk about XYZ thing!" while noisily talking about XYZ thing on the lowest-friction publishing platforms ever devised by humanity. Nobody is afraid to talk about your thing! People just disagree with you about it! That's ok, differing opinions are normal!
Your comment focuses a lot on cost. But that's just not really what this is all about. Everyone knows that on a long enough timescale with a relatively stable business, the total cost of having your own infrastructure is usually lower than cloud hosting.
But cost is simply not the only thing businesses care about. Many businesses, especially new ones, care more about time to market and flexibility. Questions like "how many servers do we need? with what specs? and where should we put them?" are a giant distraction for a startup, or even for a new product inside a mature firm.
Cloud providers provide the service of "don't worry about all that, figure it out after you have customers and know what you actually need".
It is also true that this (purposefully) creates lock-in that is expensive either to leave in place or unwind later, and it definitely behooves every company to keep that in mind when making architecture decisions, but lots of products never make it to that point, and very few of those teams regret the time they didn't spend building up their own infrastructure in order to save money later.
For businesses, it's a very typical lease-or-own decision. There's really nothing too special about cloud.
> On the other hand, a business of just about any size that has any reasonable amount of hosting is better off with their own systems when it comes purely to cost.
Nope. Not if you factor-in 24/7 support, geographic redundancy, and uptime guarantees. With EC2 you can break even at about $2-5m a year of cloud spending if you want your own hardware.
If we used AWS, we could skip months of certification. If we use a custom data center, we have to certify it ourselves (muuuuuch more expensive).
From this standpoint, cloud beats on-premise.
If you have predictable workloads, a competent engineering culture that fights against process culture, and are willing to spend the money to have good hardware and the people to man it 24x7x365 then I don’t think cloud makes sense at all. Seems like that’s what y’all have and you should keep up with it.
If it takes this long to manage a machine, I strongly suspect it means that when initially designing the system engineers had failed to account for those for some reason. Was that true in your case?
Back in the late '00s until the mid '10s, I worked for an ISP startup as a SWE. We had a few core machines (database, RADIUS server, self-service website, etc.) - an ugly mess TBH - initially provisioned and originally managed entirely by hand, as we didn't know any better back then. Naturally, maintaining those was a major PITA, so they sat on the same dated distro for years. That was before Ansible was a thing, and we hadn't really heard about Salt or Chef before we started to feel the pains and began searching for solutions. Virtualization (OpenVZ, then Docker) helped to soften a lot of issues, making it significantly easier to maintain the components, but the pains from our original sins were felt for a long time.
But we also had a fleet of other machines, where we understood our issues with the servers enough to design new nodes to be as stateless as possible, with automatic rollout scripts for whatever we were able to automate. Provisioning a new host took only a few hours, with most time spent unpacking, driving, accessing the server room, and physically connecting things. Upgrades were pretty easy too - reroute customers to another failover node, write a new system image to the old one, reboot, test, re-route traffic back, done.
So it's not like self-owned bare metal is harder to manage - the lesson I learned is that one just gotta think ahead of time what the future would require. Same as the clouds, I guess, one has to follow best practices or they'll end up with crappy architectures that will be painful to rework. Just different set of practices, because of the different nature of the systems.
Are you running a well-understood and predictable (as in, little change, growth, or feature additions) system? Are your developers handing over to central platform/infra/ops teams? You'll probably save some cash by buying and owning the hardware you need for your use case(s). Elasticity is (probably) not part of your vocabulary, perhaps outside of "I wish we had it" anyway.
Have you got teams and/or products that are scaling rapidly or unpredictably? Have you still got a lot of learning and experimenting to do with how your stack will work? Do you need flexibility but can't wait for that flexibility? Then cloud is for you.
n.b. I don't think I've ever felt more validated by a post/comment than yours.
Bonus points: they can do it with spot pricing to further lower the bill.
The cloud offers immense flexibility and empowers _developers_ to easily manage their own infrastructure without depending on other teams.
Speed of development is the primary reason $DayJob is moving into the cloud, while maintaining bare-metal for platforms which rarely change.
You could get same day builds deployed on prem with the right support bundle!
in case you want to ballpark-estimate your move off of the cloud
Bonus points: I'm a Fastmail customer, so it tangentially tracks
----
Quick note about the article: ZFS encryption can be flaky - be sure you know what you're doing before deploying it in your infrastructure.
Relevant Reddit discussion: https://www.reddit.com/r/zfs/comments/1f59zp6/is_zfs_encrypt...
A spreadsheet of related issues that I can't remember who made:
https://docs.google.com/spreadsheets/d/1OfRSXibZ2nIE9DGK6sww...
This is the current script - it runs every minute for each pool synced between the two log servers: https://gist.github.com/brong/6a23fee1480f2d62b8a18ade5aea66...
LUKS2 has up to 32 key slots (up from 8 in LUKS1).
I run ZoL over LUKS2 and it works great.
1) "At this rate, we’ll replace these [SSD] drives due to increased drive sizes, or entirely new physical drive formats (such E3.S which appears to finally be gaining traction) long before they get close to their rated write capacity."
and
2) "We’ve also anecdotally found SSDs just to be much more reliable compared to HDDs (..) easily less than one tenth the failure rate we used to have with HDDs."
The new NVMe drives we've only had for a few years, but so far there's only been a single failure across the whole fleet, and we keep spares in stock. It's been very reliable, not like the weeks back in (hmm, 2006? 2007?) the ancient past, when we were losing 15kRPM velociraptors every other day. They had a firmware fault and we eventually got an update which made them reliable, but it was a wild few months.
We had one outage where key rotation had been enabled on reboot, so data partitions were lost after what should have been a routine crash. Overall, for data warehousing, our failure rate on on-prem (DC-hosted) hardware was lower IME.
Things like identity management (AAD/IAM), provisioning and running VMs, deployments. The network side of things like VNets, DNS, securely opening ports, etc. Monitoring setup across the stack. There is so much functionality required to safely expose an application externally that I can't even coherently list it all here. Are people just using SaaS for everything (which I think would defeat the purpose of on-prem infra), or can a competent sysadmin handle all this to give a cloud-like experience to end developers?
Can someone share their experience or share any write ups on this topic?
For more context: I briefly worked at a very large hedge fund which had a small DC's worth of VERY beefy machines but absolutely no platform on top of them. Hosting an application was done by copying the binaries onto a particular well-known machine, running npm commands, and restarting nginx. You'd log a ticket with the sysadmin to reserve a name and point an internal DNS entry at this machine (no load balancer). Deployment was a shell script which rcp'd new binaries and restarted nginx. No monitoring or observability stack. There was a script which would log you into a random machine for you to run your workloads (be ready to get angry IMs from more senior quants running their workloads on that random machine if your development build takes up enough resources to affect their work). I can go on and on, but I think you get the idea.
It's clunky, but simple, repeatable, and easily understood.
As for the bigger things, software etc - we have scripts that generate Debian packages which we store in our own private repo. You just install `fastmail-server` and the dependency management updates everything. There's a daily cronjob which checks if there are updated security packages or thing we failed to correctly deploy and emails us as well.
It's amazing what you can build on top of the OS provided tools with not too much complexity if you don't overthink it.
Do you mean for administrative access to the machines (over SSH, etc) or for "normal" access to the hosted applications?
Admin access: Ansible-managed set of UNIX users & associated SSH public keys, combined with remote logging so every access is audited and a malicious operator wiping the machine can't cover their tracks will generally get you pretty far. Beyond that, there are commercial solutions like Teleport which provide integration with an IdP, management web UI, session logging & replay, etc.
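A minimal sketch of the Ansible-managed side (the variable name, key layout, and group are hypothetical, not a standard):

```yaml
# tasks/admin-users.yml - hypothetical task file; `admin_users` would be
# defined in group_vars, e.g. [{name: alice}, {name: bob}]
- name: Create admin users
  ansible.builtin.user:
    name: "{{ item.name }}"
    groups: sudo
    shell: /bin/bash
  loop: "{{ admin_users }}"

- name: Install their SSH public keys
  ansible.posix.authorized_key:
    user: "{{ item.name }}"
    key: "{{ lookup('file', 'keys/' + item.name + '.pub') }}"
  loop: "{{ admin_users }}"
```

Removing someone then means dropping them from `admin_users` (and adding a removal task with `state: absent`), re-running the playbook, and the fleet converges.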
Normal line-of-business access: this would be managed by whatever application you're running, not much different to the cloud. But if your application isn't auth-aware or is unsafe to expose to the wider internet, you can stick it behind various auth proxies such as Pomerium - it will effectively handle auth against an IdP and only pass through traffic to the underlying app once the user is authenticated. This is also useful for isolating potentially vulnerable apps.
> provisioning and running VMs
Provisioning: once a VM (or even a physical server) is up and running enough to be SSH'd into, you should have a configuration management tool (Ansible, etc) apply whatever configuration you want. This would generally involve provisioning users, disabling some stupid defaults (SSH password authentication, etc), installing required packages, etc.
To get a VM to an SSH'able state in the first place, you can configure your hypervisor to pass through "user data" which will be picked up by something like cloud-init (integrated by most distros) and interpreted at first boot - this allows you to do things like include an initial SSH key, create a user, etc.
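A minimal cloud-init user-data sketch along those lines (the username, key, and package choice are placeholders):

```yaml
#cloud-config
# Hypothetical first-boot configuration passed to the VM as user data.
users:
  - name: ops                       # placeholder admin user
    groups: [sudo]
    shell: /bin/bash
    sudo: ALL=(ALL) NOPASSWD:ALL
    ssh_authorized_keys:
      - ssh-ed25519 AAAA... ops@example   # placeholder public key
package_update: true
packages: [python3]   # enough for a config management tool to take over
ssh_pwauth: false     # disable SSH password auth from the very first boot
```

Once the machine is reachable as `ops@<ip>`, the configuration management tool from the previous step does the rest.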
To run VMs on self-managed hardware: libvirt, proxmox in the Linux world. bhyve in the BSD world. Unfortunately most of these have rough edges, so commercial solutions there are worth exploring. Alternatively, consider if you actually need VMs or if things like containers (which have much nicer tooling and a better performance profile) would fit your use-case.
> deployments
Depends on your application. But let's assume it can fit in a container - there's nothing wrong with a systemd service that just reads a container image reference in /etc/... and uses `docker run` to run it. Your deployment task can just SSH into the server, update that reference in /etc/ and bounce the service. Evaluate Kamal which is a slightly fancier version of the above. Need more? Explore cluster managers like Hashicorp Nomad or even Kubernetes.
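A sketch of that systemd-unit approach (the unit name, paths, and port are made up):

```ini
# /etc/systemd/system/myapp.service - hypothetical unit; the image
# reference lives in /etc/myapp/image, so a deploy just rewrites that
# file and restarts the unit
[Unit]
Description=myapp container
After=docker.service
Requires=docker.service

[Service]
ExecStartPre=-/usr/bin/docker rm -f myapp
ExecStart=/bin/sh -c '/usr/bin/docker run --name myapp --rm -p 8080:8080 "$(cat /etc/myapp/image)"'
Restart=always

[Install]
WantedBy=multi-user.target
```

A deploy from CI is then roughly `ssh host 'echo registry.example/myapp:v42 | sudo tee /etc/myapp/image && sudo systemctl restart myapp'`.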
> Network side of things like VNet
Wireguard tunnels set up (by your config management tool) between your machines, which will appear as standard network interfaces with their own (typically non-publicly-routable) IP addresses, and anything sent over them will transparently be encrypted.
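For example, a per-host config your tooling might template out (keys, hostnames, and addresses are placeholders):

```ini
# /etc/wireguard/wg0.conf
[Interface]
Address = 10.10.0.1/24            # private address on the mesh
PrivateKey = <this-host-private-key>
ListenPort = 51820

[Peer]
PublicKey = <peer-public-key>
AllowedIPs = 10.10.0.2/32         # route this peer's mesh address via the tunnel
Endpoint = peer.example.com:51820
```

`wg-quick up wg0` (or the `wg-quick@wg0` systemd unit) brings up the interface; anything addressed within 10.10.0.0/24 is then transparently encrypted.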
> DNS
Generally very little reason not to outsource this to a cloud provider or even your (reputable!) domain registrar; hosted offerings are super cheap, so go open an AWS account and configure Route53. DNS is mostly static data though, which means that if you do need to run it in-house for whatever reason, it's just a matter of getting a CoreDNS/etc container running on multiple machines (maybe even distributed across the world).
> securely opening ports
To begin with, you shouldn't have anything listening that you don't want to be accessible. Then it's not a matter of "opening" or closing ports - the only ports that actually listen are the ones you want open by definition because it's your application listening for outside traffic. But you can configure iptables/nftables as a second layer of defense, in case you accidentally start something that unexpectedly exposes some control socket you're not aware of.
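As a sketch of that second layer (the ports and ruleset are illustrative, not a recommendation):

```
# /etc/nftables.conf - default-deny inbound; only SSH and HTTPS are reachable
table inet filter {
  chain input {
    type filter hook input priority 0; policy drop;
    ct state established,related accept
    iif "lo" accept
    icmp type echo-request accept
    icmpv6 type { echo-request, nd-neighbor-solicit, nd-neighbor-advert, nd-router-advert } accept
    tcp dport { 22, 443 } accept
  }
}
```

With `policy drop`, a service that accidentally binds 0.0.0.0 is still unreachable from outside unless you explicitly open its port.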
> Monitoring setup across the stack
collectd running on each machine (deployed by your configuration management tool) sending metrics to a central machine. That machine runs Grafana/etc. You can also explore "modern" stuff that the cool kids play with nowadays like VictoriaMetrics, etc, but metrics is mostly a solved problem so there's nothing wrong with using old tools if they work and fit your needs.
For logs, configure rsyslogd to log to a central machine - on that one, you can have log rotation. Or look into an ELK stack. Or use a hosted service - again nothing prevents you from picking the best of cloud and bare-metal, it's not one or the other.
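The rsyslog side of that is a couple of lines (the address is a placeholder; `@@` means TCP, a single `@` would be UDP):

```
# On each machine, /etc/rsyslog.d/forward.conf: ship everything to the central
# host over the Wireguard network
*.* @@10.10.0.1:514

# On the central host, /etc/rsyslog.d/listen.conf: accept TCP syslog
module(load="imtcp")
input(type="imtcp" port="514")
```

Your config management tool drops the first file on every machine and the second on the log host.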
> safely expose an application externally
There's a lot of snake oil and fear-mongering around this. First off, you need to differentiate between vulnerabilities of your application and vulnerabilities of the underlying infrastructure/host system/etc.
App vulnerabilities, in your code or dependencies: cloud won't save you. It runs your application just like it's been told. If your app has an SQL injection vuln or one of your dependencies has an RCE, you're screwed either way. To manage this you'd do the same as you do in cloud - code reviews, pentesting, monitoring & keeping dependencies up to date, etc.
Infrastructure-level vulnerabilities: cloud providers are responsible for keeping the host OS and their provided services (load balancers, etc) up to date and secure. You can do the same. Some distros provide unattended updates, which your config management tool can enable. Stuff that doesn't need to be reachable from the internet shouldn't be (bind internal stuff to your Wireguard interfaces). Put admin stuff behind some strong auth - TLS client certificates are the gold standard but have management overheads. Otherwise, use an IdP-aware proxy (like mentioned above). Don't blindly trust app-level auth. Beyond that, it's the usual - common sense, monitoring for "spooky action at a distance", and luck. Not too different from your cloud provider, which won't compensate you either if they do get hacked.
> For more context, I worked at a very large hedge fund briefly which had a small DC worth of VERY beefy machines but absolutely no platform on top of it...
Nomad or Kubernetes.
I ended up switching to Protonmail because of privacy (Fastmail, being Australian, is within the Five Eyes), which is the only thing I really like about Protonmail. But I'm considering switching back to Fastmail, because I liked it so much.
To expand: At $dayjob we use AWS, and we have no plans to switch because we're tiny, like ~5000 DAU last I checked. Our AWS bill is <$600/mo. To get anything remotely resembling the reliability that AWS gives us we would need to spend tens of thousands up-front buying hardware, then something approximating our current AWS bill for colocation services. Or we could host fully on-prem, but then we're paying even more up-front for site-level stuff like backup generators and network multihoming.
Meanwhile, RDS (for example) has given us something like one unexplained 15-minute outage in the last six years.
Obviously every situation is unique, and what works for one won't work for another. We have no expectation of ever having to suddenly 10x our scale, for instance, because our growth is limited by other factors. But at our scale, given our business realities, I'm convinced that the cloud is the best option.
Very few non-cloud users are buying their own hardware. You can simply rent dedicated hardware in a datacenter. For significantly cheaper than anything in the cloud. That being said, certain things like object storage, if you don't need very large amounts of data, are very handy and inexpensive from cloud services considering the redundancy and uptime they offer.
I should note that Microsoft also does this.
Global permissions, seamless organization, and IaC. If you are Fastmail or a small startup - go buy some used Dell PowerEdge with EPYCs in a colo rack with 10GbE transit and save tons of money.
If you are a company with tons of customers and tons of requirements, it's powerful to put each concern into a landing zone, run some Bicep/Terraform, have a resource group to control costs, get savings on overall core count, and be done with it.
Assign permissions into a namespace for your employee or customer, have some back and forth about requirements, and it's done. No need to sysadmin across servers. No need to check for broken disks.
I'm also blaming the hell of VMware and virtual machines for everything that is a PITA to maintain as a sysadmin but is loved because it's common knowledge. I would only do k8s on bare metal today and skip the whole virtualization thing completely. I guess it's also these pains that are softened in the cloud.
I've even worked in companies where the engineering team spent effort and time on building "scalable infrastructure" before the product itself even found product-market fit...
find /path/to/subtree -type f | parallel -j250 rm --
rm -r /path/to/subtree
A friend had this issue on spinning disks the other day. I suggested he do this, and the remaining files were gone in seconds, when at the rate his naive rm was running, it should have taken minutes. It is a shame that rm does not implement a parallel unlink option internally (e.g. -j), which would be even faster, since it would eliminate the execve overhead and likely some directory lookup overhead too, versus using find and parallel to run many rm processes. For something like Fastmail that has many users, unlinking should be parallel already, so unlinking on ZFS will not be slow for them.
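If GNU parallel isn't available, `xargs -P` from findutils gets most of the same effect. A sketch against a throwaway directory (the path is made up for the demo):

```shell
# Hypothetical demo directory; in practice this would be the subtree to delete.
mkdir -p /tmp/rmdemo
seq 1 500 | xargs -I{} touch /tmp/rmdemo/f{}
# -print0 / -0 handle odd filenames; -P 8 runs up to 8 rm processes at once,
# -n 64 batches 64 paths per rm to amortize the execve overhead
find /tmp/rmdemo -type f -print0 | xargs -0 -n 64 -P 8 rm --
rmdir /tmp/rmdemo   # succeeds only if the directory is now empty
```

The batching (-n) and parallelism (-P) are tunable; on spinning disks the win comes from keeping multiple unlinks in flight rather than serializing them.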
By the way, that 80% figure has not been true for more than a decade. You are referring to the best fit allocator being used to minimize external fragmentation under low space conditions. The new figure is 96%. It is controlled by metaslab_df_free_pct in metaslab.c:
https://github.com/openzfs/zfs/blob/zfs-2.2.0/module/zfs/met...
Modification operations become slow when you are at/above 96% space filled, but that is to prevent even worse problems from happening. Note that my friend’s pool was below the 96% threshold when he was suffering from a slow rm -r. He just had a directory subtree with a large amount of directory entries he wanted to remove.
For what it is worth, I am the ryao listed here and I was around when the 80% to 96% change was made:
Deletion of files depends on how they have configured the message store - they may be storing a lot of data into a database, for example.
I think that 80% figure is from when drives were much smaller and finding free space over that threshold with the first-fit allocator was harder.
For now, the "file storage" product is a Node tree in mysql, with content stored in a content-addressed blob store, which is some custom crap I wrote 15 years ago that is still going strong because it's so simple there's not much to go wrong.
We do plan to eventually move the blob storage into Cyrus as well though, because then we have a single replication and backup system rather than needing separate logic to maintain the blob store.
Today, the cloud isn’t about other people’s hardware.
It’s about infrastructure being an API call away. Not just virtual machines but also databases, load-balancers, storage, and so on.
The cost isn’t the DC or the hardware, but the hours spent on operations.
And you can abuse developers to do operations on the side :-)
And then it is still also about other people's hardware in addition to that.
> Fastmail has some of the best uptime in the business, plus a comprehensive multi data center backup system. It starts with real-time replication to geographically dispersed data centers, with additional daily backups and checksummed copies of everything. Redundant mirrors allow us to failover a server or even entire rack in the case of hardware failure, keeping your mail running.
"A private inbox $60 for 12 months". I assume it is USD, not AU$ (AFAIK, Fastmail is based in Australia.) Still pricey.
At https://www.infomaniak.com/ I can buy email service for an (in my case external) domain for 18 Euro a year and I get 5 inboxes. And it is based in Switzerland, so no EU or US jurisdiction.
I have a few websites, and Fastmail would just be prohibitively expensive for me.
I've used them for 20 years now. Highly recommended.
Migadu is just all around good; the only downsides I can find are subjective: the fact that they're based in Switzerland, and that unless you're "good with computers", something like Fastmail will probably be better.
I'm paying something like $10 per year for multiple domains with multiple email addresses (though with little traffic). I've been using them for about 5 years and I had absolutely no issues.
The only things I wish FM had are all software:
1. A takeout-style API to let me grab a complete snapshot once a week with one call
2. The ability to be an IdP for Tailscale.
Custom compression code can introduce bugs that could kill Fastmail's reputation for reliability.
It's better to use a well-tested solution that costs a bit more.
Frankly, given that emails are normally ~4kB objects, I suspect the compression overheads are probably not worth it unless it's for attachments only. Not attacking ZFS (its compression and checksumming are among the best in class), but the compression would work better if it weren't limited to small files. ZFS has made a lot of wins here, and I've not had a problem with many files on ZFS thanks to the L1/L2 ARC, but the cost is that metadata ops can be painful on many small files.
The evidence that they are IOPS limited is that they went for SSD or better when they could store the same capacity on rust for much cheaper.
Yeah, I think moving the compression or file access up a layer, to abstract what is being written to disk a la Protonmail (I don't like their offerings, but like their tech), means you can have compression over 4MB rather than 4kB blocks, which matters when you recall data from disks for, I don't know... backups or search?
also remember RAID!=backups ;)
If you're looking at ZFS on NVMe you may want to look at Alan Jude's talk on the topic, "Scaling ZFS for the future", from the 2024 OpenZFS User and Developer Summit:
* https://www.youtube.com/watch?v=wA6hL4opG4I
* https://openzfs.org/wiki/OpenZFS_Developer_Summit_2024
There are some bottlenecks that get in the way of getting all the performance that the hardware often is capable of.
Also, who cares if a single filesystem dies, that's why you have inter-server replication. Nuke the bad server and rebuild before the next 3 or 4 die.
Plus it has protocol consistency sanity checks built in.
Plus, I wrote it :p
[1]https://hanselminutes.com/847/engineering-stack-overflow-wit...
https://stackoverflow.blog/2023/08/30/journey-to-the-cloud-p...
The why is the interesting part of this article.
"Although we’ve only ever used datacenter class SSDs and HDDs, failures and replacements every few weeks were a regular occurrence on the old fleet of servers. Over the last 3+ years, we’ve only seen a couple of SSD failures in total across the entire upgraded fleet of servers. This is easily less than one tenth the failure rate we used to have with HDDs."
other than that, i'm happy with fastmail.
Fastmail explicitly says that moving mail to/from a spam folder via a mail client does not automatically retrain. <https://www.fastmail.help/hc/en-us/articles/1500000278142-Im...> (I never did figure out if Gmail acts the same way or not.)
The cloud providers really kill you on IO for your VMs. Even if 'remote' SSDs are available with configurable ($$) IOPs/bandwidth limits, the size of your VM usually dictates a pitiful max IO/BW limit. In Azure, something like a 4-core 16GB RAM VM will be limited to 150MB/s across all attached disks. For most hosting tasks, you're going to hit that limit far before you max out '4 cores' of a modern CPU or 16GB of RAM.
On the other hand, if you buy a server from Dell and run your own hypervisor, you get a massive reserve of IO, especially with modern SSDs. Sure, you have to share it between your VMs, but you own all of the IO of the hardware, not some pathetic slice of it like in the cloud.
As is always said in these discussions, unless you're able to move your workload to PaaS offerings in the cloud (serverless), you're not taking advantage of what large public clouds are good at.
(my experience with managed kubernetes)
1. The cost of the server is not the cost of on-prem. There are so many different kinds of costs that aren't just monetary. ("we have to do more ourselves, including planning, choosing, buying, installing, etc,") Those are tasks that require expertise (which 99% of "engineers" do not possess at more than a junior level), and time, and staff, and correct execution. They are much more expensive than you will ever imagine. Doing any of them wrong will cause issues that will eventually cost you business (customers fleeing, avoiding). That's much worse than a line-item cost.
2. You have to develop relationships for good on-prem. In order to get good service in your rack (assuming you don't hire your own cage monkey), in order to get good repair people for your hardware service accounts, in order to ensure when you order a server that it'll actually arrive, in order to ensure the DC won't fuck up the power or cooling or network, etc. This is not something you can just read reviews on. You have to actually physically and over time develop these relationships, or you will suffer.
3. What kind of load you have and how you maintain your gear is what makes a difference between being able to use one server for 10 years, and needing to buy 1 server every year. For some use cases it makes sense, for some it really doesn't.
4. Look at all the complex details mentioned in this article. These people go deep, building loads of technical expertise at the OS level, hardware level, and DC level. It takes a long time to build that expertise, and you usually cannot just hire for it, because it's generally hard to find. This company is very unique (hell, their stack is based on Perl). Your company won't be that unique, and you won't have their expertise.
5. If you hire someone who actually knows the cloud really well, and they build out your cloud env based on published well-architected standards, you gain not only the benefits of rock-solid hardware management, but benefits in security, reliability, software updates, automation, and tons of unique features like added replication, consistency, availability. You get a lot more for your money than just "managed hardware", things that you literally could never do yourself without 100 million dollars and five years, but you only pay a few bucks for it. The value in the cloud is insane.
6. Everyone does cloud costs wrong the first time. If you hire somebody who does have cloud expertise (who hopefully did the well-architected buildout above), they can save you 75% off your bill, by default, with nothing more complex than checking a box and paying some money up front (the same way you would for your on-prem server fleet). Or they can use spot instances, or serverless. If you choose software developers who care about efficiency, they too can help you save money by not needing to over-allocate resources, and right-sizing existing ones. (Remember: you'd be doing this cost and resource optimization already with on-prem to make sure you don't waste those servers you bought, and that you know how many to buy and when)
7. The major takeaway at the end of the article is "when you have the experience and the knowledge". If you don't, then attempting on-prem can end calamitously. I have seen it several times. In fact, just one week ago, a business I work for had three days of downtime, due to hardware failing, and not being able to recover it, their backup hardware failing, and there being no way to get new gear in quickly. Another business I worked for literally hired and fired four separate teams to build an on-prem OpenStack cluster, and it was the most unstable, terrible computing platform I've used, that constantly caused service outages for a large-scale distributed system.
If you're not 100% positive you have the expertise, just don't do it.
I've seen similarly unstable cloud systems. It's generally not the tool's fault, it's the skill of the wielder.
good on them, understanding infrastructure and cost/benefit is essential in any business you hope to run for the long haul
The heading in this space makes you think they're running custom FPGAs, such as with Gmail, not just running on metal... As for drive failures, welcome to storage at scale. Build your solution so it's a weekly task to replace 10 disks at a time, not critical at 2am when a single disk dies...
Storing/Accessing tonnes of <4kB files is difficult, but other providers are doing this on their own metal with CEPH at the PB scale.
I love ZFS, it's great with per-disk redundancy but CEPH is really the only game in town for inter-rack/DC resilience which I would hope my email provider has.