Rob (the author of the linked article) joined a few months later, and when we got too big for our Rackspace server, we looked at the cost of buying something and doing colo instead. The biggest challenge was trying to convince a vendor to let me use my Australian credit card but ship the server to a US address (we decided to use NYI for colo, based in NY). It turned out that IBM were able to do that, so they got our business. Both IBM and NYI were great for handling remote hands and hardware issues, which obviously we couldn't do from Australia.
A little bit later Bron joined us, and he automated absolutely everything, so that we were able to just have NYI plug in a new machine and it would set itself up from scratch. This all just used regular Linux capabilities and simple open source tools, plus of course a whole lot of Perl.
As the fortunes of AWS et al rose and rose and rose, I kept looking at their pricing and features and kept wondering what I was missing. They seemed orders of magnitude more expensive for something that was more complex to manage and would have locked us into a specific vendor's tooling. But everyone seemed to be flocking to them.
To this day I still use bare metal servers for pretty much everything, and still love having the ability to use simple universally-applicable tools like plain Linux, Bash, Perl, Python, and SSH, to handle everything cheaply and reliably.
I've been doing some planning over the last couple of years on teaching a course on how to do all this, although I was worried that folks are too locked in to SaaS stuff -- but perhaps things are changing and there might be interest in that after all?...
In 2006, when the first AWS instances showed up, it would take you two years of on-demand bills to match the cost of buying the hardware from a retail store and using it continuously.
Today it's anywhere from two weeks for ML workloads to three months for mid-sized instances.
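The break-even arithmetic behind those figures is simple to sketch. Here is a rough model, where every price is an illustrative assumption rather than a real quote:

```python
# Rough break-even: months of 24/7 on-demand billing that equal the purchase
# price of the hardware. All prices below are illustrative assumptions.

def breakeven_months(hardware_cost: float, hourly_rate: float) -> float:
    """Hardware price divided by a month of round-the-clock on-demand billing."""
    monthly_bill = hourly_rate * 24 * 30
    return hardware_cost / monthly_bill

# Hypothetical GPU box vs. hypothetical on-demand GPU instance rate:
print(f"ML: {breakeven_months(10_000, 30.0):.2f} months")   # ~0.46 months, about 2 weeks
# Hypothetical mid-sized server vs. hypothetical mid-sized instance rate:
print(f"mid: {breakeven_months(2_000, 0.90):.2f} months")   # ~3.1 months
```

The exact crossover obviously depends on utilization; the closer to 24/7 your workload runs, the faster owned hardware wins.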
AWS made sense in big corps, where it would take you six months to get approval to buy the hardware and another six for the software. Today I'd only use it for a prototype, and I'd move it on-prem the second it looked like it would make it past one quarter.
You are not the only one. There are several factors at play, but I believe one of the strongest today is the generational divide: people have lost the ability to manage their own infra, or don't know it well enough to do it well, so it's true when they say "It's too much hassle". I say this as an AWS guy who occasionally works on on-prem infra.[0]
[0] As a side note, I don't believe the lack of skills is the main reason organizations have problems - skills can be learned, but if you mess up the initial architecture design, fixing that can easily take years.
IDK. More and more I see the argument of “I don’t know, and we are not experts in xxx” winning as a reason why we should just spend money on 3rd-party services and products.
I have seen people getting paid 700k-plus a year spend their entire stay at companies writing papers about how they can’t do something, arguing that the obvious solution is to spend 400k-plus to have some 3rd party handle it - and getting the budget.
Let’s not get into what the conversation looks like when somebody points out that we might have an issue if we are paying somebody 700k to hire somebody else temporarily for 400k each year, and that we should find those folks who can do it for 400k and just hire them.
All this to say that being a SWE in many companies today requires no ability to create software that solves business problems, but rather to act as a sort of quasi-sysadmin-manager who will maybe write a handful of DSL scripts over the course of their career.
Your FastMail use case of (relatively) predictable server workload and product roadmap combined with agile Linux admins who are motivated to use close-to-bare-metal tools isn't an optimal cost fit for AWS. You're not missing anything and FastMail would have been overpaying for cloud.
Where AWS/GCP/Azure shine is organizations that need higher-level PaaS like managed DynamoDB, RedShift, SQS, etc that run on top of bare metal. Most non-tech companies with internal IT departments cannot create/operate "internal cloud services" that are on par with AWS.[1] Some companies like Facebook and Walmart can run internal IT departments with advanced capabilities like AWS, but most non-tech companies can't. This means paying AWS' fat profit margins can actually be cheaper than paying internal IT salaries to "reinvent AWS badly" by installing MySQL, Kafka, etc on bare-metal Linux. E.g. Netflix had their own datacenters in 2008, but a 3-day database outage that stopped them from shipping DVDs was one of the reasons they quit running their datacenters and migrated to AWS.[2] Their complex workload isn't a good fit for bare-metal Linux and bash scripts; Netflix uses a ton of high-level managed PaaS services from AWS.
If bare metal is the layer of abstraction the IT & dev departments are comfortable working at, then self-host on-premise, or co-lo, or Hetzner are all cheaper than AWS.
[1] https://web.archive.org/web/20160319022029/https://www.compu...
[2] https://media.netflix.com/en/company-blog/completing-the-net...
That said, most organizations are not nearly so agile as they'd like to believe and would probably be better off paying for something inflexible and cheap.
For some people the cloud is straight magic, but for many of us, it just represents work we don't have to do. Let "the cloud" manage the hardware and you can deliver a SaaS product with all the nines you could ask for...
> teaching a course on how to do all this ... there might be interest in that after all?
Idk about a course, but I'd be interested in a blog post or something that addresses the pain points that I conveniently outsource to AWS. We have to maintain SOC 2 compliance, and there's a good chunk of stuff in those compliance requirements around physical security and datacenter hygiene that I get to just point at AWS for.
I've run physical servers for production resources in the past, but they weren't exactly locked up in Fort Knox.
I would find some in-depth details on these aspects interesting, but from a less-clinical viewpoint than the ones presented in the cloud vendors' SOC reports.
Of course, their SOC 2 compliance doesn't mean we are absolved of securing our databases and services.
There's a big gap between throwing some compute in a closet and having someone “run the closet” for you.
There is a significantly larger gap, though, between having someone “run the closet” and building your own datacenter from scratch.
We had some old Compaq (?) servers, most of the newer stuff was Dell. Mix of windows and Linux servers.
Even with the Dell boxes, things weren't really standard across different server generations, and every upgrade was bespoke, except when we bought multiple boxes for redundancy/scaling of a particular service.
What I'd like to see is something like Oxide Computer servers that scale way down, at least to a quarter rack. Like Supermicro meets the Backblaze storage pod, but riffing on Joyent's idea of colocating storage and compute. A sort of composable mainframe for small businesses in the 2020s.
I guess maybe that is part of what Triton is all about.
But anyway - somewhere to start, and grow into the future with sensible redundancies and open source bios/firmware/etc.
Not the typical situation today, where you buy two (for redundancy) "big enough" boxes - and then need to reinvent your setup/deployment when you need two bigger boxes in three years.
Flexibility.
When Netflix wanted to start operating in Europe, we didn't have to negotiate datacenter space, order a bunch of servers, wait for racking and stacking, and all those other things. We just made an API call and had an entire stack built in Europe.
Same thing when we expanded to Asia.
It also saved us a ton of money, because our workload was about 3x peak to trough each day. We would scale up for peak, and scale down for trough.
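That 3x peak-to-trough cycle is exactly where elastic scaling earns its keep: static provisioning has to pay for peak capacity around the clock. A toy model (the diurnal load curve here is invented purely for illustration):

```python
import math

# Toy diurnal load: oscillates between a trough of 1 unit and a peak of 3 units,
# mirroring a ~3x peak-to-trough daily cycle. Purely illustrative numbers.
def load(hour: float) -> float:
    return 2.0 + math.cos(2 * math.pi * hour / 24)  # trough 1.0, peak 3.0

hours = range(24)
static_capacity = max(load(h) for h in hours) * 24   # provision for peak, all day
elastic_capacity = sum(load(h) for h in hours)       # pay only for what you use

print(f"static:  {static_capacity:.1f} unit-hours")   # 72.0
print(f"elastic: {elastic_capacity:.1f} unit-hours")  # 48.0
print(f"savings: {1 - elastic_capacity / static_capacity:.0%}")  # 33%
```

The sharper the peaks and the deeper the troughs, the bigger that savings figure gets; flat 24/7 workloads get nothing from elasticity.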
We used on-prem for the parts where that made sense -- serving the actual video bits. Those were done on custom servers with a very stripped down FreeBSD optimized just for serving video (so optimized that we still used Akamai for images). But the parts of the business that needed flexibility (control plane and interface) were all in AWS.
Why would a startup use the cloud? Both flexibility and ease. There aren't a lot of experts around who can configure a Linux box from scratch anymore. And even if you can, you can't go from coded-up idea to production in five minutes like you can with the cloud. It would take you at least a few hours to set up the bare metal the first time.
Like OVH, Hetzner or Hivelocity?
Because you can get some insane servers for like $300/month (e.g. a brand new 5th-gen Epyc 48-core, 0.5TB RAM, lots of NVMe) that are globally available.
For businesses with <10 servers and half an IT person, the cost difference is practically irrelevant. EC2+EBS+snapshots is a magic bullet abstraction for most scenarios. Bare metal is nice until parts of it start to fail on you.
I can teach someone from accounting how to restore the entire VM farm in an afternoon using the AWS web console. I've never seen an on prem setup where a similar feat is possible. There's always some weird arcane exceptions due to economic compromises that Amazon was not forced to make. When you can afford to build a fleet of data centers, you can provide a degree of standardization in product offering that is extraordinarily hard to beat. If your main goal is to chase customers and build products for them, this kind of stuff goes a long way.
Long term you should always seek total autonomy over your information technology, but you should be careful to not let that goal ruin the principal business that underlies everything.
If your infrastructure consists of ten t2.micro instances vs ten Raspberry Pis, then sure. In any other case, migrating VM or bare metal workloads from your own hardware straight onto EC2 is one of the most effective ways in the world to incinerate money.
You can do well if you've got a workload well suited to 'native' PaaS services like S3 and Lambda, but EC2 costs a fortune.
My impression is the standard compute (as in CPUs+RAM) isn't expensive, it's the storage (1 PB is less than half a rack physically now, comparing with the yearly prices listed), and so if you don't have much data, the value of on-prem isn't there.
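The half-a-rack figure for a petabyte is easy to sanity-check with current drive densities. The drive size, chassis layout, and replication factor below are assumptions, not anyone's actual build:

```python
# Sanity check: rack units needed for 1 PB usable, assuming (illustratively)
# 20 TB drives in 24-bay 2U chassis, with 3x replication for durability.
usable_tb = 1000          # 1 PB
drive_tb = 20
bays_per_2u = 24
replication = 3

raw_needed_tb = usable_tb * replication
drives = -(-raw_needed_tb // drive_tb)    # ceiling division
chassis = -(-drives // bays_per_2u)
rack_units = chassis * 2
print(drives, chassis, rack_units)        # 150 7 14
```

14U is well under half of a standard 42U rack, even before erasure coding (which would need far less than 3x raw capacity).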
To the point that we have young devs today who don't know what VPS and colo (colocation) mean.
Back to the article: I am surprised it was only "a few years ago" that Fastmail adopted SSDs, which certainly seems late in the cycle given the benefits SSDs offer.
Price for colo is on the order of $3000/2U/year. That is $125/U/month.
90% of emails are never read, and 9% are read once. What could SSDs offer for this use case except at least 2x the cost?
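Access patterns like that argue for tiering rather than going all-SSD: hot mail on flash, cold bulk on disk. A back-of-the-envelope cost model, where the per-TB prices and the 10% hot fraction are assumptions for illustration:

```python
# Tiered email storage cost sketch. Illustrative prices, not quotes:
# SSD at $80/TB, HDD at $15/TB, a 100 TB corpus, ~10% of mail ever read.
total_tb = 100
ssd_per_tb, hdd_per_tb = 80, 15
hot_fraction = 0.10   # the ~10% of mail that is actually read

all_hdd = total_tb * hdd_per_tb
all_ssd = total_tb * ssd_per_tb
tiered = (total_tb * hot_fraction * ssd_per_tb
          + total_tb * (1 - hot_fraction) * hdd_per_tb)

print(all_hdd, all_ssd, tiered)   # 1500 8000 2150.0
```

Under these assumed prices, tiering costs ~1.4x all-HDD instead of ~5.3x, while still serving reads from flash.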
I can get an entire rack at Equinix for ~$1200/mo with an unlimited 10G internet connection.
Yes! It's surprisingly common to hear it can't work, or can't scale or run reliably, when all that is done. Talking about how you've done it is great from that perspective.
Also, it's worth talking about what you gain, qualitatively! As this post mentions, your high-performance storage options are far better outside the cloud. People often mention egress, too. The appealing idea to me is using your extra flexibility to deploy better stuff, not saving a bit of cost.
Compare that to AWS, where there are 6 different kinds of remote hands, that work on all hardware and OSes, with no need for expertise, no time taken. No planning, no purchases, no shipment time, no waiting for remote hands to set it up, no diagnosing failures, etc, etc, etc...
That's just one thing. There's a thousand more things, just for a plain old VM. And the cloud provides way more than VMs.
The number of failures you can have on-prem is insane. Hardware can fail for all kinds of reasons (you must know this), and you have to have hot backup/spares, because otherwise you'll find out your spares don't work. Getting new gear in can take weeks (it "shouldn't" take that long, but there's little things like pandemics and global shortages on chips and disks that you can't predict). Power and cooling can go out. There's so many things that can (and eventually will) go wrong.
Why expose your business to that much risk, and have to build that much expertise? To save a few bucks on a server?
Prioritise simplicity.
For remote hands, 2 kinds is sufficient: IP KVM, and an actual person walking over to your machine. Can't say I've had an AWS person talk to me on a cell phone whilst standing at my server to help me sort out an issue.
It's actually really fun, and saving 90% of what can be your largest cost can actually be a fundamental driver of startup success. You can undercut the competition on price and offer stuff that's just not available otherwise.
Every time this conversation has come up online over the last few decades, there are always a few people who parrot this claim that it's all too hard. I can't imagine these comments come from people who have actually gone and done it.
Complex cloud infra can also fail for all kinds of reasons, and they are often harder to troubleshoot than a hardware failure. My experience with server grade hardware in a reliable colo with a good uplink is it's generally an extremely reliable combination.
Cloud vendors are not immune from hardware failure. What do you think their underlying infrastructure runs on, some magical contraption made from Lego bricks, Swiss chocolate, and positive vibes?
It's the same hardware, prone to the same failures. You've just outsourced worrying about it.
But, it comes at a cost. And that cost is significant. Like magnitudes significant.
At what point does it become cheaper to hire an infra engineer? Let's see.
In the US a good infra engineer might cost you $150K/yr all in. That's not taking into account freelancers/contractors who can do it for less.
That's ~$12K/mo.
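To sanity-check that number, and to see where hiring starts to pay for itself: a quick model, assuming (purely for illustration) that the cloud bill runs about 3x the equivalent colo cost:

```python
# When does an infra engineer's salary pay for itself in avoided cloud markup?
salary_yearly = 150_000
monthly_salary = salary_yearly / 12          # 12500.0, i.e. ~$12.5K/mo

# Assume, for illustration only, the cloud bill is 3x the equivalent colo cost.
markup = 3.0

def breakeven_cloud_bill(monthly_salary: float, markup: float) -> float:
    """Monthly cloud spend above which hiring + colo beats staying on cloud."""
    # cheaper when: cloud_bill > colo_cost + salary = cloud_bill/markup + salary
    return monthly_salary * markup / (markup - 1)

print(round(breakeven_cloud_bill(monthly_salary, markup)))   # 18750
```

Under that assumed markup, any shop spending more than ~$19K/mo on cloud would come out ahead hiring the engineer and colocating. Real markups and salaries vary, but the structure of the calculation holds.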
That's a lot of compute on AWS...but that's not the end of the story. Ever try getting data OUT of AWS? Yeah, those egress costs are not chump change. But that's not even the end of it.
The more important question is, what's the ratio of hosting/cloud costs to overall revenue? If colo/owned DC will yield better financials over ~few quarters, you'd be bananas as a CTO to recommend the cloud.
How do the availability/fault tolerance compare? If one of your geographical locations gets knocked out (fire, flood, network cutoff, war, whatever) what will the user experience look like, vs. what can cloud providers provide?
Now I'm wondering how much you'd look like tiangolo if you wore a moustache.
All the pro-cloud talking points are just that - talking points that don't persuade anyone with any real technical understanding, but serve to introduce doubt to non-technical people and to trick people who don't examine what they're told.
What's particularly fascinating to me, though, is how some people are so pro-cloud that they'd argue with a writeup like this with silly cloud talking points. They don't seem to care much about data or facts, just that they love cloud and want everyone else to be in cloud, too. This happens much more often on sites like Reddit (r/sysadmin, even), but I wouldn't be surprised to see a little of it here.
It makes me wonder: how do people get so sold on a thing that they'll go online and fight about it, even when they lack facts or often even basic understanding?
I can clearly state why I advocate for avoiding cloud: cost, privacy, security, a desire to not centralize the Internet. The reason people advocate for cloud for others? It puzzles me. "You'll save money," "you can't secure your own machines," "it's simpler" all have worlds of assumptions that those people can't possibly know are correct.
So when I read something like this from Fastmail which was written without taking an emotional stance, I respect it. If I didn't already self-host email, I'd consider using Fastmail.
There used to be so much push for cloud everything that an article like this would get fanatical responses. I hope that it's a sign of progress that that fanaticism is waning and people aren't afraid to openly discuss how cloud isn't right for many things.
This is false. AWS infrastructure is vastly more secure than almost all company data centers. AWS has a rule that the same person cannot have logical access and physical access to the same storage device. Very few companies have enough IT people to have this rule. The AWS KMS is vastly more secure than what almost all companies are doing. The AWS network is vastly better designed and operated than almost all corporate networks. AWS S3 is more reliable and scalable than anything almost any company could create on their own. To create something even close to it you would need to implement something like MinIO using 3 separate data centers.
Secure in what terms? Security is always about a threat model and trade-offs. There's no absolute, objective term of "security".
> AWS has a rule that the same person cannot have logical access and physical access to the same storage device.
Any promises they make aren't worth anything unless there's contractually-stipulated damages that AWS should pay in case of breach, those damages actually corresponding to the costs of said breach for the customer, and a history of actually paying out said damages without shenanigans. They've already got a track record of lying on their status pages, so it doesn't bode well.
But I'm actually wondering what this specific rule even tries to defend against? You presumably care about data protection, so logical access is what matters. Physical access seems completely irrelevant no?
> Very few companies have enough IT people to have this rule
Maybe, but that doesn't actually mitigate anything from the company's perspective? The company itself would still be in the same position, aka not enough people to reliably separate responsibilities. Just that instead of those responsibilities being physical, they now happen inside the AWS console.
> The AWS KMS is vastly more secure than what almost all companies are doing.
See first point about security. Secure against what - what's the threat model you're trying to protect against by using KMS?
But I'm not necessarily denying that (at least some) AWS services are very good. Question is, is that "goodness" required for your use-case, is it enough to overcome its associated downsides, and is the overall cost worth it?
A pragmatic approach would be to evaluate every component on its merits and fitness to the problem at hand instead of going all in, one way or another.
1. big clouds are very lucrative targets for spooks, your data seem pretty likely to be hoovered up as "bycatch" (or maybe main catch depending on your luck) by various agencies and then traded around as currency
2. you never hear about security problems (incidents or exposure) in the platforms, there's no transparency
3. better than most coporate stuff is a low bar
so let's not fight the battle that will never be won. there is no point in convincing pro-cloud people that cloud isn't the right choice and vice-versa. let people share stories where it made sense and where it didn't.
as someone who has lived in the cloud security space since 2009 (and was a founder of redlock - one of the first CSPMs), in my opinion there is no doubt that AWS is better designed than most corp networks - but is that what you really need? if you run your entire corp and LOB apps on aws but have poor security practices, will it be the right decision? what if you have the best security engineers in the world, but they are best at Cisco-type security - configuring VLANs and managing endpoints - and not good at detecting someone using IMDSv1 in ec2 exposed to the internet and running an app vulnerable to csrf?
when the scope of discussion is as vast as cloud vs on-prem, imo, it is a bad idea to make absolute statements.
having the most secure data center doesn't matter if you load your secrets as env vars in a system that can be easily compromised by a motivated attacker
so i don't buy this argument as a general reason pro-cloud
It’s like putting something in someone’s desk drawer under the guise of convenience at the expense of security.
Why?
Too often, someone other than the data owner has or can get access to the drawer directly or indirectly.
Also, Cloud vs self hosted to me is a pendulum that has swung back and forth for a number of reasons.
The benefits of the cloud outlined here are often a lot of open source tech packaged up and sold as manageable from a web browser, or a command line.
One of the major reasons the cloud became popular was networking issues in Linux when managing volume at scale. At the time, the cloud became very attractive for that reason, plus being able to virtualize bare-metal servers into any combination of local and cloud hosting.
Self-hosting has become easier by an order of magnitude or two for anyone who knew how to do it, though it's not something people who haven't done both self-hosting and cloud can really weigh in on.
Cloud has abstracted away the cost of horsepower, and converted it to transactions. People are discovering a fraction of the horsepower is needed to service their workloads than they thought.
At some point the horsepower got way beyond what they needed and it wasn’t noticed. But paying for a cloud is convenient and standardized.
Company data centres can be reasonably secured using a number of PaaS or IaaS solutions readily available off the shelf. Tools from VMware, Proxmox and others are tremendous.
It may seem like there’s a lot to learn, but most problems that are new to someone have already been thought through extensively by people whose experience goes beyond cloud-only.
The biggest problem the cloud solves is hardware supply chain management. To realize the full benefits of doing your own build at any kind of non-trivial scale you will need to become an expert in designing, sourcing, and assembling your hardware. Getting hardware delivered when and where you need it is not entirely trivial -- components are delayed, bigger customers are given priority allocation, etc. The technical parts are relatively straightforward; managing hardware vendors, logistics, and delivery dates on an ongoing basis is a giant time suck. When you use the cloud, you are outsourcing this part of the work.
If you do this well and correctly then yes, you will reduce costs several-fold. But most people that build their own data infrastructure do a half-ass job of it because they (understandably) don't want to be bothered with any of these details and much of the nominal cost savings evaporate.
Very few companies do security as well as the major cloud vendors. This isn't even arguable.
On the other hand, you will need roughly the same number of people for operations support whether it is private data infrastructure or the cloud, there is little or no savings to be had here. The fixed operations people overhead scales to such a huge number of servers that it is inconsequential as a practical matter.
It also depends on your workload. The types of workloads that benefit most from private data infrastructure are large-scale data-intensive workloads. If your day-to-day is slinging tens or hundreds of PB of data for analytics, the economics of private data infrastructure are extremely compelling.
You can rent servers and it's still not cloud.
I'm pretty neutral and definitely see the value of cloud. But a lot of cloud proponents seem to lack what, to me, seems like basic knowledge.
Isn't the job to be bothered with the details? 90% of employment for most people is doing shit you don't really want to be doing, but that's the job.
And IAM and other cloud security and management considerations are where the opex/capex and capability argument can start to break down. It turns out the "cloud" savings come from not having the capabilities in-house to manage hardware. For most businesses, though, you sometimes want some of that lovely reliability.
(In short, I agree with you, substantially).
Like code. It is easy to get something basic up, but substantially more resources are needed for non-trivial things.
I self-host a lot of things, but boy oh boy if I were running a company it would be a helluvalotta work to get IAM properly set up.
Something people neglect to mention when they tout their home grown cloud is that AWS spends significant cycles constantly eliminating technical debt that would absolutely destroy most companies - even ones with billion dollar services of their own. The things you rely on are constantly evolving and changing. It’s hard enough to keep up at the high level of a SaaS built on top of someone else’s bulletproof cloud. But imagine also having to keep up with the low level stuff like networking and storage tech?
No thanks.
With the cloud you have IT/DevOps deal only with scaling the software components of the infra. When doing on-prem they take on the physical layer as well. Do you have enough trust in them to scale the physical part where needed?
This is a very engineer-centric take. The cloud has some big advantages that are entirely non-technical:
- You don't need to pay for hardware upfront. This is critical for many early-stage startups, who have no real ability to predict CapEx until they find product/market fit.
- You have someone else to point the SOC2/HIPAA/etc auditors at. For anyone launching a company in a regulated space, being able to checkbox your entire infrastructure based on AWS/Azure/etc existing certifications is huge.
I would assume you still need to point auditors to your software in any case
This is worth astronomical amounts of money in big corps.
But once those are set up, how is it different? AWS is quite clear with their responsibility model that you still have to tune your DB, for example. And for the setup, just as there are Terraform modules to do everything under the sun, there are Ansible (or Chef, or Salt…) playbooks to do the same. For both, you _should_ know what all of the options are doing.
The only way I see this sentiment being true is that a dev team, with no infrastructure experience, can more easily spin up a lot of infra – likely in a sub-optimal fashion – to run their application. When it inevitably breaks, they can then throw money at the problem via vertical scaling, rather than addressing the root cause.
(To be fair, I can see why they did it - a lot of deployments were an absolute mess before.)
What do you mean, I can't scale up because I've used my hardware capex budget for the year?
Number one is company bureaucracy and politics. No one wants to beg another person or department, go on endless meetings just to have extra hardware provisioned. For engineers that alone is worth perhaps 99% of all current cloud margins.
Number two is also company bureaucracy and politics. CFOs don't like CapEx. Turning it into OpEx makes things easier for them, along with end-of-year company budget turning into cloud credits for different departments. Especially for companies with government funding.
Number three is really company bureaucracy and politics. Dealing with Google, AWS, or Microsoft means you no longer have to deal with dozens of different vendors for servers, networking hardware, software licenses, etc. Instead it is all pre-approved into AWS, GCP, or Azure. This is especially useful for things that involve government contracts or funding.
There are also things like instant worldwide deployment. You can have things up and running in any region within seconds. It is also useful when you have a site that gets 10 to 1000x its normal traffic from time to time.
But then a lot of small businesses don't have these sorts of issues. Especially non-consumer-facing services: business or SaaS products are highly unlikely to gain 10x more customers within a short period of time.
I continue to wish there were a middle ground somewhere: you rent a dedicated server for cheap as base load and use cloud for everything else.
The discussion matters when we are talking about building things: whether you self-host or use managed services is a set of interesting trade-offs.
To be fair, you did say “my tune might change past a certain size.” At small scale, nothing you do within reason really matters. World’s worst schema, but your DB is only seeing 100 QPS? Yeah, it doesn’t care.
PaaS is probably the way to go for small apps.
AWS, on the other hand, seems about as time consuming and hard as using root servers. You're at a higher level of abstraction, but the complexity is about the same I'd say. At least that's my experience.
You're not monitoring your deployments because "cloud"?
And moreover most of the actual interesting things, like having VM templates and stateless containers, orchestration, etc. is very easy to run yourself and gets you 99.9% of the benefits of the cloud.
Just about any service is available as a container file already written for you. And if one doesn't exist, it's not hard to plumb up.
A friend of mine runs more than 700 containers (yup, seven hundred), split between his own rack at home (half of them) and dedicated servers (he runs stuff like FlightRadar, AI models, etc.). He'll soon get his own IP address space. It's a complete "chaos monkey"-ready infra where you can cut any cable and the thing keeps working: everything is duplicated, can be spun up on demand, etc. Someone could steal his entire rack and all his dedicated servers, and he'd still be back operational in no time.
If an individual can do that, a company, no matter its size, can do it too. And arguably 99.9% of companies out there don't need an infra as powerful as the one most homelab enthusiasts have.
And another thing: there are even two in-betweens between "cloud" and "our own hardware located at our company". The first is colocating your own hardware in a datacenter. The second is renting dedicated servers from a datacenter.
They're often ready to accept cloud-init directly.
And it's not hard. I'd say learning to configure hypervisors on bare metal, then spin up VMs from templates, then run containers inside the VMs is actually much easier than learning all the idiosyncrasies of all the different cloud vendors' APIs and whatnot.
Funnily enough when the pendulum swung way too far on the "cloud all the things" side, those saying at some point we'd read story about repatriation were being made fun of.
Fully agreed. I don't have physical HA – if someone stole my rack, I would be SOL – but I can easily ride out a power outage for as long as I want to be hauling cans of gasoline to my house. The rack's UPS can keep it up at full load for at least 30 minutes, and I can get my generator running and hooked up in under 10. I've done it multiple times. I can lose a single server without issue. My only SPOF is internet, and that's only by choice, since I can get both AT&T and Spectrum here, and my router supports dual-WAN with auto-failover.
> And arguably 99.9% of all the companies out there don't have the need for an infra as powerful as the one most homelab enthusiast have.
THIS. So many people have no idea how tremendously fast computers are, and how much of an impact latency has on speed. I've benchmarked my 12-year old Dells against the newest and shiniest RDS and Aurora instances on both MySQL and Postgres, and the only ones that kept up were the ones with local NVMe disks. Mine don't even technically have _local_ disks; they're NVMe via Ceph over Infiniband.
Does that scale? Of course not; as soon as you want geo-redundant, consistent writes, you _will_ have additional latency. But most smaller and medium companies don't _need_ that.
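The latency point compounds quickly because applications tend to issue queries sequentially. A quick model of how per-query round-trip time multiplies over a request (the RTTs and query count below are assumed for illustration, not measured):

```python
# How per-query round-trip latency compounds over a single request.
# RTT figures are illustrative assumptions, not benchmarks.
queries_per_request = 50        # sequential queries a chatty ORM might issue
local_rtt_ms = 0.25             # same-rack, NVMe-backed database
cloud_rtt_ms = 1.5              # managed database a few network hops away

local_total = queries_per_request * local_rtt_ms
cloud_total = queries_per_request * cloud_rtt_ms
print(f"local: {local_total} ms, cloud: {cloud_total} ms")
# local: 12.5 ms, cloud: 75.0 ms
```

Same hardware-class database, but the extra network hops alone turn a barely-noticeable page into a sluggish one, before any query even does real work.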
In particular, there is a limit to paying for competence, and paying more money doesn't automatically get you more competence, which is especially perilous if your organization lacks the competence to judge competence. In the limit case, this gets you the Big N consultancies like PWC or EY. It's entirely reasonable to hire PWC or EY to run your accounting or compliance. Hiring PWC or EY to run your software development lifecycle is almost guaranteed doom, and there is no shortage of stories on this site to support that.
In comparison, if you're one of these organizations that doesn't yet have baseline competence in technology, then what the public cloud is selling is nothing short of magical: you pay money and, in return, you receive a baseline set of tools, which all do more or less what they say they will do. If no amount of money would let you bootstrap this competence internally, you'd be much more willing to pay a premium for it.
As an anecdote, my much younger self worked in mid-sized tech team in a large household brand in a legacy industry. We were building out a web product that, for product reasons, had surprisingly high uptime and scalability requirements, relative to legacy industry standards. We leaned heavily on public cloud and CDNs. We used a lot of S3 and SQS, which allowed us to build systems with strong reliability characteristics, despite none of us having that background at the time.
I think there are accounting reasons for companies to prefer paying opex to run things on the cloud instead of more capex-intensive self-hosting, but I don’t understand the dynamics well.
It’s certainly the case that clouds tend to be more expensive than self-hosting, even when taking account of the discounts that moderately sized customers can get, and some of the promises around elastic scaling don’t really apply when you are bigger.
To some of your other points: the main customers of companies like AWS are businesses. Businesses generally don’t care about the centralisation of the internet. Businesses are capable of reading the contracts they are signing and not signing them if privacy (or, typically more relevant to businesses, their IP) cannot be sufficiently protected. It’s not really clear to me that using a cloud is going to be less secure than doing things on-prem.
This is where you lose all credibility.
I'm going to focus on a single aspect: performance. If you're serving a global user base and your business, like practically all online businesses, is greatly impacted by performance problems, the only solution to a physics problem is to deploy your application closer to your users.
With any cloud provider that's done with a few clicks and an invoice of a few hundred bucks a month. If you're running your own hardware... what solution do you have to show for it? Do you hope to create a corporate structure to rent a place to host your hardware, manned by a dedicated team? What options do you have?
I ping HN, it's 150ms away, it still renders in the same time that the Google frontpage does and that one has a 130ms advantage.
Getting the hardware closer to the users has always been trivial - call up any of the many hosting providers out there and get a dedicated server, or a colo and ship them some hardware (directly from the vendor if needed).
People who write that, well...
If you're greatly impacted by performance problems, how does that become a physics problem whose only solution is being closer to your users?
I think you're mixing up your sales points. One, how do you scale hardware? Simple: you buy some more, and/or you plan for more from the beginning.
How do you deal with network latency for users on the other side of the planet? Either you plan for and design for long tail networking, and/or you colocate in multiple places, and/or you host in multiple places. Being aware of cloud costs, problems and limitations doesn't mean you can't or shouldn't use cloud at all - it just means to do it where it makes sense.
You're making my point for me - you've got emotional generalizations ("you lose all credibility"), you're using examples that people use often but that don't even go together, plus you seem to forget that hardly anyone advocates for all one or all the other, without some kind of sensible mix. Thank you for making a good example of exactly what I'm talking about.
I love building the cool edge network stuff with expensive bleeding-edge hardware, SmartNICs, NVMe-oF, etc., but it's infinitely more complicated and stressful than terraforming an AWS infra. Every cluster I set up, I had to interact with multiple teams: networking, security, storage, sometimes maintenance/electrical, etc. You've got some random tech you have to rely on across the country in one of your POPs with a blown server. Every single hardware infra person has had a NOC tech kick/unplug a server at least once if they've been in long enough.
And then when I get the hardware, sometimes you have different people doing different parts of setup: the NOC does the boot, maybe bootstraps the hardware with something that works over SSH before an agent is installed (Ansible, etc.), then your Linux eng invokes their magic with a ton of Bash or Perl, then your k8s person sets up the k8s clusters, usually with something like Terraform/Puppet/Chef/Salt, probably calling Helm charts. Then your monitoring person gets it into OTEL/Grafana, etc. This all organically becomes more automated as time goes on, but I've seen it from a brand-new infra with no automation many times.
Now you're automating 90% of this via scripts and IaC, etc., but you're still doing a lot of tedious work.
You also have a much more difficult time hiring good engineers. The market's gone so heavily AWS (I'm no help) that it's rare I come across an ops resume that's ever touched hardware, especially not at the CDN distributed-systems level.
So... AWS is the chill infra that stays online and that you can basically rely on 99.99-something% of the time. Get some Terraform blueprints going and your own developers can self-serve. No need to get hardware or ops involved.
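For illustration, a self-serve blueprint can be as small as this (a hypothetical sketch; the AMI id, region, instance size, and tags are placeholders, not a recommendation):

```hcl
# main.tf - minimal blueprint a product team could copy and adapt
terraform {
  required_providers {
    aws = { source = "hashicorp/aws" }
  }
}

provider "aws" {
  region = "us-east-1" # placeholder region
}

resource "aws_instance" "app" {
  ami           = "ami-0123456789abcdef0" # placeholder AMI id
  instance_type = "t3.small"              # placeholder size
  tags = {
    Team = "product-x" # placeholder tag so ops can attribute the bill
  }
}
```

A `terraform plan` / `terraform apply` cycle from such a blueprint is the self-serve loop being described, with no hardware or ops hand-off in the middle.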
And none of this is even getting into supporting the clusters. Failing clusters. Dealing with maintenance, zero downtime kernel upgrades, rollbacks, yaddayadda.
I’ve worked at tech companies with hundreds of developers and single digit ops staff. Those people will struggle to build and maintain mature infra. By going cloud, you get access to mature infra just by including it in build scripts. Devops is an effective way to move infra back to project teams and cut out infra orgs (this isn’t great but I see it happen everywhere). Companies will pay cloud bills but not staffing salaries.
Computation has become a utility these days - this includes the fat ISP lines and connectivity etc, not just the CPU and harddrives. These things have economies of scale that smaller companies cannot truly reach, and will pay a huge fixed cost if they want state of the art management, monitoring and redundancy. So unless you are a massive consumer, just like power stations, you really don't need nor want to build your own.
The irony is absolutely dripping off this comment, wow.
The commenter makes an emotionally charged comment with no data or facts, then decries anyone who disagrees with them as repeating "silly talking points" and not caring about data and facts.
Your comment is entirely talking about itself.
DevOps and kubernetes come to mind. A lot of people using kubernetes don't know what they're getting into, and k0s or another single machine solution would have been enough for 99% of SMEs.
In terms of cyber security (my field), everything got so ridiculously complex that even the folks using 3 different dashboards in parallel will guess the answers as to whether or not they're affected by a bug/RCE/security flaw/weakness, because all of the data sources (even the expensively paid-for ones) are human-edited text databases. They're so buggy that they even have Chinese ideograms instead of a dot character in the version fields, without anyone ever fixing it upstream in the NVD/CVE process.
I started to build my EDR agent for POSIX systems specifically because I hope that at some point it can help companies ditch the cloud and self-host again, which in turn would indirectly prevent 13-year-old kids like those from LAPSUS$ from pwning major infrastructure via simple tech-support hotline calls.
When I think of it in terms of hosting, the vertical scalability of EPYC machines is so high that most of the time when you need its resources you are either doing something completely wrong and you should refactor your code or you are a video streaming service.
I'd expect that there are people who moved to the cloud then, and over time started using services offered by their cloud provider (e.g., load balancers, secret management, databases, storage, backup) instead of running those services themselves on virtual machines, and now even if it would be cheaper to run everything on owned servers they find it would be too much effort to add all those services back to their own servers.
Elasticity is a component, but has always been from a batch job bin packing scheduling perspective, not much new there. Before k8s and Nomad, there was Globus.org.
(Infra/DevOps in a previous life at a unicorn, large worker cluster for a physics experiment prior, etc; what is old is a new again, you’re just riding hype cycle waves from junior to retirement [mainframe->COTS on prem->cloud->on prem cloud, and so on])
2. People therefore repeat talking points which seem in their interest
3. With enough repetition these become their beliefs
4. People will defend their beliefs as theirs against attack
5. Goto 1
That came from technical people who I didn't perceive as being dogmatically pro-cloud.
As an industry, we are largely trading correctness and performance for convenience, and this is not seen as a negative by most. What kills me is that at every cloud-native place I've worked at, the infra teams were both responsible for maintaining and fixing the infra that product teams demanded, but were not empowered to push back on unreasonable requests or usage patterns. It's usually not until either the limits of vertical scaling are reached, or a SEV0 occurs where these decisions were the root cause does leadership even begin to consider changes.
If you enable Multi-AZ for RDS, your bill doubles until you cancel. If you set up two servers in two DCs, your initial bill doubles from the CapEx, and then a very small percentage of your OpEx goes up every month for the hosting. You very, very quickly make this back compared to cloud.
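As a back-of-the-envelope sketch (all the numbers below are made up for illustration, not actual AWS or colo prices), the break-even point lands within a couple of years:

```shell
# Hedged sketch with invented numbers: months until two self-hosted DB
# servers (one-off capex plus colo opex) become cheaper than a doubled
# managed-database bill.
CLOUD_MONTHLY=580    # hypothetical Multi-AZ managed-DB bill per month
CAPEX=8000           # hypothetical one-off: two servers, one per DC
SELF_MONTHLY=300     # hypothetical colo + power + bandwidth for both

month=1
# keep counting while cumulative self-hosted cost exceeds cumulative cloud cost
while [ $((CAPEX + SELF_MONTHLY * month)) -gt $((CLOUD_MONTHLY * month)) ]; do
  month=$((month + 1))
done
echo "break-even at month $month"
# prints: break-even at month 29
```

After the break-even month, the OpEx gap compounds in the self-hosted setup's favor every month, which is the "very quickly make this back" effect.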
After that, yeah we’ll let AWS do the hard work of enabling redundancy for us.
I feel like this can be applied to anything.
I had a manager take one SAFe for Leaders class and then come back wanting to implement it. They had no previous Agile classes or experience. And the Enterprise Agile Office was saying DON'T USE SAFe!!
But they had one class and that was the only way they would agree to structure their group.
I once worked for several years at a publicly traded firm well-known for their return-to-on-prem stance, and honestly it was a complete disaster. The first-party hardware designs didn't work right because they didn't have the hardware-design staffing levels to have de-risked the possibility that AMD would fumble the performance of Zen 1, leaving them with a generation of useless hardware they nonetheless paid for. The OEM hardware didn't work right because they didn't have the chops to qualify it either, leaving them scratching their heads for months over a cohort of servers they eventually discovered were contaminated with metal chips. And, most crucially, for all the years I worked there, the only thing they wanted to accomplish was failover from West Coast to East Coast, which never worked, not even once. When I left that company they were negotiating with the data center owner, who wanted to triple the rent.
These experiences tell me that cloud skeptics are sometimes missing a few terms in their equations.
It's been my experience that those who can build good, reliable, high-quality systems, can do so either in the cloud or on-prem, generally with equal ability. It's just another platform to such people, and they will use it appropriately and as needed.
Those who can only make it work in the cloud are either building very simple systems (which is one place where the cloud can be appropriate), or are building a house of cards that will eventually collapse (or just cost them obscene amounts of money to keep on life support).
Engineering is engineering. Not everyone in the business does it, unfortunately.
Like everything, the cloud has its place -- but don't underestimate the number of decisions that get taken out of the hands of technical people by the business people who went golfing with their buddy yesterday. He just switched to Azure, and it made his accountants really happy!
The whole CapEx vs. OpEx issue drives me batty; it's the number one cause of cloud migrations in my career. For someone who feels like spent money should count as spent money regardless of the bucket it comes out of, this twists my brain in knots.
I'm clearly not a finance guy...
Yes. Mass psychosis explains an incredible number of different and apparently unrelated problems with the industry.
Those providers take on the liability of sourcing, managing, and maintaining the hardware for a flat monthly fee, and the risk that comes with it. If they make a bad bet purchasing hardware, you won't be on the hook for it.
This seems like a point many pro-cloud people (intentionally?) overlook.
What's the market share of Windows again? ;)
> If I didn't already self-host email
this really says all that needs to be said about your perspective. you have an engineer and OSS advocate's mindset. which is fine, but most business leaders (including technical leaders like CTOs) have a business mindset, and their goal is to build a business that makes money, not avoid contributing to the centralization of the internet
From a cost PoV, sure, but when you're taking money out of capex it represents a big hit to the cash flow, while taking out twice that amount from opex has a lower impact on the company finances.
I use AWS cloud a lot, and almost never use any VMs or instances. Most instances I use are along the lines of a simple anemic box for a bastion host or some such.
I use higher level abstractions (services) to simplify solutions and outsource maintenance of these services to AWS.
You can't even blame them too much, the amount of cash poured into cloud marketing is astonishing.
Cloud has definite advantages in some circumstances, but so does self-hosting; moreover, understanding the latter makes the former much, much easier to reason about. It’s silly to limit your career options.
It seems like they all abandoned their VMware farms or physical server farms for Azure (they love Microsoft).
Are they actually saving money? Are things faster? How's performance? What was the re-training/hiring like?
In one case I know we got rid of our old database greybeards and replaced them with "DevOps" people that knew nothing about performance etc
And the developers (and many of the admins) we had knew nothing about hardware or anything so keeping the physical hardware around probably wouldn't have made sense anyways
That is, even if things became cheaper/faster, they might have been even better without cloud infrastructure.
It seems a lot of those DevOps people just see Azure's recommendations for adding indexes and either allow auto-applying them or add them without actually reviewing or understanding which workloads require them and why. Some of this also lands on developers/product who don't think critically about, or communicate, which queries are common, and who put no forethought into which indexes would be beneficial. (Yes, follow-up monitoring of actual index usage and possible missing indexes is still needed.) Too many times I've seen dozens of indexes on tables in the cloud where one could have covered all of them. There might still be worthwhile reasons to keep some narrower/smaller indexes, but DBA work and critical query analysis seem to be forgotten and neglected skills. No one owns monitoring and analysing DB queries, and it only comes up after a fire has already broken out.
There is a size where self-hosting makes sense, but it's much larger than you think.
> a desire to not centralize the Internet
This is an ideological stance! I happen to share this desire. But you should be aware of your own non-technical - "emotional" - biases when dismissing the arguments of others on the grounds that they are "emotional" and "fanatical".
I do think it's more than just emotional, though, but most people, even technical people, haven't taken the time to truly consider the problems that will likely come with centralization. That's a whole separate discussion, though.
There's not nearly enough in here to make a judgment about things like security or privacy. They have the bare minimum encryption enabled. That's better than nothing. But how is key access handled? Can they recover your email if the entire cluster goes down? If so, then someone has access to the encryption keys. If not, then how do they meet reliability guarantees?
Three letter agencies and cyber spies like to own switches and firewalls with zero days. What hardware are they using, and how do they mitigate against backdoors? If you really cared about this you would have to roll your own networking hardware down to the chips. Some companies do this, but you need to have a whole lot of servers to make it economical.
It's really about trade-offs. I think the big trade-offs favoring staying off cloud are cost (in some applications), distrust of the cloud providers, and avoiding the US Government.
The last two are arguably judgment calls that have some inherent emotional content. The first is calculable in principle, but people may not be using the same metrics. For example if you don't care that much about security breaches or you don't have to provide top tier reliability, then you can save a ton of money. But if you do have to provide those guarantees, it would be hard to beat Cloud prices.
I’m sure I’ll be downvoted to hell for this, but I’m convinced that it’s largely their insecurities being projected.
Running your own hardware isn’t tremendously difficult, as anyone who’s done it can attest, but it does require a much deeper understanding of Linux (and of course, any services which previously would have been XaaS), and that’s a vanishing trait these days. So for someone who may well be quite skilled at K8s administration, serverless (lol) architectures, etc. it probably is seen as an affront to suggest that their skill set is lacking something fundamental.
And running your own hardware is not incompatible with Kubernetes: on the contrary. You can fully well have your infra spin up VMs and then do container orchestration if that's your thing.
And part of your hardware monitoring and reporting tooling can work perfectly fine from containers.
Bare metal -> Hypervisor -> VM -> container orchestration -> a container running a "stateless" hardware monitoring service. And VMs themselves are "orchestrated" too. Everything can be automated.
Anyway, say a hard disk begins to show errors: notifications get sent (email/SMS/Telegram/whatever) by another service in another container, and the dashboard shows it too (dashboards are cool).
Go to the machine once the spare disk has already been resilvered, move it to where the failed disk was, and plug in a new disk that becomes the new spare.
Boom, done.
I'm not saying all self-hosted hardware should do container orchestration: there are valid use cases for bare metal too.
But something has to be said for controlling everything on your own infra: from the bare metal to the VMs to container orchestration, even, potentially, your own IP address space.
This is all within reach of an individual, both skill-wise and price-wise (including obtaining your own IP address space). People who drank the cloud kool-aid should ponder this and wonder how good their skills truly are if they cannot get this up and working.
Add in compliance, auditing, etc. - all things that you can set up out of the box (PCI, HIPAA, lawsuit retention). It gets even cheaper.
Same sentiment all of what you said.
Are you new to the internet?
This feels like "no true scotsman" to me. I've been building software for close to two decades, but I guess I don't have "any real technical understanding" because I think there's a compelling case for using "cloud" services for many (honestly I would say most) businesses.
Nobody is "afraid to openly discuss how cloud isn't right for many things". This is extremely commonly discussed. We're discussing it right now! I truly cannot stand this modern innovation in discourse of yelling "nobody can talk about XYZ thing!" while noisily talking about XYZ thing on the lowest-friction publishing platforms ever devised by humanity. Nobody is afraid to talk about your thing! People just disagree with you about it! That's ok, differing opinions are normal!
Your comment focuses a lot on cost. But that's just not really what this is all about. Everyone knows that on a long enough timescale with a relatively stable business, the total cost of having your own infrastructure is usually lower than cloud hosting.
But cost is simply not the only thing businesses care about. Many businesses, especially new ones, care more about time to market and flexibility. Questions like "how many servers do we need? with what specs? and where should we put them?" are a giant distraction for a startup, or even for a new product inside a mature firm.
Cloud providers provide the service of "don't worry about all that, figure it out after you have customers and know what you actually need".
It is also true that this (purposefully) creates lock-in that is expensive either to leave in place or unwind later, and it definitely behooves every company to keep that in mind when making architecture decisions, but lots of products never make it to that point, and very few of those teams regret the time they didn't spend building up their own infrastructure in order to save money later.
For businesses, it's a very typical lease-or-own decision. There's really nothing too special about cloud.
> On the other hand, a business of just about any size that has any reasonable amount of hosting is better off with their own systems when it comes purely to cost.
Nope. Not if you factor-in 24/7 support, geographic redundancy, and uptime guarantees. With EC2 you can break even at about $2-5m a year of cloud spending if you want your own hardware.
If we used AWS, we could skip months of certification. If we use a custom data center, we have to certify it ourselves (muuuuuch more expensive).
From this standpoint, cloud beats on-premise.
If you have predictable workloads, a competent engineering culture that fights against process culture, and are willing to spend the money to have good hardware and the people to man it 24x7x365 then I don’t think cloud makes sense at all. Seems like that’s what y’all have and you should keep up with it.
If it takes this long to manage a machine, I strongly suspect it means that when initially designing the system engineers had failed to account for those for some reason. Was that true in your case?
Back in the late '00s until the mid '10s, I worked for an ISP startup as a SWE. We had a few core machines (database, RADIUS server, self-service website, etc.) - an ugly mess TBH - initially provisioned and originally managed entirely by hand, as we didn't know any better back then. Naturally, maintaining those was a major PITA, so they sat on the same dated distro for years. That was before Ansible was a thing, and we hadn't really heard about Salt or Chef before we started to feel the pains and began searching for solutions. Virtualization (OpenVZ, then Docker) helped to soften a lot of issues, making it significantly easier to maintain the components, but the pains from our original sins were felt for a long time.
But we also had a fleet of other machines, where we understood our issues with the servers enough to design new nodes to be as stateless as possible, with automatic rollout scripts for whatever we were able to automate. Provisioning a new host took only a few hours, with most time spent unpacking, driving, accessing the server room, and physically connecting things. Upgrades were pretty easy too - reroute customers to another failover node, write a new system image to the old one, reboot, test, re-route traffic back, done.
So it's not like self-owned bare metal is harder to manage - the lesson I learned is that one just gotta think ahead of time what the future would require. Same as the clouds, I guess, one has to follow best practices or they'll end up with crappy architectures that will be painful to rework. Just different set of practices, because of the different nature of the systems.
Are you running a well-understood and predictable (as in, little change, growth, or feature additions) system? Are your developers handing over to central platform/infra/ops teams? You'll probably save some cash by buying and owning the hardware you need for your use case(s). Elasticity is (probably) not part of your vocabulary, perhaps outside of "I wish we had it" anyway.
Have you got teams and/or products that are scaling rapidly or unpredictably? Have you still got a lot of learning and experimenting to do with how your stack will work? Do you need flexibility but can't wait for that flexibility? Then cloud is for you.
n.b. I don't think I've ever felt more validated by a post/comment than yours.
Bonus points: they can do it with spot pricing to further lower the bill.
The cloud offers immense flexibility and empowers _developers_ to easily manage their own infrastructure without depending on other teams.
Speed of development is the primary reason $DayJob is moving into the cloud, while maintaining bare-metal for platforms which rarely change.
You could get same day builds deployed on prem with the right support bundle!
in case you want to ballpark-estimate your move off of the cloud
Bonus points: I'm a Fastmail customer, so it tangentially tracks
----
Quick note about the article: ZFS encryption can be flaky - be sure you know what you're doing before deploying it in your infrastructure.
Relevant Reddit discussion: https://www.reddit.com/r/zfs/comments/1f59zp6/is_zfs_encrypt...
A spreadsheet of related issues that I can't remember who made:
https://docs.google.com/spreadsheets/d/1OfRSXibZ2nIE9DGK6sww...
This is the current script - it runs every minute for each pool synced between the two log servers: https://gist.github.com/brong/6a23fee1480f2d62b8a18ade5aea66...
LUKS2 has up to 32 key slots (up from 8 in LUKS1).
I run ZoL over LUKS2 and it works great.
1) "At this rate, we’ll replace these [SSD] drives due to increased drive sizes, or entirely new physical drive formats (such E3.S which appears to finally be gaining traction) long before they get close to their rated write capacity."
and
2) "We’ve also anecdotally found SSDs just to be much more reliable compared to HDDs (..) easily less than one tenth the failure rate we used to have with HDDs."
The new NVMe drives we've only had for a few years, but so far there's only been a single failure across the whole fleet, and we keep spares in stock. It's been very reliable, not like the weeks back in (hmm, 2006? 2007?) the ancient past, when we were losing 15kRPM velociraptors every other day. They had a firmware fault and we eventually got an update which made them reliable, but it was a wild few months.
We had one outage where key rotation had been enabled on reboot, so data partitions were lost after what should have been a routine crash. Overall, for data warehousing, our failure rate on on-prem (DC-hosted) hardware was lower IME.
Things like identity management (AAD/IAM), provisioning and running VMs, deployments. The network side of things like VNets, DNS, securely opening ports, etc. Monitoring setup across the stack. There is so much functionality required to safely expose an application externally that I can't even coherently list it all here. Are people just using SaaS for everything (which I think would defeat the purpose of on-prem infra), or can a competent sysadmin handle all this to give a cloud-like experience to end developers?
Can someone share their experience or share any write ups on this topic?
For more context: I briefly worked at a very large hedge fund which had a small DC's worth of VERY beefy machines but absolutely no platform on top of them. Hosting an application was done by copying the binaries onto a particular well-known machine, running npm commands, and restarting nginx. You'd log a ticket with the sysadmin to reserve a name and point an internal DNS entry at this machine (no load balancer). Deployment was a shell script which rcp'd new binaries and restarted nginx. No monitoring or observability stack. There was a script which would log you into a random machine for you to run your workloads (be ready to get angry IMs from more senior quants running their workloads on that random machine if your development build takes up enough resources to affect their work). I can go on and on, but I think you get the idea.
It's clunky, but simple, repeatable, and easily understood.
As for the bigger things, software etc - we have scripts that generate Debian packages which we store in our own private repo. You just install `fastmail-server` and the dependency management updates everything. There's a daily cronjob which checks if there are updated security packages or thing we failed to correctly deploy and emails us as well.
It's amazing what you can build on top of the OS provided tools with not too much complexity if you don't overthink it.
Do you mean for administrative access to the machines (over SSH, etc) or for "normal" access to the hosted applications?
Admin access: Ansible-managed set of UNIX users & associated SSH public keys, combined with remote logging so every access is audited and a malicious operator wiping the machine can't cover their tracks will generally get you pretty far. Beyond that, there are commercial solutions like Teleport which provide integration with an IdP, management web UI, session logging & replay, etc.
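A minimal sketch of the Ansible-managed side (the variable name, key layout, and group are hypothetical, not a standard):

```yaml
# tasks/admin-users.yml - hypothetical task file; `admin_users` would be
# defined in group_vars, e.g. [{name: alice}, {name: bob}]
- name: Create admin users
  ansible.builtin.user:
    name: "{{ item.name }}"
    groups: sudo
    shell: /bin/bash
  loop: "{{ admin_users }}"

- name: Install their SSH public keys
  ansible.posix.authorized_key:
    user: "{{ item.name }}"
    key: "{{ lookup('file', 'keys/' + item.name + '.pub') }}"
  loop: "{{ admin_users }}"
```

Removing someone then means dropping them from `admin_users` (and adding a removal task with `state: absent`), re-running the playbook, and the fleet converges.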
Normal line-of-business access: this would be managed by whatever application you're running, not much different to the cloud. But if your application isn't auth-aware or is unsafe to expose to the wider internet, you can stick it behind various auth proxies such as Pomerium - it will effectively handle auth against an IdP and only pass through traffic to the underlying app once the user is authenticated. This is also useful for isolating potentially vulnerable apps.
> provisioning and running VMs
Provisioning: once a VM (or even a physical server) is up and running enough to be SSH'd into, you should have a configuration management tool (Ansible, etc) apply whatever configuration you want. This would generally involve provisioning users, disabling some stupid defaults (SSH password authentication, etc), installing required packages, etc.
To get a VM to an SSH'able state in the first place, you can configure your hypervisor to pass through "user data" which will be picked up by something like cloud-init (integrated by most distros) and interpreted at first boot - this allows you to do things like include an initial SSH key, create a user, etc.
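A minimal cloud-init user-data sketch along those lines (the username, key, and package choice are placeholders):

```yaml
#cloud-config
# Hypothetical first-boot configuration passed to the VM as user data.
users:
  - name: ops                       # placeholder admin user
    groups: [sudo]
    shell: /bin/bash
    sudo: ALL=(ALL) NOPASSWD:ALL
    ssh_authorized_keys:
      - ssh-ed25519 AAAA... ops@example   # placeholder public key
package_update: true
packages: [python3]   # enough for a config management tool to take over
ssh_pwauth: false     # disable SSH password auth from the very first boot
```

Once the machine is reachable as `ops@<ip>`, the configuration management tool from the previous step does the rest.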
To run VMs on self-managed hardware: libvirt, proxmox in the Linux world. bhyve in the BSD world. Unfortunately most of these have rough edges, so commercial solutions there are worth exploring. Alternatively, consider if you actually need VMs or if things like containers (which have much nicer tooling and a better performance profile) would fit your use-case.
> deployments
Depends on your application. But let's assume it can fit in a container - there's nothing wrong with a systemd service that just reads a container image reference in /etc/... and uses `docker run` to run it. Your deployment task can just SSH into the server, update that reference in /etc/ and bounce the service. Evaluate Kamal which is a slightly fancier version of the above. Need more? Explore cluster managers like Hashicorp Nomad or even Kubernetes.
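A sketch of that systemd-unit approach (the unit name, paths, and port are made up):

```ini
# /etc/systemd/system/myapp.service - hypothetical unit; the image
# reference lives in /etc/myapp/image, so a deploy just rewrites that
# file and restarts the unit
[Unit]
Description=myapp container
After=docker.service
Requires=docker.service

[Service]
ExecStartPre=-/usr/bin/docker rm -f myapp
ExecStart=/bin/sh -c '/usr/bin/docker run --name myapp --rm -p 8080:8080 "$(cat /etc/myapp/image)"'
Restart=always

[Install]
WantedBy=multi-user.target
```

A deploy from CI is then roughly `ssh host 'echo registry.example/myapp:v42 | sudo tee /etc/myapp/image && sudo systemctl restart myapp'`.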
> Network side of things like VNet
Wireguard tunnels set up (by your config management tool) between your machines, which will appear as standard network interfaces with their own (typically non-publicly-routable) IP addresses, and anything sent over them will transparently be encrypted.
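For example, a per-host config your tooling might template out (keys, hostnames, and addresses are placeholders):

```ini
# /etc/wireguard/wg0.conf
[Interface]
Address = 10.10.0.1/24            # private address on the mesh
PrivateKey = <this-host-private-key>
ListenPort = 51820

[Peer]
PublicKey = <peer-public-key>
AllowedIPs = 10.10.0.2/32         # route this peer's mesh address via the tunnel
Endpoint = peer.example.com:51820
```

`wg-quick up wg0` (or the `wg-quick@wg0` systemd unit) brings up the interface; anything addressed within 10.10.0.0/24 is then transparently encrypted.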
> DNS
Generally very little reason not to outsource this to a cloud provider or even your (reputable!) domain registrar; hosted offerings are super cheap, so go open an AWS account and configure Route53. DNS is mostly static data though, which means that if you do need to run it in-house for whatever reason, it's just a matter of getting a CoreDNS/etc container running on multiple machines (maybe even distributed across the world).
> securely opening ports
To begin with, you shouldn't have anything listening that you don't want to be accessible. Then it's not a matter of "opening" or closing ports - the only ports that actually listen are the ones you want open by definition because it's your application listening for outside traffic. But you can configure iptables/nftables as a second layer of defense, in case you accidentally start something that unexpectedly exposes some control socket you're not aware of.
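As a sketch of that second layer (the ports and ruleset are illustrative, not a recommendation):

```
# /etc/nftables.conf - default-deny inbound; only SSH and HTTPS are reachable
table inet filter {
  chain input {
    type filter hook input priority 0; policy drop;
    ct state established,related accept
    iif "lo" accept
    icmp type echo-request accept
    icmpv6 type { echo-request, nd-neighbor-solicit, nd-neighbor-advert, nd-router-advert } accept
    tcp dport { 22, 443 } accept
  }
}
```

With `policy drop`, a service that accidentally binds 0.0.0.0 is still unreachable from outside unless you explicitly open its port.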
> Monitoring setup across the stack
collectd running on each machine (deployed by your configuration management tool) sending metrics to a central machine. That machine runs Grafana/etc. You can also explore "modern" stuff that the cool kids play with nowadays like VictoriaMetrics, etc, but metrics is mostly a solved problem so there's nothing wrong with using old tools if they work and fit your needs.
For logs, configure rsyslogd to log to a central machine - on that one, you can have log rotation. Or look into an ELK stack. Or use a hosted service - again nothing prevents you from picking the best of cloud and bare-metal, it's not one or the other.
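The rsyslog side of that is a couple of lines (the address is a placeholder; `@@` means TCP, a single `@` would be UDP):

```
# On each machine, /etc/rsyslog.d/forward.conf: ship everything to the central
# host over the Wireguard network
*.* @@10.10.0.1:514

# On the central host, /etc/rsyslog.d/listen.conf: accept TCP syslog
module(load="imtcp")
input(type="imtcp" port="514")
```

Your config management tool drops the first file on every machine and the second on the log host.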
> safely expose an application externally
There's a lot of snake oil and fear-mongering around this. First off, you need to differentiate between vulnerabilities of your application and vulnerabilities of the underlying infrastructure/host system/etc.
App vulnerabilities, in your code or dependencies: cloud won't save you. It runs your application just like it's been told. If your app has an SQL injection vuln or one of your dependencies has an RCE, you're screwed either way. To manage this you'd do the same as you do in cloud - code reviews, pentesting, monitoring & keeping dependencies up to date, etc.
Infrastructure-level vulnerabilities: cloud providers are responsible for keeping the host OS and their provided services (load balancers, etc) up to date and secure. You can do the same. Some distros provide unattended updates, which your config management tool can enable. Stuff that doesn't need to be reachable from the internet shouldn't be (bind internal stuff to your Wireguard interfaces). Put admin stuff behind some strong auth - TLS client certificates are the gold standard but have management overheads. Otherwise, use an IdP-aware proxy (like mentioned above). Don't blindly trust app-level auth. Beyond that, it's the usual - common sense, monitoring for "spooky action at a distance", and luck. Not too different from your cloud provider, which won't compensate you either if they do get hacked.
> For more context, I worked at a very large hedge fund briefly which had a small DC worth of VERY beefy machines but absolutely no platform on top of it...
Nomad or Kubernetes.
I ended up switching to Protonmail because of privacy (Fastmail, being Australian, is within the Five Eyes), which is the only thing I really like about Protonmail. But I'm considering switching back to Fastmail, because I liked it so much.
To expand: At $dayjob we use AWS, and we have no plans to switch because we're tiny, like ~5000 DAU last I checked. Our AWS bill is <$600/mo. To get anything remotely resembling the reliability that AWS gives us we would need to spend tens of thousands up-front buying hardware, then something approximating our current AWS bill for colocation services. Or we could host fully on-prem, but then we're paying even more up-front for site-level stuff like backup generators and network multihoming.
Meanwhile, RDS (for example) has given us something like one unexplained 15-minute outage in the last six years.
Obviously every situation is unique, and what works for one won't work for another. We have no expectation of ever having to suddenly 10x our scale, for instance, because our growth is limited by other factors. But at our scale, given our business realities, I'm convinced that the cloud is the best option.
Very few non-cloud users are buying their own hardware. You can simply rent dedicated hardware in a datacenter. For significantly cheaper than anything in the cloud. That being said, certain things like object storage, if you don't need very large amounts of data, are very handy and inexpensive from cloud services considering the redundancy and uptime they offer.
I should note that Microsoft also does this.
Global permissions, seamless organization, and IaC. If you are Fastmail or a small startup - go buy some used Dell PowerEdge with EPYCs in a colo rack with 10GbE transit and save tons of money.
If you are a company with tons of customers and tons of requirements, it's powerful to put each concern into a landing zone, run some Bicep/Terraform, have a resource group to control costs, get savings on overall core count, and be done with it.
Assign permissions into a namespace for your employee or customer, have some back and forth about requirements, and it's done. No need to sysadmin across servers. No need to check for broken disks.
I'm also blaming the hell of VMware and virtual machines for everything that is a PITA to maintain as a sysadmin but is loved because it's common knowledge. I would only do k8s on bare metal today and skip the whole virtualization thing completely. I guess it's also these pains that are softened in the cloud.
I've even worked in companies where the engineering team spent effort and time on building "scalable infrastructure" before the product itself even found product-market fit...
find /path/to/subtree -type f | parallel -j250 rm --
rm -r /path/to/subtree
A friend had this issue on spinning disks the other day. I suggested he do this, and the remaining files were gone in seconds, when at the rate his naive rm was running, it should have taken minutes. It is a shame that rm does not implement a parallel unlink option internally (e.g. -j), which would be even faster, since it would eliminate the execve overhead and likely some directory lookup overhead too, versus using find and parallel to run many rm processes. For something like Fastmail that has many users, unlinking should be parallel already, so unlinking on ZFS will not be slow for them.
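If GNU parallel isn't available, `xargs -P` from findutils gets most of the same effect. A sketch against a throwaway directory (the path is made up for the demo):

```shell
# Hypothetical demo directory; in practice this would be the subtree to delete.
mkdir -p /tmp/rmdemo
seq 1 500 | xargs -I{} touch /tmp/rmdemo/f{}
# -print0 / -0 handle odd filenames; -P 8 runs up to 8 rm processes at once,
# -n 64 batches 64 paths per rm to amortize the execve overhead
find /tmp/rmdemo -type f -print0 | xargs -0 -n 64 -P 8 rm --
rmdir /tmp/rmdemo   # succeeds only if the directory is now empty
```

The batching (-n) and parallelism (-P) are tunable; on spinning disks the win comes from keeping multiple unlinks in flight rather than serializing them.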
By the way, that 80% figure has not been true for more than a decade. You are referring to the best fit allocator being used to minimize external fragmentation under low space conditions. The new figure is 96%. It is controlled by metaslab_df_free_pct in metaslab.c:
https://github.com/openzfs/zfs/blob/zfs-2.2.0/module/zfs/met...
Modification operations become slow when you are at/above 96% space filled, but that is to prevent even worse problems from happening. Note that my friend’s pool was below the 96% threshold when he was suffering from a slow rm -r. He just had a directory subtree with a large amount of directory entries he wanted to remove.
For what it is worth, I am the ryao listed here and I was around when the 80% to 96% change was made:
Deletion of files depends on how they have configured the message store - they may be storing a lot of data into a database, for example.
I think that 80% figure is from when drives were much smaller and finding free space over that threshold with the first-fit allocator was harder.
For now, the "file storage" product is a Node tree in mysql, with content stored in a content-addressed blob store, which is some custom crap I wrote 15 years ago that is still going strong because it's so simple there's not much to go wrong.
We do plan to eventually move the blob storage into Cyrus as well though, because then we have a single replication and backup system rather than needing separate logic to maintain the blob store.
Today, the cloud isn’t about other people’s hardware.
It’s about infrastructure being an API call away. Not just virtual machines but also databases, load-balancers, storage, and so on.
The cost isn’t the DC or the hardware, but the hours spent on operations.
And you can abuse developers to do operations on the side :-)
And then it is still also about other people's hardware in addition to that.
> Fastmail has some of the best uptime in the business, plus a comprehensive multi data center backup system. It starts with real-time replication to geographically dispersed data centers, with additional daily backups and checksummed copies of everything. Redundant mirrors allow us to failover a server or even entire rack in the case of hardware failure, keeping your mail running.
"A private inbox $60 for 12 months". I assume it is USD, not AU$ (AFAIK, Fastmail is based in Australia.) Still pricey.
At https://www.infomaniak.com/ I can buy email service for an (in my case external) domain for 18 Euro a year and I get 5 inboxes. And it is based in Switzerland, so no EU or US jurisdiction.
I have a few websites, and Fastmail would just be prohibitively expensive for me.
I've used them for 20 years now. Highly recommended.
Migadu is just all around good; the only downsides I can find are subjective: the fact that they're based in Switzerland, and that unless you're "good with computers", something like Fastmail will probably be better.
I'm paying something like $10 per year for multiple domains with multiple email addresses (though with little traffic). I've been using them for about 5 years and I had absolutely no issues.
The only things I wish FM had are all software:
1. A takeout-style API to let me grab a complete snapshot once a week with one call
2. The ability to be an IdP for Tailscale.
Custom compression code can introduce bugs that could kill Fastmail's reputation for reliability.
It's better to use a well-tested solution that costs a bit more.
Frankly, given that emails are normally ~4kB objects, I suspect the compression overheads are probably not worth it unless it's for attachments only. Not attacking ZFS (its compression and checksumming are among the best in class), but the compression would work better if it weren't limited to small files. ZFS has made a lot of wins here, and I've not had a problem with many files on ZFS thanks to the L1/L2 ARC, but the cost is that metadata ops can be painful on many small files.
The evidence that they are IOPS limited is that they went for SSD or better when they could store the same capacity on rust for much cheaper.
Yeah, I think moving the compression or file access up a layer, to abstract what is being written to disk a la Protonmail (I don't like their offerings, but like their tech), means you can have compression over 4MB rather than 4kB blocks, which matters when you recall data from disks for, I don't know... backups or search?
also remember RAID!=backups ;)
If you're looking at ZFS on NVMe you may want to look at Alan Jude's talk on the topic, "Scaling ZFS for the future", from the 2024 OpenZFS User and Developer Summit:
* https://www.youtube.com/watch?v=wA6hL4opG4I
* https://openzfs.org/wiki/OpenZFS_Developer_Summit_2024
There are some bottlenecks that get in the way of getting all the performance that the hardware often is capable of.
Also, who cares if a single filesystem dies, that's why you have inter-server replication. Nuke the bad server and rebuild before the next 3 or 4 die.
Plus it has protocol consistency sanity checks built in.
Plus, I wrote it :p
[1]https://hanselminutes.com/847/engineering-stack-overflow-wit...
https://stackoverflow.blog/2023/08/30/journey-to-the-cloud-p...
The why is the interesting part of this article.
"Although we’ve only ever used datacenter class SSDs and HDDs, failures and replacements every few weeks were a regular occurrence on the old fleet of servers. Over the last 3+ years, we’ve only seen a couple of SSD failures in total across the entire upgraded fleet of servers. This is easily less than one tenth the failure rate we used to have with HDDs."
other than that, i'm happy with fastmail.
Fastmail explicitly says that moving mail to/from a spam folder via a mail client does not automatically retrain. <https://www.fastmail.help/hc/en-us/articles/1500000278142-Im...> (I never did figure out if Gmail acts the same way or not.)
The cloud providers really kill you on IO for your VMs. Even if 'remote' SSDs are available with configurable ($$) IOPs/bandwidth limits, the size of your VM usually dictates a pitiful max IO/BW limit. In Azure, something like a 4-core 16GB RAM VM will be limited to 150MB/s across all attached disks. For most hosting tasks, you're going to hit that limit far before you max out '4 cores' of a modern CPU or 16GB of RAM.
On the other hand, if you buy a server from Dell and run your own hypervisor, you get a massive reserve of IO, especially with modern SSDs. Sure, you have to share it between your VMs, but you own all of the IO of the hardware, not some pathetic slice of it like in the cloud.
As is always said in these discussions, unless you're able to move your workload to PaaS offerings in the cloud (serverless), you're not taking advantage of what large public clouds are good at.
(my experience with managed kubernetes)
1. The cost of the server is not the cost of on-prem. There are so many different kinds of costs that aren't just monetary. ("we have to do more ourselves, including planning, choosing, buying, installing, etc,") Those are tasks that require expertise (which 99% of "engineers" do not possess at more than a junior level), and time, and staff, and correct execution. They are much more expensive than you will ever imagine. Doing any of them wrong will cause issues that will eventually cost you business (customers fleeing, avoiding). That's much worse than a line-item cost.
2. You have to develop relationships for good on-prem. In order to get good service in your rack (assuming you don't hire your own cage monkey), in order to get good repair people for your hardware service accounts, in order to ensure when you order a server that it'll actually arrive, in order to ensure the DC won't fuck up the power or cooling or network, etc. This is not something you can just read reviews on. You have to actually physically and over time develop these relationships, or you will suffer.
3. What kind of load you have and how you maintain your gear is what makes a difference between being able to use one server for 10 years, and needing to buy 1 server every year. For some use cases it makes sense, for some it really doesn't.
4. Look at all the complex details mentioned in this article. These people go deep, building loads of technical expertise at the OS level, hardware level, and DC level. It takes a long time to build that expertise, and you usually cannot just hire for it, because it's generally hard to find. This company is very unique (hell, their stack is based on Perl). Your company won't be that unique, and you won't have their expertise.
5. If you hire someone who actually knows the cloud really well, and they build out your cloud env based on published well-architected standards, you gain not only the benefits of rock-solid hardware management, but benefits in security, reliability, software updates, automation, and tons of unique features like added replication, consistency, availability. You get a lot more for your money than just "managed hardware", things that you literally could never do yourself without 100 million dollars and five years, but you only pay a few bucks for it. The value in the cloud is insane.
6. Everyone does cloud costs wrong the first time. If you hire somebody who does have cloud expertise (who hopefully did the well-architected buildout above), they can save you 75% off your bill, by default, with nothing more complex than checking a box and paying some money up front (the same way you would for your on-prem server fleet). Or they can use spot instances, or serverless. If you choose software developers who care about efficiency, they too can help you save money by not needing to over-allocate resources, and right-sizing existing ones. (Remember: you'd be doing this cost and resource optimization already with on-prem to make sure you don't waste those servers you bought, and that you know how many to buy and when)
7. The major takeaway at the end of the article is "when you have the experience and the knowledge". If you don't, then attempting on-prem can end calamitously. I have seen it several times. In fact, just one week ago, a business I work for had three days of downtime, due to hardware failing, and not being able to recover it, their backup hardware failing, and there being no way to get new gear in quickly. Another business I worked for literally hired and fired four separate teams to build an on-prem OpenStack cluster, and it was the most unstable, terrible computing platform I've used, that constantly caused service outages for a large-scale distributed system.
If you're not 100% positive you have the expertise, just don't do it.
I've seen similarly unstable cloud systems. It's generally not the tool's fault, it's the skill of the wielder.
good on them, understanding infrastructure and cost/benefit is essential in any business you hope to run for the long haul
The heading in this space makes you think they're running custom FPGAs, such as with Gmail, not just running on metal... As for drive failures, welcome to storage at scale. Build your solution so it's a weekly task to replace 10 disks at a time, not critical at 2am when a single disk dies...
Storing/Accessing tonnes of <4kB files is difficult, but other providers are doing this on their own metal with CEPH at the PB scale.
I love ZFS, it's great with per-disk redundancy but CEPH is really the only game in town for inter-rack/DC resilience which I would hope my email provider has.