I think Google Cloud Run is obscenely ahead of its time. It's a product that's adjacent to so many competitors, yet has no direct competitor, and it has staked out that niche in a way that makes it such a valuable product.
It's serverless, but not "Lambda serverless" or "Vercel serverless", which force you to adopt an entirely different programming model. It's just Docker containers. But it's also not "serverless" in the way Fargate or ACS is; it still scales to zero.
There's a lot of competition in the managed infrastructure space right now (Railway, Render, Fly, Vercel, etc). But I haven't seen anyone trying to do what Cloud Run does. Cloud Run has its disadvantages (cold starts are bad; it could also be a great fit for background workers/queue consumers/etc., but Google hasn't added any way to scale replicas on anything other than incoming HTTP requests yet).
But the model is so compelling that I wish more companies would explore that space, rather than retreating to "how things have always been done" ("pay us $X/mo to run a process") or to the much more boring "custom serverless runtime": "your app is now an 'AWS Lambda app' and can't run anywhere else, congrats".
Fly Machines are more powerful than Google Cloud Run, IMO. You can treat them like Cloud Run, or manage them directly and implement your own serverless model.
Our PaaS orchestration is implemented entirely in the client CLI, and it manages Fly Machines directly: https://fly.io/docs/machines/
To go a bit further, I'm honestly quite intrigued by how GCP has sought to differentiate itself from the other two providers by offering this kind of "plug and play" feel to the cloud. Certainly there is value in the absolutely granular service offerings of AWS/Azure, but there's a point where it starts to feel like all I'm doing is building control towers for island landing strips.
I just want my cloud provider's ML service to talk to the data lake on the same cloud tenant without having to architect my way through 15 network NICs, 30 service accounts, and 4 VDIs...
The entire deployment can be defined easily in GitHub Actions. Combine that with Cloud Tasks and a Cloud SQL Postgres instance and you have a near-infinitely scalable solution.
I ran a system like this where over 30k servers across 7 different data centers all over the US were hitting cloud function endpoints 24/7 at 30-50+ RPS, and I never had a single failure or outage over multiple years. Even better, the whole thing never cost more than about $100/month.
DigitalOcean is not wildly far off it either.
ECS + Fargate is the closest AWS has to it, but you need to do IAM and networking to utilise it. If you're in AWS already, it's pretty good, albeit with some frustrating limits.
Correct me if I'm wrong, but these are actually not close to Cloud Run. Cloud Run's differentiator is its scaling metric; it scales with incoming requests, and has strict configuration to ensure that each replica only handles N concurrent requests. You could maybe get something like this set up on ACI or Fargate, but it'd require stringing together five or six different products. You can definitely wire those up to autoscale on CPU%, but (1) that's not scale-to-zero, and (2) CPU% kinda sucks as a scaling metric, right? Idk, I've never been happy with systems that autoscale on CPU%.
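That request-based scaling metric can be sketched as a tiny formula: desired replicas is the ceiling of in-flight requests over the per-replica concurrency limit. A minimal Python sketch of the model (illustrative only, not Google's actual autoscaler; the concurrency value and replica caps are made-up defaults):

```python
import math

def desired_replicas(in_flight_requests: int, max_concurrency: int,
                     min_replicas: int = 0, max_replicas: int = 100) -> int:
    """Request-based, scale-to-zero scaling: each replica handles at most
    max_concurrency concurrent requests, so replica count tracks load directly."""
    if in_flight_requests <= 0:
        return min_replicas  # no traffic -> no replicas
    needed = math.ceil(in_flight_requests / max_concurrency)
    return max(min_replicas, min(needed, max_replicas))

print(desired_replicas(0, 80))    # 0 -- scaled to zero
print(desired_replicas(250, 80))  # 4 -- ceil(250/80)
```

Contrast with CPU%-based scaling, where idle-but-allocated replicas still report low CPU and the metric lags behind actual request load.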
It's usually not a huge step to add the RIC, but it's a bit more tied in to AWS than Cloud Run, which can run arbitrary Docker images, if I understand correctly.
Oh, the concept exists. I can make some infrastructure mostly-immutable, myself. But the cloud doesn't give me it out of the box. What the cloud gives me are APIs. If I write software to call those APIs, predict what the allowed values are, predict the failures I might see, write about 5,000 lines of code to handle the failures, attempt to reconcile differences, retry, store my artifacts, reference them, after implementing a build system, etc, I can get one or two things to be immutable. But for the vast majority of services it's actually impossible.
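To make the "reconcile differences" part concrete, here's a minimal sketch of the loop every configuration management tool runs, with a plain dict standing in for the cloud API. The real 5,000 lines come from replacing the dict mutations with API calls and handling all the ways those calls fail:

```python
def reconcile(desired: dict, actual: dict) -> list:
    """Compute and apply the changes needed to make `actual` match `desired`.
    Returns a change log; a real reconciler would call the cloud API for each
    change, retry on transient errors, and handle partial failures."""
    changes = []
    for key, want in desired.items():
        if actual.get(key) != want:
            changes.append(("set", key, want))
            actual[key] = want  # stand-in for an API call
    for key in set(actual) - set(desired):
        changes.append(("delete", key))
        del actual[key]
    return changes

state = {"versioning": "Disabled", "cors": ["*"], "stray_rule": True}
log = reconcile({"versioning": "Enabled", "cors": ["*"]}, state)
print(log)    # [('set', 'versioning', 'Enabled'), ('delete', 'stray_rule')]
print(state)  # {'versioning': 'Enabled', 'cors': ['*']}
```

The point of the parent comment is that none of this lives server-side: every client has to reimplement it against mutable APIs.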
Take an S3 bucket. Can you make an S3 bucket immutable? The objects inside it might be versioned, sure. Can you roll back all the objects in the bucket to version 123? Can you roll back the bucket policy to revision 22? Can you also roll back the CORS rules? Can you diff all these changes and see a log of them? Can you tell the bucket to fix itself back to the correct expected version of itself? Can you tell it to instead adopt 3 new changes, as part of a version of the S3 bucket you tested somewhere else? The answer is "no".
You can fake it with a configuration management tool like Terraform. But that's as immutable as a file on your filesystem. Any program can overwrite your files at any time; you have to have Puppet configured to monitor your files and constantly fix them when they get changed, track the Puppet code in Git, keep your own log of changes, etc. That filesystem isn't immutable, it's mutable! If it were immutable you wouldn't have to use Puppet (or Terraform). And the sad thing is we're all stuck on Terraform, which is actually terrible as a configuration management tool, because it mostly refuses to reconcile inconsistencies (the way every other configuration management tool in history has). It just bombs out and says "Oh shit, that wasn't a change I planned, and you didn't write this HCL code to handle this weird condition, so I'm just gonna bail and not fix this. Good luck getting production working again." Puppet wouldn't stop working if something other than Puppet updated a file. But nobody seems to mind that we literally regressed in functionality, because a company made up new marketing terms for their tools.
Sadly this desired built-in immutability, and the declarative nature of it, won't be built into S3 or other tools for at least a decade or two. They would need to effectively build something akin to K8s just to manage their own components immutably and expose an entirely new API. So we are doomed to do Configuration Management in the cloud, until the cloud starts implementing immutability out of the box.
Now more than ever, since I started making an effort to self-host much more than before: the amount of scripts I have to write just to achieve idempotency, never mind immutability, is staggering, and I'm already questioning my approach. I'll likely start making use of ZFS or Btrfs snapshots, or, I don't know, I'll just start snapshotting the entire filesystem on my Linux machines manually (e.g. store all dir/file paths with their sizes and modification dates; it's a start, and you can diff against such "snapshots").
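That poor man's snapshot idea fits in a few lines of Python. A sketch, with the caveat that a content change which preserves both size and mtime slips through (you'd add hashing for that):

```python
import os

def snapshot(root: str) -> dict:
    """Map each file path under root to (size, mtime)."""
    snap = {}
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            st = os.stat(path)
            snap[os.path.relpath(path, root)] = (st.st_size, st.st_mtime)
    return snap

def diff(before: dict, after: dict) -> dict:
    """Exact breakdown of what changed between two snapshots."""
    return {
        "added":   sorted(set(after) - set(before)),
        "removed": sorted(set(before) - set(after)),
        "changed": sorted(p for p in set(before) & set(after)
                          if before[p] != after[p]),
    }
```

Usage is exactly the workflow described above: snapshot, run the command you don't trust, snapshot again, and diff() tells you what it touched.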
I am just not comfortable running commands and having no idea what changed and where. It's insane that everyone just accepts this! I'm not okay with it; I want to see an exact breakdown of what changed, where, and how.
IMO working on this and bringing it to the mainstream is loooong overdue.
It's not apparent that problems X, Y and Z will be solved by immutability. Once it's applied everywhere, whole classes of problems just disappear. But until people see the problems disappear, they won't implement it. Catch-22.
You can build your entire app inside a plain HTML file which can be deployed online with something like GitHub pages.
I've built a few apps with it including a real-time chat app which supports both group chat, private 1-on-1 chat with an account system (with access control), OAuth via GitHub... The entire app is only 260 lines of HTML markup and fully serverless (no custom back end code). Access controls are defined via the control panel. All the app's code is in this file: https://github.com/Saasufy/chat-app/blob/main/index.html
You can try the app here (use the 'Log in with GitHub' link): https://saasufy.github.io/chat-app/index.html
Saasufy comes with around 20 generic declarative HTML components which can be assembled in complex ways: https://github.com/Saasufy/saasufy-components?tab=readme-ov-...
There is a bit of a learning curve to figure out how the components work but once you understand it, you can build apps very quickly. The chat app only took me a few hours to build.
I've also been helping a friend to build an application related to HR with Saasufy and I managed to get the basic search functionality working with only 160 lines of HTML markup.
Milliseconds is now possible: https://kraft.cloud/ (e.g., an NGINX web server in under 20 millis).
The basic idea is that FaaS is a leaky abstraction because (a) lots of runtimes are slow to start up and (b) isolation tech isn't good enough. So FaaS services start up VMs and containers and then the user's function, which might have to do a lot of init work, like loading reference data, and because that takes too long you have to keep idle capacity around. At that point the abstraction is broken.
So there's a two-part fix:
1. For Java users, the GraalVM native-image tool can pre-initialize and pre-compile a JVM app so that it starts up instantly (including with pre-loaded reference data).
2. Change the isolation model so VMs and containers don't need to be started up anymore. Containers alone can take hundreds of milliseconds to start.
There's also some interesting stuff there that takes advantage of Oracle Cloud's more "edgey" nature than other clouds, where it has more datacenters than others (but smaller).
The new isolation model works by exploiting new hardware features in CPUs that allow for intra-process memory isolation (Intel MPK), combined with hardware-enforced control flow integrity. This requires compiler support, but GraalVM knows about these features, so the cloud can just compile JVM apps to native for you. And what about other apps? Well, many languages run on GraalVM via Truffle, so those are covered (e.g. JavaScript), and for native code you can use a modified LLVM to compile and then do a static verification of any user-supplied binaries, like NaCl used to do.
If you put those things together then starting user code that's already available locally becomes just mmapping a shared library into a process, which is extremely fast. It can only exit the hardware/software enforced isolate by going via a trampoline that's equivalent to a syscall, but without needing an actual syscall. The Linux kernel isn't reachable at all.
With that you can have functions that start and stop in milliseconds.
If you're a naysayer in the comments, I would encourage you to go give it an honest try, and consider again why you think infra has to be done in harder ways.
Finally, I believe simple configuration can coexist with code.
P.S.: At dstack, we are building an open-source platform to manage AI infra – a more lightweight and AI-friendly alternative to Kubernetes.
Software Infrastructure 2.0: A Wishlist - https://news.ycombinator.com/item?id=26869050 - April 2021 (195 comments)
If something is "magically" easy, it either is a meaningful design/algo revolution or it overpromises the production case while showing off the trivial. Most of the time it's #2. Docker was #1.
I'm encouraged by the same ideas.
Sometimes you just want something that stays running and doesn't go down and can scale to zero and scale upwards, ideally with revenue.
I kind of want a special mega HTTP form endpoint for which I can define a pipeline: one that can write to the database, kick off background jobs, and feed into a mega API automatically.
- Cloud Run did a good job, but the autoscaling is too slow to avoid paying for idle
- Lambda is great, but I want to run way more complex workloads than simple functions
and how Modal exemplifies a lot of the ideas he's been looking for. Check it out, including our show notes!
> We are, like what, 10 years into the cloud adoption? Most companies (at least the ones I talk to) run their stuff in the cloud. So why is software still acting as if the cloud doesn't exist?
> As in, I don't want to think about future resource needs, I just want things to magically handle it.
'nuff said.
That said, it's still not as trivial as using managed SaaS but it's still easier than ever to basically spin up your own cloud of sorts, using the wealth of open source tech out there. K3S on Hetzner can do a pretty solid job for cheap. In that sense, the ecosystem around running your own cloud is only improving.
So, here are some thoughts on what seems to be the key points of the article:
* I want to go fast.
Well... yeah, sure, why not... but it's not very important. Lots of other goals will overshadow this one. Also, if we are talking in the context of whatever-as-a-service, there's very little incentive to work on the speed aspect as long as it isn't taking ages.
Also, reducing infrastructure to whatever-as-a-service is seriously hollowing out the definition. I've been in ops/infra for over a decade, and I've barely even touched the as-a-service aspect. Whenever I do come in contact with it, it's always awful, and I want to get away from it as fast as possible. Making it go faster won't help that, though. The disappointing parts are poor documentation, poor support, proprietary tech, overly narrow scope, etc.
* Testing in production
Why is this even a relevant issue? Anyways. OP needs to take a trip to the QA department; they obviously don't know why they have one. But it's also possible their QA department is worthless (ours is...). Still, having a worthless QA department isn't really something to wish for in Infrastructure 2.0. I don't see how this is a good goal.
So, the reason a QA department is necessary, and why CI can cover only a fraction of what can and should be done with testing, is that QA, among other things, needs to simulate plenty of different possible conditions in a controlled environment to be able to investigate and diagnose problems. Most of QA's work is spent on RCA, and then on figuring out how to present the problem, stripped of all unnecessary components, to the development team so it can be fixed. It's not possible to do good QA w/o the ability to isolate components, which calls for the creation of fake/artificial environments that are not like production.
* Calls to unleash the next order of developer productivity
This is such an MBA b/s... Just give it a break.
For you. For me, having to tinker with a repo full of YAML files just to get a Kafka topic provisioned (as happened to me this week) can kill, and has killed, motivation, to the point of not working at all for a day or two afterwards.
This stuff should be blindingly obvious, to the point a trained monkey should be able to do it.
I have the feeling that many agents are working against such a goal though. Vested interests and all.
You even kinda sorta agree with me by qualifying your statement with this, right after the previous quote:
> Also, if we are talking in the context of whatever-as-a-service, there's very little incentive to work on the speed aspect as long as it not taking ages.
Maybe to me time_it_should_take == X and to you X times 3 is fine, but in the end the brain schema is the same: have it take LongEnough™ (a subjective value) and the person responsible simply checks out mentally.
If I were a CTO or an IT manager I'd be very worried about stuff like this.
> But having a worthless QA department isn't really something to wish for in Infrastructure 2.0. I don't see how this is a good goal.
This is IMO not at all related to the article, nowadays QA depts are removed either because leadership wants to save money or because iteration would grind to a crawl, and many businesses need the next feature the next Wednesday. Nothing to do with infra management I'd think.
Though don't get me wrong, QA is hugely important per se. But I wonder if proper end-to-end automated frontend testing (e.g. with Playwright) won't eventually make them truly extinct. Who knows. I don't.
> This is such an MBA b/s... Just give it a break.
I'll always despise MBA speak but the point of programmer productivity is important. I have no problem churning out features and fixing bugs but give me a slow bureaucratic process and you'll find out what it's like to pay a salary to somebody who pushes to the GitHub repo 5 times a month with diffs like +30-20.
This is understandable, but this isn't about speed. Many YAML files may result in high provisioning speed or low provisioning speed, after all they only give instructions to the program doing the provisioning.
You could legitimately complain about the choice of YAML as a platform for infrastructure configuration for several reasons, like:
1. No built-in ability to describe templates. Lots of infrastructure wants some sort of polymorphic configuration, and when infra developers chose YAML to configure it, they didn't account for that. So instead they use various template engines that bolt polymorphism onto YAML. This was also indirectly mentioned by OP.
2. Poor structure, especially at large configuration sizes. It's easy to accidentally write something you didn't intend, and it's hard to search.
3. Being JSON in disguise, it inherits a lot of JSON's problems. Marshaling richer types and structures of data in and out of the program is severely impacted by the format's primitive and inflexible type system.
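Point 3 is easy to demonstrate with the standard library: the JSON data model underneath has only strings, numbers, booleans, lists, and maps, so richer types either fail to serialize or come back flattened (YAML adds a few tags of its own but has the same problem for application-level types):

```python
import json
from datetime import date

config = {
    "deploy_after": date(2021, 4, 1),  # a real type in the program...
    "ports": (8080, 8443),             # ...and a tuple, not a list
}

# A date doesn't survive serialization at all.
try:
    json.dumps(config)
except TypeError as e:
    print("dates don't fit the format:", e)

# A tuple survives, but comes back as a list.
flattened = json.loads(json.dumps({"ports": (8080, 8443)}))
print(type(flattened["ports"]))  # <class 'list'>
```

Every tool in the chain ends up re-encoding such values as strings and re-parsing them by convention, which is exactly the kind of marshaling friction described above.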
But, again, this isn't speed. This is just a different set of problems.
> If I were a CTO or an IT manager I'd be very worried about stuff like this.
Practice shows this is mostly irrelevant. It's hard to reach the point where provisioning speed starts to hurt so much it impacts business decisions. For instance, provisioning in MS Azure is on average twice as slow as it is in AWS. (And deprovisioning is probably four times as slow.) And nobody cares. So many other concerns will overshadow this particular aspect, that you'd feel uncomfortable to even bring it up, if you had to choose between two service providers. Primary driver is cost of running the infrastructure for a long time, overall as a system. Starting time does contribute to the total, but unless your business requires very frequent allocation and deallocation of resources, this won't make a difference. Also, cloud vendors don't bill you for the time that the infrastructure is being brought up, so, it's really hard to make a compelling case to choose the fast-to-provision infra over the slow one just based on that aspect alone.
It has crossed my mind several times recently that I want a word to describe this exact state of affairs. Where a thing has a defect so blatant that it is evident to any user that the creator of the thing has never tried using it.
Eg. an airbnb with no towels in it.
What's the word for this situation?
Otherwise it’s called an MVP and a promise of plugging the holes
In many cases it's "you are hired into this job, this is the tool we give you, if you don't like the tool, take a hike".
Even more so, a lot of software is developed not to be competitive, but to be exclusive. It's a lot easier to be the only choice for doing something than to compete with a different tool. I've seen countless examples of tools developed in exactly this paradigm, where the decision to use the tool wasn't made by anyone anywhere close to the users of the tool (e.g. a hospital procurement department buying a PACS, or a large avionics company ordering a custom-made budget-management program).
Most crappy software exists because of inertia and corporate policies. If people truly had a choice stuff like MS Teams could be phased out by the end of the next quarter.
When I have to describe to people who don't work with me my interactions with developers (especially of the crappy code like that) from a standpoint of someone who represents the QA side of things... I describe to them my interactions with my five y.o. son:
Me: How was school?
Son: Goooood!
Me: Did you behave?
Son: Yes!
Me: Did the teacher send you into timeout?
Son: Yes...
Me: So how come? You told me you behaved... What did you do?
Son: Played with Ryan!
Me: That doesn't seem like a good reason to send you into timeout.
And we go like this until I either discover that he was yelling in class or never learn why he was in timeout. This is also the pattern of denial I very frequently face when talking to the programmers who wrote the crappy code. Somewhere in the back of their minds they understand that they screwed up, but they will come up with all sorts of concocted reasoning to pretend that they either don't understand why the product sucks, or claim that it cannot be made any better, or attack me for not understanding how the product is supposed to work, etc. The most recent example would be (in slight adaptation):
Me: I discovered that we set the PYTHONPATH variable when loading a (Tcl) module.
Dev: I see no problems with that.
Me: The new feature we are releasing to the users is conda support. Conda will not work (well) when this variable is set.
Dev: Did the documentation tell users to load this module?
Me: No, but it's obvious that users would want the functionality provided by the module in addition to using conda. They are made to complement each other. Besides, the documentation doesn't say they shouldn't.
Dev: (summons PM)
And then the PM continues in the same spirit as the developer. My guess is that the reason is that nobody really wants to work too hard. There's no reward in making a better-quality product if that quality isn't immediately appreciated. Features like latency, throughput, size, etc. are immediately visible to the user and are an easy sell. Features like internal consistency in the face of more sophisticated usage might never matter, and the user might never know that they were protected from their system collapsing on them by a substantial development effort. So commercial companies de-prioritize quality. And that's how we get crappy programs.

There is certainly a lot of that, but it gets even worse: in many companies you get actively punished for doing good work. You end up making other people work: asking managers around for product requirements (which are of course barely written down anywhere, if at all), or reminding that sysadmin that they half-arsed the deployment and now must add another k8s resource, or asking another dev why they did X with the Y library... You want to make sure not to screw something up, but you just end up annoying them.
And sadly these things get brought up in meetings. And many over-zealous managers will scold you because they don't like the boat being rocked (even if they would actually welcome your initiative; but that assumes they'd have made an effort to understand the situation, which is not a given).
It's no surprise that many talented people just end up checking in, doing the bare minimum, and clocking out. The equation is extremely easy to solve: "work 3X, get scolded, get no promotions, accumulate hostility from colleagues" vs. "work X and have peace and quiet".
Nothing anyone does with software will help.
> I'm not asking for milliseconds! Just please at least get it to less than a second.
What do we measure "less than a second" times in?
> I can set up a static website in AWS, but it takes 45 steps in the console and 12 of them are highly confusing if you never did it before
Anything can be confusing and take time if you've never done it before. Getting productive takes time and practice. If your goal is only to set up a static site, AWS is overkill.
> It's sad this is the current state of infrastructure.
It’s sad that some people still haven’t learned to pick the right tool for a problem.
> I could go on, but I won't. I'm dreaming of a world where things are truly serverless.
I don’t even understand what the author wants here. There is no such thing as “truly serverless”. Your code will be executed by a server. Period. Serverless is just a fancy marketing term for ephemeral lightweight VMs.
> If I make a change in the AWS console, or if I add a new pod to Kubernetes, or whatever, I want that to happen in seconds
The author obviously doesn’t have any knowledge about distributed systems.
> My deep desire is to make it easy to create ephemeral resources. Do you need a database for your test suite? Create it in the cloud in a way so that it gets garbage collected once your test suite is done.
Fortunately we have Terraform that’s made this possible for a decade(?).
> Code not configuration
Terraform, Pulumi, countless of client libraries for all of the cloud providers.
If you wouldn't mind reviewing https://news.ycombinator.com/newsguidelines.html and taking the intended spirit of the site more to heart, we'd be grateful.
Well, he built https://modal.com , one of the coolest things since sliced mangoes, and before that https://github.com/spotify/luigi
This is nit-picky. "Serverless" refers to the "dev", not the "ops", and has done for a while.
> Fortunately we have Terraform that’s made this possible for a decade(?).
Setting up production-grade DBs in Terraform is easy?
The author does make some weird arguments and seems to be creating an emotional setting for something, like his own product you guys mentioned.
My pods ARE ready in seconds. Wondering why his are not.
Oh, yes, it is. Setting up the resources is actually the easiest part; most of the problems originate from the phenomenon that as developers start to use more and more "serverless" things, they know less about how the underlying technology works: how to use indexes, how to structure the database, how replication or transactions work. Production readiness is not just how a resource is configured. It is about how the application uses a resource efficiently.
> This is nit-picky. "Serverless" refers to the "dev", not the "ops", and has done for a while.
There is no "dev" and "ops" serverless. Your application will run on one or more CPUs and will use memory, disk, and the network. When you write the application, all of these matter: memory management, network communication, CPU caches, parallel execution, concurrency, disk access. It does not matter if you call it serverless, cloud, bare metal, etc. The basics are the same.
Honestly, the experience of building Beaker Studio made me bearish on AWS. They price gouge and the DX is so bad teams pretty much need CDs. Once I get the time I want to update Beaker Studio so people can deploy to any old Linux box instead. Teams deserve so much better than AWS/Google/Azure.
The author says what they want. It's literally their next sentence:
"As in, I don't want to think about future resource needs, I just want things to magically handle it."
and they have four bullet points with examples of what this means to them earlier.
I think it's fair to argue about the desirability, achievability, etc of this. I don't think it's fair to act as if the author is just spewing buzzwords without explanation.
- Why do I have to think about the underlying pool of resources? Just maintain it for me.
- I don't ever want to provision anything in advance of load.
- I don't want to pay for idle resources. Just let me pay for whatever resources I'm actually using.
- Serverless doesn't mean it's a burstable VM that saves its instance state to disk during periods of idle.
This article was written in 2021.
AWS Lambda, introduced in 2014, fulfilled all of the requirements in those bullet points you mentioned. Google App Engine is the same; it was introduced in 2008.
So again, this article tells only one thing: that the author does not know what he is talking about.