I'm skeptical - I was under the impression that serverless was for small "burstable" apps with relatively low traffic, or background processing.
The two products I work on are both REST APIs that send and receive data from a user interface (react) with roughly 60 API routes each. They have about 100 concurrent users but those users use the apps heavily.
The consensus on the internet seems to be "serverless has its use cases" but it's not clear to me what those use cases are. Are the apps I'm working on good use cases?
Good news: your gut feeling is correct.
Bad news: you will likely lose this battle, unless you're good at playing company politics.
Here's how it typically goes:
1. A new lead/architect/manager joins the company.
2. They push for a new hyped technology/methodology.
-> you are currently here <-
3. The team is split: folks that love new things embrace it, folks that care hate it, rest are indifferent.
4. Because the team is split, the best politician wins (usually the new hire).
5. The switch happens, and internally you know it's a fucking disaster, but you're still forced to celebrate it.
6. When the disaster becomes obvious, people start getting thrown under the bus (usually junior engineers or folks who opposed the switch).
Or, which seems more likely but still just as bad: someone chasing "successfully redesigned the infrastructure at the scale of the entire company" for their promo packet and resume.
Whether it actually improved the infrastructure is of no concern to them. Not needing to understand the existing infra in depth to make it happen, and the added job security (from making the infra more complex and confusing for the rest of the engineers), are just a cherry on top.
Call it a cynical take, but this is one of those situations where "the simplest explanation is probably what actually happened" feels about right.
Also don't underestimate someone that is smart and well intentioned but dangerously overconfident and completely unaware.
These may be the toughest for manager types to sniff out because they believe their own hype.
"Does this solve a problem we have?"
"Of the top five things that we are trying to build, does this help us deliver any of them?"
Often the problem isn't that the new method/tech is bad, but that all the effort spent transitioning to it could be better spent directly attacking the goal in the first place.
"Of course it does: engineers will be more productive, we'll have fewer bugs and better performance. {TECHNOLOGY} has been around for {N} years and is getting more and more traction. Do you want us to be ahead of our competitors or not?"
"You need to understand that sometimes you have to sharpen your axe before cutting the trees. Have you heard the phrase 'work smart, not hard'?"
"So essentially you're saying we shouldn't change anything despite having issues (bring up any bug/downtime you had recently). John, sometimes you have to escape the comfort zone and learn new technology."
I once worked with a person who talked like this in front of a non-technical founder. He sounded quite convincing; it took me a while to cut through the bullshit.
Rather, I'd go for:
"Let's solve this problem we've been having with Azure Cloud Functions!"
"Of the top five things that we are trying to build, let's build them with Agile!"
Otherwise, you are not a team player, part of the problem not the solution, etc. It sucks, but that's politics.
That said, you can figure out (just ask!) if this person was brought in as the result of an already-made decision to move to this type of development, or if this person is pushing it. As an EM, I'd have no problem with a report directly asking that question.
If the answer is that the decision was made and this hiring was a downstream result, then your choice is likely to deal with it or move on. If the decision hasn't been made and this architect is advocating for it, then a cost-benefit analysis (which you'd think would be a standard part of any work) will be illuminating. My understanding of the debugging / monitoring tooling for serverless is that it's still quite raw. My bias, though, is against wasting innovation tokens on things like serverless; I align strongly with the "choose boring technology" view advocated by Dan McKinley [1], amongst others.
My $0.02, having used serverless before. Those use cases are:
* Very very low traffic apps. POST hooks for Slack bots, etc. Works well!
* Someone who is an "architect" can now put "experience with Serverless" on their CV and get hired somewhere else that is looking for that keyword in their CV scans.
Those are any and all use cases.
This is absolutely a huge step backwards in terms of architecture. And methinks this architect doesn't have the broadest understanding of Azure and is reaching for the easiest tool in the toolbox.
Edit: "serverless" is a broad term. There's a difference between shoehorning every microservice into a lambda-like service (bad), and migrating from a collection of VMs to a managed Azure service that does the exact same thing (can be good).
Then, the email verification is checked by a second function. You click the link, it updates a field in the DB and redirects to the main page.
Login is the same: we validate the creds and give back a token.
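A minimal sketch of a verification function like the one described, with the DB write injected so it stays testable outside the cloud. The event shape follows API Gateway's proxy format; everything else (names, redirect target) is made up:

```python
# Hypothetical email-verification Lambda: pull the token from the query
# string, flip the "verified" field via the injected mark_verified callable
# (a stand-in for a real DynamoDB/SQL update), then redirect to the main page.
def make_handler(mark_verified):
    def handler(event, context=None):
        token = (event.get("queryStringParameters") or {}).get("token")
        if not token or not mark_verified(token):
            return {"statusCode": 400, "body": "invalid or expired link"}
        # Field updated; 302 the browser back to the main page.
        return {"statusCode": 302, "headers": {"Location": "/"}}
    return handler
```

Injecting the storage call is just a convenience for the sketch; in a deployed function it would be a direct client call.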
And then the only infra we maintain is the stuff that's always on and serving mass requests. There's a lot of stuff that doesn't run that often, is more or less a single function, and running a fleet of redundant machines for it doesn't really add value.
It just depends on how your app is designed as to whether it makes sense.
Scales down to 0, so effectively no cost overnight when used infrequently, and if you use a scripting language like Python or JavaScript the startup times are in the ms for a new instance when scaling up.
I would use K8s when there is heavy processing involved, where you want a language like Java or C to do the heavy lifting (image processing, encoding, or encrypting) and the work is long-running.
Disagree with serverless only being for low traffic; I've used it for a fair amount of high traffic situations and it is great for scaling up quickly.
AWS Lambda and Google Cloud Functions support containers.
> and would be long running.
Unless it's running 24/7, you're better off with serverless batch processing systems, or you'll need scale-to-zero on your Kubernetes cluster.
I would add, that it offers an extension point / hook-in for a variety of AWS products. I can add custom functions to be called when a Cognito user is created, updated. I can add custom functions to be called to authorize routes transiting API Gateway.
At "The Firm" - we're going heavily in on AWS for our architecture - Fargate for k8s on top of the usual RDS, S3, etc.
We've begun roll-out of an external-facing API authenticated with OAuth2. Our backing store here is Cognito and our routing happens via API Gateway. We use the API Gateway's baked-in Cognito token authorizer for incoming requests, which route to pods hosted in Fargate. These pods represent a variety of projects.
As such we have two Lambda functions providing the /oauth/secret and /oauth/token endpoints for exchanging a One-Time-Password for a secret token, and a secret token for access tokens. They see very low call rates (dozens of calls per day) and don't fit into any given API we host via Fargate.
For example:
- resizing images - we can do a size per lambda all in parallel. This means we can process images quickly (with minimal latency) without having to have loads of slack memory & CPU on our backends
- queue processing - we have an app that needs to copy files from user provided URLs to S3. We do this by dumping them in a SQS queue and having a lambda fire for each queue item. Means we can do lots in parallel without filling up an EC2/fargate instance’s network port
- dynamically processing images using Lambda@Edge & Cloudfront - similar to my first one, but on the fly when requested, instead of ahead of time
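The queue-processing pattern in the second bullet can be sketched roughly like this. `fetch` and `put_object` are injected stand-ins for `urllib`/`boto3` calls, and the message fields (`url`, `key`) are assumptions, not the commenter's actual schema:

```python
import json

# Hedged sketch of an SQS-triggered copier: each record body carries a
# user-provided URL plus a destination key; the function downloads the
# bytes and writes them to object storage. Lambda fires one invocation
# per batch, so many copies run in parallel without saturating one host.
def make_sqs_handler(fetch, put_object):
    def handler(event, context=None):
        copied = []
        for record in event["Records"]:
            msg = json.loads(record["body"])
            data = fetch(msg["url"])           # e.g. urllib.request.urlopen(...).read()
            put_object(msg["key"], data)       # e.g. s3.put_object(Bucket=..., Key=..., Body=data)
            copied.append(msg["key"])
        return {"copied": copied}
    return handler
```

In production you'd also want per-record error handling so one bad URL doesn't poison the whole batch.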
* Good for "minimally dynamic" apps
* Easier in compliance/regulation-heavy environments (the kind that impose burdens on OS configuration and maintenance)
* Like you said, great for low traffic apps (the kind of things we usually build, toss over the fence to another department, and then never look at again)
* Also like you said, resume driven development
We were hosted on AWS, and pretty much only used Lambda + API Gateway + DynamoDB.
Lambdas and workflows are like queues: if you only have a few of them, just use Lambdas; once you have a lot of them, switch to a workflow engine.
I am an old school kind of engineer, and if you tell me your web requests take 2s instead of 20ms due to an architectural decision you made that doesn't have any other strong upsides, I would agree that that is insanity.
We've used AWS Lambda for about 4 years, and it's been so good and so cheap that I'm shifting literally everything (except Redis) to serverless. Also, GCP has a better serverless offering (Cloud Run, Spanner), so we're switching from AWS to GCP to take advantage of that. I bet we're going to see a massive cost reduction, but we'll see.
Things I like about serverless (again, from the perspective of a very small startup, with 5 engineers, and me being the primary architect):
* It's so liberating to not worry about EC2 servers and autoscaling and container orchestration myself. All our CloudFormation templates add up to around 3,000 lines, which maybe doesn't sound like a lot, but it's a lot. There are tons of little configuration things to worry about, and it adds up. (Not to mention the sheer amount of time it took to learn.) ECS Fargate takes care of some of this, but it doesn't autoscale based on demand or anything (not without setting things up yourself). (This is a big reason why I want to switch to GCP: Cloud Run is like Fargate in that it runs containers, but unlike Fargate it autoscales from 0 based on load.)
* It's very cheap in practice, at least for loads like ours that respond to events: API services that sometimes see a lot of use and sometimes very little; queue consumers that sometimes have a lot to do and sometimes very little. AWS Lambda bills at millisecond resolution, and GCP Cloud Run/Cloud Functions bills in 100-millisecond increments. These are very fine resolutions, and for us at least, costs have been small.
* For database serverless products (like DynamoDB for example), it's very liberating to never have to think "Hm, do we have enough CPU provisioned?"
Things I don't like about serverless
* Pushing source code sucks. Lambda will just one day decide your version of Python or whatever isn't good enough and force your customers to upgrade all their user-written code to the latest Python version. (But! Cloud Run supports containers, and so this won't be a problem.)
Every team I've known that adopted Lambda + DynamoDB (or equivalents) gave up on running their app locally, adding a lot of friction to the development process.
If you need more complex cases like deploying Docker containers to a Lambda function, take a look at AWS's SAM library. It also supports local dev _and_ makes deployment easy (it's essentially a wrapper around CloudFormation, so it's very powerful).
I'm in the early stages of this rearchitecture but so far I've had no difficulty with local development.
We used GCP at our previous startup (sold) and ran our own K8S when it was very new (2015). There were lots of pains in those days. So when we started our current startup in 2018, we started with App Engine (flexible, which supports containers). This was fine, but with lots of drawbacks. After a year or two we ended up back on K8S, using GCP's GKE (managed K8S). Our team is pretty good with K8S, so it was fine. But regardless, the little stuff adds up.
Fast forward to about 6 months ago. We had used GCP's Cloud Run off and on for little stuff, and it kept getting better. One day someone asked the question why we shouldn't just use it for everything. Everyone was a bit defensive, but we kind of stared at each other and couldn't think of great reasons (for our use case), so we tried it.
Our setup consists of a primary API service (Golang), and a dozen or so smaller microservices, mainly in Python. We even moved most of our React apps to cloud run.
6 months in, and I can't really say anything bad. We turn off scale-to-zero for the services where it matters. It scales up quickly under load, zero downtime over 6 months, no troubleshooting (so. much. time. saved.), super easy to deploy, swap traffic between versions, etc.
I'm not saying it's a silver bullet, nor that it's perfect for everyone... but I couldn't say enough good things about _container-based_ serverless like Cloud Run.
That said, breaking big systems down to the function level (Lambda, GCP Cloud Functions, etc) sounds like a nightmare to me. I'm sure there are ways, but that's a different ballgame. We do use FaaS for some tasks.
YMMV.
Edit: Oh, and our hosting bill went from ~$5k a month to $500 a month (partly due to other things, but primarily no longer needing big node pools).
I am starting to evaluate AWS vs GCP for serverless. What, in your opinion, makes GCP better? Is the comment in the context of containers or functions?
GCP Cloud Run is like the best of both worlds between AWS ECS Fargate and AWS Lambda. (Yes, the comment is in the context of containers. Sort of.)
* Like Fargate, Cloud Run hosts containers and takes care of figuring out where they actually live. Unlike Fargate, you don't have to say exactly how many containers you want running at once; GCP will automatically scale the # of containers up and down based on HTTP load and will scale down to 0. This should make Cloud Run cheaper than Fargate. (If you want to hook up Fargate to a webserver and you don't have autoscale figured out, you'll have to keep a lot of workers alive doing nothing.)
* Like Lambda, Cloud Run bills by the amount of time spent processing at least one request. But unlike Lambda, Cloud Run lets one container handle more than one request at a time (it sucks to have to spin up a lot of Lambda invocations that do a bunch of IO). Web servers that are good at concurrency shine here. This should save money.
* Cloud Run has more generous limits in many respects than Lambda. Cloud Run lets you set up SIGTERM hooks, so you can do some cleanup logic in your container (to e.g. write performance data to a timeseries table or whatever).
That's Cloud Run. On the database side: GCP Firestore is very interesting, and we're going to build a big feature around it. AWS has nothing like it. On the queue side of things: we're planning to build around GCP Cloud Tasks; we've more or less built Cloud Tasks ourselves using a mix of MySQL and AWS SQS (and it was hard, and we haven't done a good job).
I'd love to start a Discord or something to discuss these thoughts more. It's so hard to get good practical information for system architects/CTO types who just need to hammer stuff out.
Been working on a lambda, dynamodb, typescript react app for about 6 months for work and it's been just mind-melting how much money and complexity we've saved switching. I'm talking like 5x the cost drop and we can more easily onboard devs to the project because it's just simpler, no need for a devops hire honestly.
The only thing that comes to mind better isolation between parts of the app, but this could be achieved with any architecture if done right.
A new company called Momento just launched that offers a serverless cache. Might be of interest to you.
The deployment story looks good though, with the `sam deploy`, but for now I can't get over the developer experience.
Also, don’t put any logic in the handler itself unless it’s extremely trivial.
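A hedged illustration of that rule: the handler only translates the event and delegates, so the business logic stays an ordinary, unit-testable function. All names here are invented for the example:

```python
import json

def summarize_order(items):
    # Pure business logic: framework-free, unit-testable, reusable outside Lambda.
    total = sum(i["qty"] * i["price"] for i in items)
    return {"count": len(items), "total": round(total, 2)}

def handler(event, context=None):
    # Thin adapter: decode the event, delegate, encode the response. Nothing else.
    items = json.loads(event["body"])["items"]
    return {"statusCode": 200, "body": json.dumps(summarize_order(items))}
```

The payoff is that tests and local debugging hit `summarize_order` directly, and moving off Lambda later only means rewriting the adapter.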
Everywhere else, I went with traditional deployment of a monorepo.
You only have 100 concurrent users. You do not need serverless. You could serve 100x that amount easily with a simple nginx reverse proxy to your webserver.
This is almost comical enough for me to suspect that it is satire, but unfortunately there are too many examples out there of this type of thinking. It's just infra bloat and a waste of money.
What problem does migrating to a new architecture solve? Does the current deployment have scaling, maintenance, or other troubles? Going from something that's broken to something that works is one thing, but going from something that works to something else that works is pointless unless there are tangible benefits.
If the only reason is to make the Cloud Architect feel better or pad people's resumes, the correct answer is no.
I think serverless is good in some areas: S3 and Dynamo are both good products, for example.
I have a few big issues with serverless: 1. It is harder to develop for. Sure, you get to ignore server configuration, but honestly a well-run infra team should be removing that concern for the development team anyway. The problem is that when you are running it locally, you so rarely can actually run the code. So setting things up is annoying, especially when you get into the final stage of serverless, where some object lands in SQS, which fires a lambda, which puts another object in another queue, which fires another lambda, which loads from S3 and writes to a DB, etc. This all ends up more complicated than just writing an application for it, and it's harder to develop for. Often the only way to actually run this stuff is setting up the whole infra in the cloud and running it through that, which means dealing with deploys, and you lose a lot of debugging ability.
2. It doesn't save money. A single lambda that runs quickly on some event does save money versus a server running all of the time. But most companies seem to over-provision servers anyway, because that's easier. And once you include the prod environment, the dev environment, and the serverless pieces that run all of the time, it does not save money, since often 100 lambdas could be a single instance.
3. It doesn't save time. Developers messing around with setting up hundreds of new services and the corresponding rules, configurations, deployments, and CI/CD pipelines doesn't save dev time versus a normal, well-maintained infra with servers and a good CI/CD pipeline. The time savings usually cited are versus manually configured bare-metal servers, but there are better ways to save that time.
You should never need to leave your IDE to complete a change > execute > change inner loop. https://modal.com/docs/guide/ex/hello_world
Key concerns:
1. You're locking your platform into one supplier permanently. The exit fee is starting again from scratch: you literally burn it to the ground.
2. You're going to introduce problems when you migrate it. The ROI is negative if you spend money and achieve more bugs without improving functionality.
3. The cost estimation of every pure serverless platform is entirely non-deterministic. You can't estimate it at all even with calculators galore.
4. The developer story for serverless applications is quite frankly a shit show. Friction is high, administrative cost is high, and the tooling is usually orders of magnitude slower and more frustrating than local dev tools.
5. It's going to take time to migrate it which is time you're not working on business problems and delivering ROI to your customers.
As always ask yourself: is the customer benefiting from this now or in the future? If the answer is no or you don't know, don't do it!!!! Really sit down, find a sound business decision analysis framework and put all the variables in and watch it melt instantly.
All you're going to do here is put a "successful" (pah!) project under the architect's belt before he pisses off and trashes someone else's product.
As a somewhat extreme opposite of this, I would at this point never allow my cloud estate to progress past portable IaaS products and possibly Kubernetes control plane management. Anything else is a business risk.
* Scalability is real. We have some bursty traffic, sometimes with extreme burst, and we've had no problems scaling to meet that need.
* Our traffic is still predominantly during business hours in the U.S. That's an extremely important point, because our site is effectively being used for only 12 hours or so per day. The remainder of the day, and on weekends, it's unused. We looked at the cost of using EC2 instances and Elastic Beanstalk, and full serverless is still cheaper.
What we've discovered in our cost analysis is that if you have a site that's hit 24 hours a day, 7 days per week, then you'd be better off hosting on EC2. If your traffic is constant and there's not much variability over time, then it may make more sense to host on-prem. In our case we have highly variable traffic during standard business hours. Serverless is the way to go for that scenario.
I think most people don't realize just how "burstable" their own traffic is. If you're looking at graphs with one-hour resolution, remember that AWS bills Lambda in 1 ms increments. Not sure about Azure, though.
It was in EdTech, so we had students downloading assignments, uploading results (lambda-fronted S3 for blob storage, DynamoDB for data), administrators paging through results, grading things, students uploading images and videos of themselves, administrators reviewing them, many lambdas being triggered by changes to S3 or DynamoDB tables, or SNS messages sent from other lambdas.
I don't think most people would consider it especially bursty, except in the most general sense around midterms and finals, but in truth even a "heavy user" is only hitting APIs every so many seconds at most, and like I said, AWS lambda bills at the millisecond level.
My company has moved several "regular" websites to serverless. In fact, we just took the existing websites (which were often Django, sometimes huge ones) and dumped them into Lambda. The exact opposite of every "what serverless architectures are for" article you've ever read. And you know what?
It's awesome.
It's way cheaper than running it on EC2, and I never have to reboot a server or worry about their disks filling up or anything. Then when traffic spiked hard during the covid lockdowns? I did nothing. Lambda just handled it.
The only serious change we made to the setup was preloading a percentage of machines at all times to remove cold starts.
I'm not saying it's trivial (zappa, serverless, CDK), but usually one guy gets it working and the rest of the dev team changes nothing at all.
Serverless functions are great if you have a lot of small services that need to be "on standby at all times".
For example, if you have 5,000 separate services, it doesn't make sense to have them all running all the time when 4,000 of them have very low traffic. So one of the main benefits is the ability to increase your library of services at a very low cost. Serverless also really shines with quick stateless actions.
However, converting an app to all serverless is a huge task and for most apps it doesn't make sense.
Two major drawbacks:
1) You're bound to only the language versions that are currently supported.
2) You're writing code specifically for the platform so without a heavy lift you're "locked in"
If the goal is to go serverless and get rid of the server management, I'd suggest looking into containerizing your existing apps and deploying on a "serverless" managed service like Fargate (or your favorite cloud provider's equivalent). This approach also lets you move to a different cloud provider if you want... or even move back to your own datacenter with no code changes.
Just don't get attached to company and don't work late.
Whether or not it's for you has a lot to do with what's important to you. How much weight you put on runtime cost, versus ease of development / deployment, versus whatever other benefits it might bring (integrated logging, monitoring, multi-region deployment, etc). And, of course, whatever downsides it brings...like cold starts, team ramp up on how it works, etc.
You are right that your use case doesn't make sense for going full serverless. An application with heavy and predictable usage doesn't gain anything by becoming serverless. All you're doing is raising your cloud hosting bill.
You can scale down app service/VM based infra either on a timer or in response to metrics, so it's not like serverless is your only option if cost is the motivator.
The question I'd have is: what is the driving force behind the initiative to move to Functions? If the answer is "reduce infrastructure costs", I'd ask serious questions about whether the juice is worth the squeeze, and then create estimates for the cost of the transition versus the cost savings. For a 100-user app it is likely not going to have much payoff unless your infrastructure bill is a lot higher than I'd expect ($10-12k).
However, if the answer is "to create a better integration architecture between our apps and services" then you should engage with that. Azure Functions pushes developers really hard toward creating APIs that are discoverable and reusable, especially in a Microsoft oriented enterprise where you start seeing other tools like logic apps or the power platform start being able to produce and consume for custom functions. Over time, I've watched benefits accrue from common integration points functions drive across the organization.
So, ask questions, but make sure you understand what the organization is trying to achieve with the recommendation.
Disclaimer- I'm a Microsoft employee, but opinions my own.
Our current hosting costs for the projects I work on are about two orders of magnitude below your "worth it" cutoff :)
The integration stuff you mention is indeed very interesting, thank you for mentioning it. I can think of a couple of projects that would really benefit from Functions in this way. Our architect is mainly concerned with scalability here, however.
Some of the possible reasons why you'd go down this path:
1. Cost optimisation. Generally not a good driver unless you have very spiky workloads (which you don't).
2. Resilience/availability. This is a pretty good driver, especially if you've had issues recently; moving to serverless takes away almost all maintenance tasks and solves a lot of potential problems.
Some of the main trade-offs:
1. Developer velocity. It's generally pretty hard to debug locally, so you'll need to spend time working out how to do that, or do your debugging in the cloud.
2. Cold starts. These can be largely solved with solutions such as GraalVM, but you do need to invest time to implement them.
3. More complex internal application architecture. You need to either deploy your entire application as a single function or break it out into multiple functions, and you'd need to analyze how each option should work and performance-tune it.
That being said, I find the best way to look at these situations from a political perspective is to have a quick chat with the architect and look to understand what problem he's trying to solve and for you to mention your costs and take it into a cost/benefit discussion
If he says it's for cost benefits, you could say it will be an x-week migration, which has a developer salary cost of y, delays a new feature expected to bring z revenue, and adds a 10-20% operational delay to pushing out new features. So what would be the total cost saving and ROI?
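That back-of-the-envelope can be written down directly. All the numbers in the usage example are placeholders, not figures from this thread:

```python
def migration_roi(weeks, weekly_dev_cost, monthly_savings, horizon_months):
    """Upfront cost of a migration vs. what it saves over a horizon.

    Returns the upfront spend, the net return over the horizon, and the
    number of months until the savings pay back the migration effort.
    """
    upfront = weeks * weekly_dev_cost
    net = monthly_savings * horizon_months - upfront
    breakeven = upfront / monthly_savings if monthly_savings > 0 else float("inf")
    return {"upfront": upfront, "net": net, "breakeven_months": round(breakeven, 1)}

# Example with made-up inputs: an 8-week migration at $4k/week of dev time
# that shaves $300/month off the hosting bill, judged over two years.
print(migration_roi(8, 4000, 300, 24))
# -> {'upfront': 32000, 'net': -24800, 'breakeven_months': 106.7}
```

With small hosting bills the breakeven lands years out, which is exactly the "is the juice worth the squeeze" point.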
If there is no local state, then serverless is a feasible solution, if not the best. If there is, then you need to find some substitute for that local state, and the case for serverless is much worse.
It's common for servers to take a while to start up, it's common to see the first request to each endpoint take a bit longer, and it's a common optimisation to add keepalive to downstream services or tweak your database connection pooling. These are normally straightforward optimisations, and startup time isn't usually too much of an issue. With FaaS platforms these become significant hurdles that take engineering work to overcome, require introducing more services, more cost, etc.
Operationally: You need a few more/different specifics to avoid talking in generalities. How many requests per second? What's the floor? What's the peak? How bad a sudden surge ("thundering herd") do you ever see? What's the heaviest request / worst case response time?
Then you can start comparing the two solutions under various scenarios. How much will our average RPS cost us? Will the service deal well under very low or very high load? What happens when your worst-case thundering herd hits? Does your heaviest request fit comfortably within limits?
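One way to sketch that comparison, with placeholder unit prices (roughly in the shape of Lambda's per-request plus GB-second billing, but you should substitute your provider's current list prices):

```python
def lambda_monthly(rps, avg_ms, mem_gb,
                   price_gb_s=0.0000166667, price_req=0.0000002):
    # 30-day cost of a pay-per-use function at a steady request rate.
    # Prices here are placeholders; plug in real ones.
    reqs = rps * 86400 * 30
    gb_seconds = reqs * (avg_ms / 1000.0) * mem_gb
    return reqs * price_req + gb_seconds * price_gb_s

def server_monthly(instances, hourly_rate):
    # 30-day cost of always-on instances, regardless of load.
    return instances * hourly_rate * 24 * 30
```

Run it across your floor, average, and peak RPS and the crossover point falls out: low or bursty traffic favors pay-per-use, steady 24x7 traffic favors the always-on boxes.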
I tried hard to think of advice on how you might remove the glasses, if that is the situation here, but in honesty it is a tricky one. It is akin to a little worldview bubble, and those are tricky to actively shift in others. (An attempt at a suggestion: I think it would be best to come into team discussions around this not as being on "one side" but as the reasoned, dispassionate expert on all sides of the whole question.)
I say this as someone who up until recently wore the glasses (in my case it was Kubernetes). It took me a failed project to take them off; I hope that does not happen to you.
Make sure the serverless model includes all the gimmicks you currently have, such as firewall, WAF, cache, SSL termination, load balancer, current traffic levels, etc.
The title says it all, they might as well work for Azure.
... It's likely everything is an Azure shaped nail to them, do not trust them.
But if you compare Azure Functions with stored procedures in a DB, then it's pretty cool to have a kind of hot swapping at the function level.
I'd be cautious, but with a gradual migration there's hopefully time for reflection as well. Going 100% on anything is rarely a good idea, so hopefully your architect isn't religious about this.
Serverless can also mean something like EKS with Fargate. You get to use Kubernetes without managing any servers. Azure AKS has something similar with virtual nodes as I understand, though I haven't used them. I do think this model is better for long running services than serverless functions.
Then if you ever have a bug, you can easily set a breakpoint, step through the entire execution, and fix it.
Agree that this MUST always be possible whatever architecture changes occur.
Then have a flag/config that lets you specify that certain things communicate over the network (as serverless functions) instead of as in-process function calls.
I've NEVER seen teams do this though. It's like no one can imagine they will write a bug.
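A rough sketch of that flag idea, assuming Python and a made-up registry; `invoke_remote` stands in for whatever network invocation your platform uses (e.g. a boto3 `lambda.invoke` wrapper):

```python
import os

REGISTRY = {}

def service(fn):
    # Register a plain function as a "service"; locally it's just a dict lookup.
    REGISTRY[fn.__name__] = fn
    return fn

def call(name, payload, invoke_remote=None):
    # With USE_REMOTE=1 the call routes over the network via invoke_remote;
    # otherwise it's an in-process call, so breakpoints and stepping just work.
    if os.environ.get("USE_REMOTE") == "1" and invoke_remote is not None:
        return invoke_remote(name, payload)
    return REGISTRY[name](payload)

@service
def greet(payload):
    return {"message": "hello " + payload["name"]}
```

With the flag off, a debugger can step from the caller straight into `greet`; the serverless plumbing only exists on the deployed path.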
We've dealt with "new guy wants to overhaul ..." scenario. When I joined this company we were a C++ shop with some Perl and bash. Multiple new recruits successfully lobbied to implement refactors, or new projects in a hot language/framework. Several of the refactors were a huge waste of resources that either didn't come to fruition, or were only partially successful.
Now, we are a Perl shop with active development in 3 other languages(not counting front end), and we're maintaining legacy apps in an additional 4 languages. And we've deprecated apps in at least 3 additional languages.
I guess I should be thankful none of them have lobbied for switching databases. :-O In any given year, we average 3-4 programmers and 2-4 contractors (mostly front end). Two of us have been there 15+ years, but the other full-timers seem to move on around the three-year mark. Because of that, all the hot shots have left. When a major bug is discovered in their code it can take a long time to fix, and any breakage due to upgrades is quite a hassle, since those of us left aren't experts at every language we have to maintain.
1. The same stocks in $cloud_platform_provider that your architect has bought.
2. A bunch of certifications for $cloud_platform_provider so you also want to lock everyone and their mother down into that platform.
What benefits does the "cloud architect" say the migration will bring? It sounds like you have a reasonable backend api setup that works. There needs to be a strong motivation to do a migration like that.
I'm also not convinced you're at a scale where you need a cloud architect, but it's hard to say from your description. I bet their main motivation is delivering a project that justifies their role.
We run a lot of services in Kubernetes, and some of those services also run background jobs (the same container serves HTTP and does bg processing). I want us to migrate the background jobs from our containers to a dedicated platform (e.g. Lambdas): we can scale to 0 when they're not needed, we'll offload our Kubernetes cluster (it will serve only HTTP traffic, which is easy for us to scale), and, done right, we should get better debuggability/observability. We also currently orchestrate our jobs with Redis, which means we need a Redis instance for each service with bg jobs; I want to move orchestration to a separate service that stores its data in Postgres, so instead of running x Redis clusters we'll run just 1 Postgres.
The tricky thing is the rewrite, but frankly, we still need to do it and we don't need to rewrite whole services, just the code responsible for bg jobs.
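The Redis-to-Postgres orchestration move described above usually boils down to an atomic "claim the next pending job" step. Here's a toy in-memory sketch of that state machine; the real thing would lean on Postgres's `SELECT ... FOR UPDATE SKIP LOCKED`, and the table/column names in the SQL comment are invented for illustration.

```typescript
// Minimal model of a job-queue claim step. In a real Postgres-backed queue,
// the claim is one atomic statement so concurrent workers never grab the
// same job, e.g. (illustrative schema):
//
//   UPDATE jobs SET status = 'running'
//   WHERE id = (SELECT id FROM jobs WHERE status = 'pending'
//               ORDER BY id FOR UPDATE SKIP LOCKED LIMIT 1)
//   RETURNING *;
type Job = { id: number; status: "pending" | "running" | "done"; payload: string };

function claimNextJob(jobs: Job[]): Job | undefined {
  // Find the oldest pending job and mark it running in one step.
  const job = jobs.find((j) => j.status === "pending");
  if (job) job.status = "running";
  return job;
}
```

One Postgres instance can then serve every service's queue, instead of one Redis per service.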
Here are questions to ask:
What is the current monthly spend? What is the estimated monthly spend in the new system?
Perhaps the new serverless system is easier for operations and deployments. Does the new system provide for better uptime/monitoring? How is monitoring done on the current system? If there is a problem, like the service returning 500s, do you have the tooling to diagnose the issue? How does this change in the new system?
What is the developer experience on the new system? Is it easy to deploy to staging and production environments? How long does it take to create a new feature? What does the develop/test/debug loop look like in this system? How does this compare to the current system?
Ask yourself and others these types of questions. Maybe migrating to serverless is better, but it should depend on the answers to the questions/concerns I listed above.
In the Azure world, a more modern option going forward is Azure Container Apps, which just runs Docker containers, though you'll still have 8+ second cold starts and will need to run at least a single instance full time; it is cheaper than Functions Premium, at least. I'd also suggest looking at an evented architecture using Dapr, which is built into ACA. In the GCP world, Cloud Run is frankly amazing.
If I was starting from scratch, I'd use serverless. If you're migrating everything, I think that is a giant project that needs justification. I'd ask, "What specific current problem do you have that it would solve?"
Cold boot is only a (minor) issue on the first hit; it's quickly amortized.
Some async use cases are great, but large-scale apps become a clusterfuck. From experience: we built a whole feature on AWS Lambda. It sucked. Two years later it's a Spring app in a container.
The question is: what alternative do you propose? How does your alternative reduce hardware when load is low? How does it order more hardware when load is high? How much time does that take? What's your plan if your data center is cut off from the Internet because of a bad router configuration?
A proper alternative to serverless is a Kubernetes cluster. It'll likely cost less (for a big application), but it'll require more knowledge to manage properly.
You could use a simplistic setup with a dedicated server or virtual machines and manual operations, but at your load I'd consider that inappropriate.
Anyway, if management has decided to hire an Azure Cloud Architect, the decision is already made, and I suggest you relax and enjoy the new experience.
Several folks have written about it (Architect Elevator[0] is a good blog on these types of topics; he routinely talks about tradeoffs and ROI to the business). High Scalability's[1] "what the internet says" posts frequently highlight serverless projects (both pro and con).
_______________
[0] https://architectelevator.com/blog
[1] most recent - http://highscalability.com/blog/2022/7/11/stuff-the-internet...
It's OK... it would work.
It could be somewhat more expensive or less expensive to host, and somewhat more or less performant. (You didn't say where the data lives, but if it's in a database and you aren't doing something extra with it, then this layer might not be that important, one way or the other.)
For a 100-user app, I'm guessing the major cost here is the switchover cost. Whether it makes sense or not depends on details of where you are now and what problem(s) this is meant to solve.
No one here knows that (maybe even you don't either?) so we can't really give you an answer, just some general pros and cons of serverless.
For my own project (uptime monitoring + status pages), I got to about 500 users before serverless costs were eating enough of my profits to make me want to move to VMs. It was nice to be able to validate the idea on a service that costs zero if no one is using it.
With continuously running applications (100 concurrent users), it makes zero sense to use serverless as you're paying a high premium over a continuously running VM. I'd just use a VM and scale the number of instances serving the API.
The main issues: 1. Unpredictable performance - latency (cold starts), concurrency limits (how quickly can we scale to X concurrent requests?), etc. We spent many hours with AWS support before moving away from Lambda. 2. Short-running processes are terrible in many ways - no DB connection pooling, no in-memory cache.
I'd be much more happy if AWS fixed scale-up speed of ECS tasks so you can scale up your services in a reasonable time, than having these one-shot tasks.
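For what it's worth, the standard mitigation for the "no connection pooling" problem is to keep the expensive resource in module scope so warm invocations of the same container reuse it. A sketch with a stubbed client (`connectToDb` is a stand-in for a real driver, not an actual API):

```typescript
// Lambda containers are reused between invocations while warm, so anything
// created at module scope survives across requests. Only a cold start pays
// the connection cost again.
let connectCount = 0;
let db: { query: (sql: string) => string } | undefined;

function connectToDb() {
  connectCount++; // stands in for the expensive part: TCP + TLS + auth
  return { query: (sql: string) => `result of ${sql}` };
}

export async function handler(event: { sql: string }) {
  db = db ?? connectToDb(); // runs once per container, not once per request
  return db.query(event.sql);
}
```

It softens the problem but doesn't eliminate it: every concurrent container still opens its own connection, which is why people end up adding proxies like RDS Proxy in front of the database.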
Personally, I was excited for serverless, but after using API Gateway and Lambda to serve a simple REST API it seemed like more work compared to using a load balancer to route requests to a container running in ECS. ECS can autoscale too, so you can scale up and down as required.
But if you have an API that is getting sustained traffic, Lambdas probably aren't your best bet -- you're going to want a container that is always running.
But to be honest, with 120 routes and 100 users, it sounds like Lambdas are a good way to go.
From my experience of 4 years with serverless on AWS, the following problems have come up:
- Difficult to debug
- Difficult to collect logs - Lambda@Edge
- Slow cold starts
- Frontend and NodeJS bundling are problematic - size limits, slowness, and unpredictable problems
- Pricing is difficult to estimate
- Careful planning needed for network and architecture - how lambdas work together
- Workflow orchestration might be needed
What is serverless good for?
- Queue processing
- Event processing
- Internal infrastructure code
This is most likely a waste of time and the "cloud architect", like most cloud proponents, has no fucking clue what they're talking about.
It should mean “keeping state to an absolutely minimum, and relying on event-based architecture.”
Are you familiar with event-based architecture? Are you familiar with functional programming?
This is your time to shine.
There’s a strong possibility you’ll end up with Lambdas (or whatever) that are just CRUD endpoints.
That would be bad.
So be prepared to fight out what comes next.
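For a toy illustration of what "minimal state + event-based architecture" means in practice: handlers become pure functions of their inputs, so state can be rebuilt by replaying events rather than held in the process, which is exactly the shape that ports cleanly to serverless triggers. The event names below are invented for the example.

```typescript
// Event-based sketch: no hidden state, everything a handler needs arrives
// as arguments, so the same function can back a queue trigger, an HTTP
// endpoint, or a replay job.
type Event =
  | { kind: "deposit"; amount: number }
  | { kind: "withdraw"; amount: number };

// Pure transition function: (state, event) -> new state.
function apply(balance: number, ev: Event): number {
  switch (ev.kind) {
    case "deposit":
      return balance + ev.amount;
    case "withdraw":
      return balance - ev.amount;
  }
}

// Replaying the event log reconstructs state; nothing lives in the process.
const events: Event[] = [
  { kind: "deposit", amount: 100 },
  { kind: "withdraw", amount: 30 },
];
const balance = events.reduce(apply, 0);
```

If instead each Lambda just does `SELECT`/`UPDATE` against shared tables, you've rebuilt CRUD endpoints with extra latency, which is the failure mode described above.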
- Easily scalable/autoscaling
- Drastically reduced operations/maintenance/devops overhead
- CI/CD can be much simpler
- Observability is built in (metrics, logging, alerting is built in)
- Built in connections to other cloud products
If you can stomach the vendor lock-in then it might not be so bad.
The advantages are:
* Lower costs from much better resource utilization rates. Comparisons against a perfectly sized fleet of servers is inherently flawed. Sure, you can make sure auto-scaling happens, but that costs time and energy to get right. Even then, you're always going to be having to leave some buffer room. Instead of saying serverless is good for bursty/low traffic, I'd frame it as serverless is great for any workload that isn't close to a fixed load. Dev and other non-prod environments also basically cost nothing instead of potentially being quite expensive to replicate multi-AZ setups. In practice, serverless is going to be cheaper for a lot more use cases than you may think at first.
* Tight integration with IaC. Your application and infra logic can be grouped based on logical units of purpose rather than being separated by technology limitations. This is especially true if you use things like CDKs.
* Zero need to tune until you get to massive scale. We went from our first user to hundreds of thousands of users with no adjustment needed at all. Even at millions of users, there's little you'd need to change from the infra side beyond maybe adding a cache layer and requesting limit increases. Obviously app/db optimizations might be needed, but for the most part, scaling problems become billing problems.
* A simpler threat model. If you're running servers, keeping them secure is not trivial. There's just a lot less to do to keep serverless apps secure.
* Ability to avoid Kubernetes and other complicated infra management. One could argue that you're just trading Kubernetes complexity for cloud specific complexity. That's true, but it's still a net reduction in complexity.
* Operational overhead is way down. A base level of logging/tracing/metrics comes out of the box (at least on AWS, not sure about Azure). No need to run custom agents for statsd/collectiond/prometheus/opentelemetry/whatever. No need to spend any time looking at available disk space metrics or slow-building memory leaks that creep up over weeks. It just works.
* Easy integration with lots of cloud managed services. Want to deploy an API endpoint? Want to build a resolver for an AppSync GraphQL field? Want to write code that runs in response to some event or alarm going off? Want to process messages from a queue without spinning up a fleet to longpoll from it? Want to write code that applies transforms on a data stream before writing to your data warehouse? The infra definitions for all of these all share a foundation. You have a unified API for everything.
Having your APIs in a bunch of different App Services is sort of a bad idea. You can do it, but you're likely going to have "fun" with how much complexity is involved in setting up the VNETs, Private Endpoints, Custom Domains, DNS stuff and different Subnets that can't be shared across App Service Plans for all those apps and their deployment slots. You're likely also going to pay a significantly higher price for it than the alternatives, especially if you use containers, but it's "significantly higher" in a way that's "unimportant" because it's likely peanuts compared to developer salaries, total IT expenses and so on.
That being said, an Azure Function App is still an Azure App Service, so unless your Architect means that you should consolidate your different backend App Services into fewer Function Apps, then I don't see the benefit. If you're unsure what I mean by this, it's that you can replace the 60 API routes with 60 functions in an Azure Function App.
> I'm skeptical - I was under the impression that serverless was for small "burstable" apps with relatively low traffic, or background processing.
You're not correct about this. They scale just fine, and they can handle huge workloads, sometimes a lot better than their alternative, though at the cost of locking yourself into your cloud provider.
> The consensus on the internet seems to be "serverless has its use cases" but it's not clear to me what those use cases are.
I can't speak for AWS, but the basic way to view an Azure Function is to use a simple Express NodeJS API as an example. In a standard Azure App Service you're going to write the Express part, you're going to write the routes and you're going to write middleware for them. In a standard Azure Function App you take the Express part out, because that part is handled by the Azure Function.
Azure Functions have the benefit of integrating really well with the rest of Azure, and in many cases they can be really good. It's also much easier to work with them because you don't have to care about the "Express" part and can simply work on the business logic. The downside is that you're limited to what Microsoft puts in the Azure Function functionality, and that you lock yourself into Azure.
With C# you further have to consider whether you want to run your Azure Function as in-process dotnet or as dotnet-isolated. Again, you're dealing with the degree to which you'll want to lock yourself into Azure.
> So what should you do?
I think your Cloud Architect should look into Azure Container Apps, or AKS if you want less lock-in. Both are Kubernetes, but Azure Container Apps sort of handles the heavy lifting for you, though with some of the highest lock-in you'll find in any Azure product.
It depends a little on your actual circumstances, but generally speaking, your backend service will have an easier life in AKS once you're up and running. I wouldn't personally touch Azure Container Apps, but I'm in a sector of the EU where we might be forced to leave Azure. If you're not, it's a much easier road to Kubernetes greatness than AKS.