As in, "we have a PHP monolith used by all of 12 people in the accounting department, and for some reason we've been tasked with making it run on multiple machines ("for redundancy" or something) by next month.
The original developers left to start a Bitcoin scam.
Some exec read about the "cloud", but we'll probably get just enough budget to buy a coffee for an AWS salesman.
Don't even dream of hiring a "DevOps" to deploy a Kubernetes cluster to orchestrate anything. Don't dream of hiring anyone, actually. Or paying for anything, for that matter.
You had one machine; here is a second machine. That's a 100% increase in your budget; now go get us some value with that!
And don't come back in three months to ask for another budget to 'upgrade'."
Where would someone start?
(EDIT: To clarify, this is a tongue-in-cheek, hyperbolic scenario, not a cry for immediate help. Thanks to all who offered help ;)
Yet, I'm curious about any resource on how to attack such problems, because I can only find material on how to handle large-scale, multi-million-user, high-availability stuff.)
Usually, your monolith has these components: a web server (Apache/nginx + PHP), a database, and other custom tooling.
> Where would someone start?
I think a first step is to move the database to something managed, like AWS RDS or Azure Managed Databases. Herein lies the basis for scaling out your web tier later. And here you will find the most pain because there are likely: custom backup scripts, cron jobs, and other tools that access the DB in unforeseen ways.
If you get over that hump, you have taken your first big step towards a more robust model. Your DB will have automated backups, managed updates, failover, read replicas, etc. You may or may not see a performance increase, because you have effectively split your workload across two machines.
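For the mechanics of the move itself, a minimal sketch, assuming the app runs on MySQL and the managed instance is already provisioned (the endpoint, user, and database names below are placeholders; for Postgres it's the same dance with pg_dump/pg_restore):

    # Dump schema + data from the old box without locking everything.
    mysqldump --single-transaction --routines --triggers \
        -u appuser -p appdb > appdb.sql

    # Load the dump into the managed instance (placeholder RDS endpoint).
    mysql -h appdb.abc123.eu-west-1.rds.amazonaws.com \
        -u appuser -p appdb < appdb.sql

    # Then point the PHP config at the new host, and go hunting for every
    # cron job and backup script that still talks to localhost.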
_THEN_ you can front your web tier with a load balancer, i.e. you load balance to one machine. This gives you: better networking, custom error pages, support for sticky sessions (you likely need them later), and better/more monitoring.
From there on you can start removing those custom scripts from the web-tier machine and start splitting this into an _actual_ load-balanced infrastructure, going to two web-tier machines, where traffic is routed using sticky sessions.
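For the sticky-session piece, a minimal sketch of what the front config might look like, assuming plain nginx doing the balancing; the IPs are made up, and ip_hash is just the simplest form of stickiness (a cookie-based method or a cloud load balancer's own session affinity would do the same job):

    # Minimal load-balancer config for two PHP backends (example IPs/paths).
    cat > /etc/nginx/conf.d/app.conf <<'EOF'
    upstream php_app {
        ip_hash;                 # crude stickiness: same client IP -> same backend
        server 10.0.0.11:80;     # web-tier machine 1
        server 10.0.0.12:80;     # web-tier machine 2
    }
    server {
        listen 80;
        location / {
            proxy_pass http://php_app;
            proxy_set_header Host $host;
            proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        }
    }
    EOF
    nginx -t && systemctl reload nginx    # validate, then reload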
Depending on the application design you can start introducing containers.
Now, this approach will not give you a _cloud-native awesome microservice architecture_ with CI/CD and devops. But it will be enough to have higher availability and more robust handling of the (predictable) load in the near future. And along the way, you will remove bad patterns, which will eventually allow you to move to a better approach.
I would be interested in hearing if more people face this challenge. I don't know if guides exist around this on the webs.
Here's the CliffsNotes version for your situation:
1. Build a server. Make an image/snapshot of it.
2. Build a second server from the snapshot.
3. Use rsync to copy files your PHP app writes from one machine ('primary') to another ('secondary'). (A rough sketch of this step and the step-5 cutover follows the list.)
4. To make a "safe" change, change the secondary server, test it.
5. To "deploy" the change, snapshot the secondary, build a new third server, stop writes on the primary, sync over the files to the third server one last time, point the primary hostname at the third server IP, test this new primary server, destroy the old primary server.
6. If you ever need to "roll back" a change, you can do that while there's still three servers up (blue/green), or deploy a new server with the last working snapshot.
7. Set up PagerDuty to wake you up if the primary dies. When it does, point the primary hostname at the second box's IP.
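A minimal sketch of steps 3 and 5, assuming the app writes its files under /var/www/app/uploads and that 'secondary' and 'newbox' are SSH-reachable names (all of these are made-up examples):

    # Step 3: from the primary, keep the secondary's copy of app-written files
    # fresh (run this from cron every few minutes).
    rsync -az --delete /var/www/app/uploads/ secondary:/var/www/app/uploads/

    # Step 5 (cutover), also run on the primary: stop writes, sync one last time,
    # then repoint the primary hostname's DNS record at the new box's IP.
    systemctl stop php-fpm        # or flip the app into maintenance mode
    rsync -az --delete /var/www/app/uploads/ newbox:/var/www/app/uploads/
    # Update the DNS A record (console or your provider's CLI), test the new
    # primary, then destroy the old box.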
That's just one way that is very simple. It is a redundant active/passive distributed system with redundant storage and immutable blue/green deployments. It can be considered high availability, although that term is somewhat loaded; ideally you'd make as much of the system HA as possible, such as independent network connections to the backbone, independent power drops, UPS, etc. (both for bare metal and VMs).
You can get much more complicated but that's good enough for what they want (redundancy) and it buys you a lot of other benefits.
Having said that, I have done something very similar for large pools of terminal services session hosts. (Think of a Windows box with a special license that allows multiple remote connected desktop users, and 100 pre-installed GUI applications.)
For web apps, you almost always want either of the following:
- A central file share or NFS mount of some sort, with the servers mounting it directly. Ideally with a local cache that can tolerate file server outages and continue in read-only mode. These days I use zone-redundant Azure File Shares for that. They're fully managed and scale to crazy levels. On a small scale they're so cheap that they're practically free, but have the same high availability as a cluster of file servers in multiple data centres! This is a good approach if your web app writes files locally in normal operation. If you need to distribute an app like this without rewriting that aspect, a central file share is the easy way.
- An automated deployment from something like Azure DevOps pipelines or GitHub Actions that builds VMs one at a time. Both are free in most small-scale scenarios. (For PHP, deployment is just a file copy, so a bash script triggered from a management box is sufficient!) The problem with the "sync stuff around" approach is that corruption gets copied around too. Small one-time mistakes become "sticky" and never undo themselves. Junk files accumulate, eventually causing problems. This method solves that.
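As a concrete (hypothetical) sketch of the "deployment is just a file copy" idea, assuming a management box with SSH access to the web hosts and a clean checkout of the release in ./release (all names are examples):

    #!/usr/bin/env bash
    # Naive rolling deploy for a PHP app: copy a clean checkout to each host in turn.
    set -euo pipefail

    RELEASE_DIR=./release      # fresh checkout/build of the app
    HOSTS="web1 web2"          # web-tier machines

    for host in $HOSTS; do
        # --delete keeps hosts identical to the release, so junk files never accumulate
        rsync -az --delete "$RELEASE_DIR"/ "$host":/var/www/app/
        ssh "$host" 'systemctl reload php-fpm'   # pick up the new code/config
        # (optionally: curl a health-check URL here before moving to the next host)
    done

This is the same idea the pipelines automate, just without the pipeline.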
Additionally, in all modern clouds you can run "plain" virtual machines in scale sets, where the instances can be scaled out. The scaling part is actually not so important! The key bit is that this will force you to fully automate the VM deployment process, including base OS image updates. Rolling upgrades become easy. Similarly, you can undo the damage done by a malware attack by simply scaling to zero, and then scaling back up. This approach is totally stateless, so you don't need to worry about backing up the VMs. Just rebuild on demand.
But all of that is just a lot of manual labour. It's much easier to host simple apps on a managed platform like Azure App Service, which takes care of all of this. The low-end tiers are cheaper than a pair of VMs.
I have to admit, there's something about this comment that makes me sad in a way. Not to say that there's anything inherently wrong with this question or to say that I disagree with you exactly. It's just that I like the idea of computing / hacking being centered more around a mindset of limitless possibilities, exploration, questioning the boundaries of what can be done (as opposed to what should be done?), and not something that's caught up in drudgery like budgets, schedules, and "business stuff."
Sorry, guess I'm just feeling nostalgic for a minute or something (maybe because I've been watching that 5 hour long interview Lex Fridman did with Carmack) and am flashing back to what computing was to me when I first got involved. Back in those days, a paper / book like this would have evoked a "WOW, HOW F%#@NG COOL IS THIS!??!!????" reaction from me. And I guess it still kinda does in a weird sort of way, even though I also have to deal with budgets, schedules, and the drudgery of the business world. sigh
Then you can follow along with parts 2, 3 & 4 to scale up by factors of ~10 or more:
https://aws.amazon.com/blogs/startups/scaling-on-aws-part-2-... https://aws.amazon.com/blogs/startups/scaling-on-aws-part-3-...
My pet peeve with distributed and ops books is that they usually start by laying out all those problems, but then move on to either:
- explain how Big Tech has even bigger problems, before explaining how you can fix Big Tech problems with Big Tech budgets and headcount by deploying just one more layer of distributed cache or queue that virtually ensures your app is never going to work again (That's "Designing Data-Intensive Applications", in bad faith.)
- or, not really explain anything, wave their hands chanting "trade-offs, trade-offs", and start telling kids' stories about Byzantine Generals.
Like a bonsai tree. There’s a point where you’ve written enough helpers (complete with tests) and abstracted enough logic away from the views that you’re suddenly able to rapidly refactor all of the crap that’s left, and when you’re done the resulting codebase can be easily distributed or scaled.
So I’d start by just breaking the data away from the logic, and then breaking that data away from the database, with the idea being to use a Redis server as your app’s data model, which you can call some function to sync to the database from time to time.
Then build an event logger that encompasses everything (every interaction, at least) that happens on the front end (this is trivial with JavaScript on* event handlers.)
Then spin up two nodes of it and write some function that merges two of these event trees (sorting by timestamp, plus a bias you pick for when two events happen at the same time).
It won’t scale to 1000 users, and you’ll find kinks to work out along the way. But this is a good start.
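For the merge step, a minimal sketch of the idea, assuming each node appends events as one line per event in the form "unix_timestamp node_id payload" (a format made up here for illustration); sorting by timestamp and using the node id as the tie-break bias means both nodes arrive at the same final order:

    # Merge two already-sorted event logs: order by timestamp (field 1),
    # break ties by node id (field 2) so every node agrees on the result.
    sort -m -k1,1n -k2,2 node_a.log node_b.log > merged.log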
For your problem, you can start by configuring nginx to work as a load balancer and spinning up a second VM with the PHP app.
Also, philosophically, I guess, a "distributed" system starts at "two machines". (And you actually get most of the "fun" of distributed systems with "two processes on the same machine".)
We're taught how to deal with "N=1" in school, and "N=all fans of Taylor Swift in the same second" at FAANGs.
Yet I suspect most people will be working on "N=12, 5 hours a day during office hours, except twice a year." And I'm not sure what the reference techniques for that are.
You have it backwards. Salesmen will usually buy you the coffee. Even if you don't have the budget today, they still have an expense account and will usually buy you coffee.
Seriously. Most of the difficulty of distributed systems is because you're actually having to manage the flow of information between distinct members of a networked composite. Every time someone is out of the loop, what do you do?
Can you tell if someone is out of the loop? What happens if your detector breaks?
Try it with your coworkers. You have to be super serious about running down the "but how did you know" parts.
Once you have a handle on the ways you trip, go hit the books, and learn all the names for the SNAFUs you just acted out.
I find this comment highly ignorant. The need to deploy a distributed system is not always tied to performance or scalability or reliability.
Sometimes all it takes is having to reuse a system developed by a third party, or consume an API.
Do you believe you'll always have the luxury of having a single process working on a single machine that does zero communication over a network?
Hell, even a SPA calling your backend is a distributed system. Is this not a terribly common use case?
Enough about these ignorant comments. They add nothing to the discussion and are completely detached from reality.
My point is precisely that transitioning from a single app on a single machine is a natural and necessary part of a system's life, but that I can't find satisfying resources on how to handle this phase, as opposed to handling much higher load.
Sorry for the missed joke.
Step 2: There is no step 2.
Notes on CPSC 465/565: Theory of Distributed Systems [pdf] - https://news.ycombinator.com/item?id=11911402 - June 2016 (9 comments)
> These are notes for the Fall 2022 semester version of the Yale course CPSC 465/565 Theory of Distributed Systems
There are a lot of algorithms, but I don't see CRDTs mentioned by name. Perhaps it's most closely related to "19.3 Faster snapshots using lattice agreement"?
Wrong level of abstraction. This is clearly a lower level course than that and discusses more fundamental ideas.
A quickie look through chapter 6 reminds me of CRDTs, at least the vector clock concept. Other bits from other parts of this course probably need to be combined into what would be called a CRDT.