And the solution to this problem is to bring the service back online slowly, in a rate-limited fashion, rather than letting the whole thundering herd through the door at once.
Your problem is a real one though. Where I worked, we called that backlog, and we managed it with 'floodgates' ... when the system breaks, you close the gates, and then you need to open them slowly.
In an ideal world, your system would self-regulate from dead to live, shedding load as necessary but always making headway. But sometimes a little help is needed to avoid the feedback loop where timed-out client requests still get processed on the server, keeping the server in overload.
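The "open the gates slowly" idea can be sketched as a token bucket that admits recovering traffic at a fixed rate and sheds the rest, instead of letting the whole backlog through at once. All names here are illustrative, not from any particular system:

```python
# A minimal floodgate sketch: after an outage the bucket starts empty,
# so traffic is admitted only as fast as tokens refill.
import time

class Floodgate:
    def __init__(self, rate_per_sec, burst):
        self.rate = rate_per_sec
        self.capacity = burst
        self.tokens = 0.0            # start closed: no tokens after an outage
        self.last = time.monotonic()

    def allow(self):
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False                 # shed: the client should retry with backoff

gate = Floodgate(rate_per_sec=100, burst=10)
admitted = sum(1 for _ in range(1000) if gate.allow())
print(admitted < 1000)               # most of the herd is shed, not queued
```

Rejected requests get fast failures instead of piling up in a queue, which is what lets the server make headway.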
I think the article just used this phrase to describe something else. (Great article otherwise).
The short version is that when you have multiple processes waiting on listening sockets and a connection arrives, they all get woken up and scheduled to run, but only one will pick up the connection, and the rest have to go back to sleep. These futile wakeups can be a huge waste of CPU, so on systems without accept() scalability fixes, or with more tricky server configurations, the web server puts a lock around accept() to ensure only one process is woken up at a time.
The term (and the fix) dates back to the performance improvement work on Apache 1.3 in the mid-1990s.
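A minimal sketch of that accept-serialization fix, using threads and a `threading.Lock` in place of Apache's cross-process accept mutex (all names here are illustrative):

```python
# Serializing accept(): only the lock holder blocks in accept(), so an
# incoming connection wakes exactly one worker instead of the whole herd.
import socket
import threading

accept_lock = threading.Lock()

def worker(listener, results):
    with accept_lock:
        conn, _ = listener.accept()  # one worker at a time sits in accept()
    with conn:
        conn.sendall(b"hello")
    results.append(threading.current_thread().name)

listener = socket.socket()
listener.bind(("127.0.0.1", 0))      # port 0: let the OS pick a free port
listener.listen()
port = listener.getsockname()[1]

results = []
workers = [threading.Thread(target=worker, args=(listener, results))
           for _ in range(4)]
for t in workers:
    t.start()

# Four clients; each connection is handled by exactly one worker.
for _ in range(4):
    with socket.create_connection(("127.0.0.1", port)) as c:
        data = b""
        while len(data) < 5:
            chunk = c.recv(5 - len(data))
            if not chunk:
                break
            data += chunk
        assert data == b"hello"

for t in workers:
    t.join()
listener.close()
print(len(results))  # 4: one handled connection per worker
```

In the real multi-process case the lock has to live in shared memory (or be a file lock), but the shape of the fix is the same.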
That... doesn't have much to do with the thundering herd problem. It also doesn't make much sense as a concept on its own merits -- say you come in to work and your inbox is full enough for three inboxes. Does that fact, in itself, mean that you decide you're done for the day? No, it just means you have a much longer queue to work through than usual.
The thundering herd problem refers to what happens when (1) a bunch of agents come to you for something while you're busy; (2) you tell them all "I'm busy, go away and come back later"; and (3) the come-back-later time you give to each of them is identical, so they all come back simultaneously.
And that's exactly what's happening here, except that instead of giving each worker thread a come-back-later time when it asks for work, you're receiving work, sending out individual messages to every worker saying "hey, I'm not busy anymore, come back RIGHT NOW and get some more work", and then rejecting all but one of the thundering herd that shows up. The reason the Gunicorn docs and the uWSGI docs both refer to this as a "thundering herd" problem is that it's a near-perfect match for the problem prototype. The only difference is that, instead of giving out identical come-back-later times to worker threads as they ask you for work, you tell them to wait for a notification that includes a come-back-later time, and then when you get one piece of work you fire off that notification separately to every sleeping thread, including identical come-back-later times in each one.
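The pattern is easy to reproduce with a condition variable. This is a toy illustration, not Gunicorn's or uWSGI's actual code: one job arrives, `notify_all()` wakes every sleeping worker, one wins the race, and the rest discover there is nothing to do.

```python
# Counting futile wakeups when the whole herd is woken for a single job.
import threading
import time

cond = threading.Condition()
queue = []
waiting = 0   # workers currently parked in cond.wait()
woken = 0     # how many wait() calls have returned
futile = 0    # wakeups that found no work: the thundering herd
done = False

def worker():
    global waiting, woken, futile
    with cond:
        while True:
            if queue:
                queue.pop()      # exactly one worker wins the race
                return
            if done:
                return
            waiting += 1
            cond.wait()          # woken by notify_all() below
            waiting -= 1
            woken += 1
            if not queue and not done:
                futile += 1      # woke up for nothing

workers = [threading.Thread(target=worker) for _ in range(4)]
for t in workers:
    t.start()

while True:                      # wait until all four workers are asleep
    with cond:
        if waiting == 4:
            break
    time.sleep(0.01)

with cond:
    queue.append("job")
    cond.notify_all()            # wake the whole herd for a single job

while True:                      # wait until every wakeup has been processed
    with cond:
        if woken == 4:
            break
    time.sleep(0.01)

with cond:
    done = True
    cond.notify_all()            # release the losers
for t in workers:
    t.join()
print(futile)  # 3: every wakeup but one was futile
```

With `notify()` instead of `notify_all()` only one worker would wake per job, which is essentially the fix the servers apply.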
If my SLA is 24-hour response time, and the inbox is FIFO, and I can't drop old messages, I'm most likely not hitting the SLA. If they all came in overnight, I'll hit the SLA for day 1, but I will be busy all of days 2 and 3 and never respond on time. If after day 1 I get a day's worth of messages every day, I'll never catch up.
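Spelling out the arithmetic in that scenario: with a three-day backlog, one day of capacity per day, and one new day of work arriving per day, the backlog never shrinks.

```python
# Backlog evolution: arrivals exactly cancel capacity, so the initial
# backlog persists forever and every message waits ~3 days.
backlog_days = 3.0
capacity_per_day = 1.0
arrivals_per_day = 1.0

for day in range(1, 4):
    backlog_days += arrivals_per_day - capacity_per_day

print(backlog_days)  # 3.0: you never catch up
```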
One solution is to use a soft starter, which slowly brings the motor up to speed.
If anybody is interested, I've packaged both as Docker containers:
HAProxy queuing/load shedding: https://hub.docker.com/r/luhn/spillway
nginx request buffering: https://hub.docker.com/r/luhn/gunicorn-proxy
* It does have an http-buffer-request option, but this only buffers the first 8kB (?) of the request.
Works amazingly well! We run our python API tier at 80% target CPU utilization.
In gunicorn, `sync` mode does exhibit rather pathological connection churn, because it does not support keep-alive. Generally, most load-balancing layers will already do connection pooling to the upstream, meaning your gunicorn processes won't really be accepting many connections after they've "warmed up". Unfortunately, this doesn't apply in sync mode :(. Connection churn can waste CPU.
Another thing to also note is that if you have 150 worker processes, but your load balancer only allows 50 connections per upstream, chances are 100 of your processes will be sitting there idle.
Something just doesn't feel quite right here.
EDIT: I do see mention of `gthread` worker - so you might be already able to support http-keepalives. If this is the case, then you should really have no big thundering herd problem after the LB establishes connections to all the workers.
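For what it's worth, enabling keep-alive via threaded workers is just a configuration change. A sketch of a `gunicorn.conf.py`, where the worker and thread counts are placeholders, not recommendations:

```python
# gunicorn.conf.py (sketch): switching from sync to threaded workers so
# keep-alive connections from the load balancer persist after warm-up.
worker_class = "gthread"   # sync workers do not support keep-alive
workers = 4
threads = 8
keepalive = 5              # seconds an idle keep-alive connection is held open
```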
Sounds like an app like clubhouse might have lots of small, fast responses (like direct messaging), where very little of the response time is spent in application code. Does your API happen to do a lot of CPU-intensive stuff in application code?
2. our load balancer buffers requests as well
> In fact, with some app-servers (e.g. most Ruby/Rack servers, most Python servers, ...) the recommended setup is to put a fully buffering webserver in front. Due to its design, HAProxy can not fill this role in all cases with arbitrarily large requests.
A year ago I was evaluating a recent version of HAProxy as a buffering web server and successfully ran a slowloris attack against it. Thus switching from NGINX is not a straightforward operation, and your blog post should mention the http-buffer-request option and the slow-client problem.
Of the 3 main languages for web dev these days - Python, PHP and Javascript - I like Python the most. But it is scary how slow the default runtime, CPython, is. Compared to PHP and Javascript, it crawls like a snake.
PyPy could be a solution, as it seems to be about 6x faster on average.
Is anybody here using PyPy for Django?
Did Clubhouse document somewhere whether they are using CPython or PyPy?
When using something like Golang, I have apps doing normal CRUD-ish queries at 10k QPS, on 32c/64g machines. For most web apps, 10k QPS is much more than they will ever see, and the fact that it is all done in a single process means you could do really cool things with in-memory datastructures.
Instead, every single web app is written as a distributed system, when almost none of them need to be, if they were written on a platform that didn't eat all of their resources.
People don't use Python because they want performance. People use Python because of productivity, frameworks, libraries, documentation, resources and ecosystems. Most projects don't even need 10k QPS, but most projects do need an ORM, a migrations system, authentication, sessions, etc. Python has battle-tested tools and frameworks for this.
I've been told off in code review for using Python's concurrent.futures.ThreadPoolExecutor to run some http requests (making the code finish N times faster, in a context where latency mattered) "because it's hard to reason about".
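A sketch of the pattern described above: fanning out I/O-bound calls with `ThreadPoolExecutor` so total latency is roughly that of the slowest call, not the sum. `fetch()` and the URLs are illustrative stand-ins for real HTTP requests.

```python
# Fan out N I/O-bound calls across a thread pool; for latency-sensitive
# code this finishes in ~one round trip instead of N round trips.
from concurrent.futures import ThreadPoolExecutor
import time

def fetch(url):
    time.sleep(0.1)              # stand-in for a ~100 ms HTTP round trip
    return f"response for {url}"

urls = [f"https://example.com/{i}" for i in range(8)]

start = time.monotonic()
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(fetch, urls))  # preserves input order
elapsed = time.monotonic() - start

print(len(results))              # 8
print(elapsed < 0.5)             # ~0.1 s in parallel vs ~0.8 s sequentially
```

The GIL is not a problem here because the threads spend their time blocked on I/O, not running Python bytecode.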
I specifically said "market share", not "best" or "favorite".
https://www.wappalyzer.com/technologies/programming-language...
You can compile it to JS or to Webassembly. But you can do that with every language.
I am glad I am not the only one. I've had so many issues with setting up sockets, both with gevent and uWSGI, only to be left even more confused after reading the documentation.
At a guess, it's probably most loved by people picking old school simple architectures that aren't the sort of thing that goes viral.
I have been through this journey; we eventually migrated to Golang and it saved a ton of money and firefighting time. Unfortunately, the Python community hasn't been able to remove the GIL. It has its benefits (especially for single-threaded programs), but I believe the costs (lack of concurrency abstractions; async/await doesn't cut it) far outweigh them.
Apart from what the article mentions, other low-hanging fruit worth exploring are:
[1] Moving to PyPy (this should give some perf for free).
[2] Bifurcate metadata and streaming, if not already done. All the Django CRUD stuff could be one service, but the actual streaming should be a separate service altogether.
If not for the troubles they experienced with their hosting provider and managing deployments / cutting over traffic, it might have been cheaper to just keep scaling horizontally rather than putting in the time to investigate these issues. I'd also love to see some actual latency graphs: what's the P90 like at 25% CPU usage with a simple Gunicorn / gevent setup?
I am also wondering about the 144 workers on 96 vCPUs, which is not 96 CPU cores but 96 CPU threads. So effectively 144 workers on 48 CPU cores, possibly running at sub-3GHz clock speeds. But it seems they got it to work out in the end (maybe at the expense of latency).
And on top of that Nginx Plus is also expensive as hell.
I pay for apps; it's not a healthy attitude.
* use uWSGI (read the docs, so many options...)
* use HAProxy, so very very good
* scale python apps by using processes.
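On the last point, the Gunicorn docs' rule of thumb for sizing process-based workers is (2 x cores) + 1; a trivial sketch:

```python
# Worker-count rule of thumb from the Gunicorn docs for process-based
# (sync) workers: (2 x CPU cores) + 1.
import multiprocessing

workers = multiprocessing.cpu_count() * 2 + 1
print(workers >= 3)  # True on any machine with at least one core
```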
Weighing the opportunity cost of figuring out why only 29 workers are receiving requests against adding new features that generate more revenue seems like a quick decision.
Personally, I just start off with that now in the first place, the development load isn't any greater and the solutions that are out there are quite good.
What I'm talking about is just pointing at something like AppEngine, Cloud Functions, etc... (or whatever solution AWS has that is similar) and being done with it. I'm talking about not running your own infrastructure, at all. Let AWS and Google be your devops so that you can focus on building features.
Now, maybe they could have fixed that issue instead, but going from 29 to 58 workers is easy; it's not the same as going from 29,000 to 58,000. And 1,000 hosts vs 500 is a non-trivial cost.
one process per container, easy peasy
This is what they did, but because they didn't need to schedule other jobs on the same machine, kubernetes or even docker would be overkill.
In this case, simple VM orchestration seems like a fine solution.
One process per container plus multiprocessing is a huge lift most of the time. I've done it, but it can be a mess, because you don't have as much of a handle on containers as on subprocesses; you can only poke them at a distance through the control plane.
It is ridiculous people brag about it.
Guys, if you have the budget, maybe I can help you bring this up by a couple orders of magnitude.
All of those would affect the answer, and would preclude being able to guarantee "up this by couple orders of magnitude"
And no, it does not require any special tricks. It is regular Java / WebFlux / REST / MongoDB backend service.
CPUs can really do a lot; if your node processes 16 requests per second on a multi-core machine, then you are using billions of clock cycles and gigabytes of potential memory bandwidth for a single request. Something is not quite right...
It seemed to be all about how to extract the most performance from the lemon they had to deal with.
I found the linked reference really informative too: https://rachelbythebay.com/w/2020/03/07/costly/
In my experience, most applications that mostly serve documents from databases should be able to handle at least 10k requests per second on a single node. This is 600k requests per minute on one node, compared to their 1M per 1,000 nodes.
This is what I typically get from a simple setup with Java, WebFlux and MongoDB, with a little experience in what stupid things not to do, but without spending much time fine-tuning anything.
I think bragging about performance improvements when your design and architecture are already completely broken is, at the very least, embarrassing.
> poor hindsight from original developer (co-founder)
Well, you have a choice of technologies to write your application in, so why choose one that sucks so much when there are so many others that suck less?
It is not a poor choice; it is a lack of competency.
You are a co-founder and want your product to succeed? Don't do stupid shit like choosing a stack that already makes reaching your goal very hard.
They’re all web-capable and blow the doors off PHP, Python, etc.
Good article though! I’ve dealt with these exact issues and they can be very frustrating.
Usually engineering blogs exist to show that there is fun stuff to do at a company. But here it just seems they have no idea what they are doing. Which is fine; I'd classify myself in the same category.
Reading the article, I don't feel like they have solved their issue; they have just created more future problems.