The approach uses a ring buffer. If you're not familiar with them, a ring buffer is a fixed-size array (or linked list) that you iterate through monotonically, wrapping around to the beginning when you reach the end. Our ring buffer will hold timestamps and should be initialized to hold 0's -- ensuring that only someone with a time machine could be rate limited before they send any requests. The size of the buffer is the rate limit's value, expressed in whatever time unit you find convenient.
As each request comes in, you fetch the value from the buffer at the current position and compare it to the current time. If the stored value is newer than the current time minus your rate limit interval, you return a 429 to the client and are done. Otherwise the request is OK and you should serve it normally -- but first you store the current timestamp in the buffer and advance the index.
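A minimal Python sketch of the scheme described above (the class name and the explicit `now` parameter are my own, added for testability; in production you'd just call `time.time()`):

```python
import time

class RingBufferLimiter:
    """Sketch: allow at most `limit` requests per `interval` seconds."""
    def __init__(self, limit, interval):
        self.interval = interval
        self.buffer = [0.0] * limit   # epoch timestamps; 0 means "never seen"
        self.index = 0

    def allow(self, now=None):
        now = time.time() if now is None else now
        # The slot at the current index holds the timestamp of the request
        # made `limit` requests ago; if that falls inside the interval, reject.
        if now - self.buffer[self.index] < self.interval:
            return False   # caller should respond with a 429
        self.buffer[self.index] = now
        self.index = (self.index + 1) % len(self.buffer)
        return True
```

For example, `RingBufferLimiter(3, 60)` admits three requests and then rejects a fourth arriving within the same 60-second window.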
Also, what do you mean by "fixed" memory? Sure, the memory doesn't grow over time, but neither does the memory of the other solutions. Of the solutions listed in the article, this is the most memory-hungry.
Finally, yes, it's not Redis -- but it could be exposed as a service if you wanted that pretty easily. Operational complexity will exist regardless and depending on your organization different solutions will be appealing for different reasons.
And right there they broke the service for legitimate users. Totally unacceptable collateral damage IMHO.
This is how Mailgun and their ilk operate, and while it's annoying to get bitten by their rules (we forgot to warm up a mailing list once and got a temporary suspension as our bounce rate was too high) they treated us like adults, told us why our service had been suspended and proceeded to help us clean up the mailing list. If they had pulled some shadow banning BS we'd have just left the service as we wouldn't be able to trust that they're not messing us (and our clients) around.
Shadow banning works just fine for online forums and the like. It's a pretty terrible method of rate limiting though.
Second thought: token buckets have a nice property of being really cacheable once they expire. You can push down a "won't refill until timestamp" and then clients can skip checking altogether.
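Roughly what that could look like, as a hypothetical sketch (names are mine): when the bucket is empty, it hands back a "won't refill until" timestamp that a client can cache and use to skip the round trip entirely.

```python
import time

class CacheableBucket:
    """Token bucket that, once empty, reports the timestamp at which it
    will next hold a token -- pushable to clients as e.g. a Retry-After
    or X-RateLimit-Reset value."""
    def __init__(self, capacity, refill_rate, now=None):
        self.capacity = capacity
        self.refill_rate = refill_rate   # tokens per second
        self.tokens = float(capacity)
        self.updated = time.time() if now is None else now

    def take(self, now=None):
        now = time.time() if now is None else now
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.updated) * self.refill_rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True, None
        # "Won't refill until" -- safe to cache client-side, since the
        # bucket cannot possibly admit a request before this instant.
        return False, now + (1 - self.tokens) / self.refill_rate
```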
This isn't just about the application being able to handle double the amount of load, it's about keeping costs down and preventing someone from drastically increasing your bill. I work at a fintech company, and many of the 3rd party providers we use have a non-negligible cost per API call. You wouldn't want to wake up one morning and find that someone triggered one of those calls thousands of times while you were asleep.
Your incoming requests should be balanced across all of the servers so you just derive the allowed throughput and divide by the number front ends....
In practical terms this means that whenever the characteristics of your environment change, your rate limiting suddenly gets wonky. If you're doing a rolling deployment and take 1/3 of your machines out of service, or if some machines go down for maintenance, or a variety of other things happen to your machines, you're going to end up with fluctuating rate limits.
It sounds like OP is running a relatively mature service, so this probably isn't the best idea for them.
In the degenerate case, imagine a rate limit of 2 total requests per minute load balanced across 2 servers with enough traffic that my request is basically hitting a random server. In this case, 50% of the time my second request will be rate limited incorrectly because each server has a bucket of 1 and my second request went to the same server as my first.
I'm sure someone smarter than me (and better at probability) could come up with an equation where you input your rate limit & number of servers and it tells you the probability of a false positive for a user.
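Lacking a closed form, a quick Monte Carlo sketch (all names hypothetical) gives a feel for it: each of the user's `limit` globally-legitimate requests lands on a uniformly random server, and a false positive occurs when any server sees more than its `limit/servers` share.

```python
import random

def false_positive_rate(limit, servers, trials=100_000, seed=42):
    """Estimate how often a user sending exactly `limit` requests
    (globally allowed) gets wrongly rejected when each server holds
    an independent bucket of limit/servers tokens."""
    rng = random.Random(seed)
    per_server = limit // servers
    rejected = 0
    for _ in range(trials):
        counts = [0] * servers
        for _ in range(limit):
            counts[rng.randrange(servers)] += 1
        if max(counts) > per_server:   # some local bucket overflowed
            rejected += 1
    return rejected / trials
```

Running it on the degenerate case above (`false_positive_rate(2, 2)`) comes out around 0.5, matching the parent comment's reasoning.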
In practice, when you are receiving enough traffic to make throttling practical you aren't usually throttling at 2 RPM across 2 servers.
Otherwise you can easily get them to hit their limit super early with bad luck.
Each server contains a per-client map of token buckets, refilling at a fixed rate.
That per-server rate is calculated by taking the total global token refresh rate and dividing it by the number of servers.
The end result is exactly the same, but now you are stateless and have eliminated the bottleneck of a central token bucket.
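A sketch of that per-server bucket in Python (names are mine; the comment only mentions dividing the refill rate, so dividing the burst capacity the same way is my assumption):

```python
import time
from collections import defaultdict

class LocalBucket:
    """One server's share of a global token bucket: the global refill
    rate (and, by assumption, burst capacity) divided by server count,
    so N independent servers approximate one bucket with no shared state."""
    def __init__(self, global_rate, num_servers, global_burst, now=None):
        self.rate = global_rate / num_servers       # this server's refill share
        self.capacity = global_burst / num_servers  # and its burst share
        self.tokens = self.capacity
        self.updated = time.time() if now is None else now

    def allow(self, now=None):
        now = time.time() if now is None else now
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

# The per-client map each server keeps, as described -- no central store.
buckets = defaultdict(lambda: LocalBucket(global_rate=10, num_servers=2,
                                          global_burst=4))
```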
A better approach, I think, is delaying request handling based on resource allocation. If one client is disproportionately using more time than others, that client's requests will be queued up to process later, while well-behaved clients get their requests handled quickly. I think this is a more realistic approach and imposes fewer arbitrary restrictions.
Unlike most rate limiters, this requires little or no tuning. Each source IP address gets an equal amount of service. Sources which generate heavy load compete with themselves, not others.
The main tuning parameter is how long a queue can get. It's useful to have both length and time limits. But only hostile sources should hit those. They don't change with capacity or number of users. Brief load transients should be handled normally, if slowly.
This won't stop a distributed denial of service attack from many IP addresses. But anything coming from a small number of sources will be handled without much trouble.
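A toy sketch of that per-source fair queueing (the names and the simple round-robin policy are my own illustration of the idea):

```python
from collections import OrderedDict, deque

class FairScheduler:
    """One queue per source IP, served round-robin, so a heavy source
    competes with itself rather than starving everyone else."""
    def __init__(self, max_queue_len=100):
        self.queues = OrderedDict()          # ip -> deque of pending requests
        self.max_queue_len = max_queue_len   # only hostile sources should hit this

    def enqueue(self, ip, request):
        q = self.queues.setdefault(ip, deque())
        if len(q) >= self.max_queue_len:
            return False                     # drop: an abusive source filled its queue
        q.append(request)
        return True

    def next_request(self):
        # Serve one request from the least-recently-served source,
        # then rotate that source to the back of the line.
        if not self.queues:
            return None
        ip, q = next(iter(self.queues.items()))
        request = q.popleft()
        del self.queues[ip]
        if q:
            self.queues[ip] = q              # re-insert at the end of the rotation
        return request
```

Note how the only knob is the queue length, per the parent comment; a time limit per queued request could be added the same way.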
Queues and backpressure to properly serve bursty traffic and rate limiting to protect against bad actors.
If you issue a 429, the client knows it needs to wait and retry, and you've pushed the backpressure all the way past the demarcation line of your own infrastructure.
See https://www.ratelim.it/documentation/safety for how to configure a limit with "burst" in RateLim.it
Queueing sounds nice, but the web is synchronous. If somebody crushes you do you really want to queue them and then process the reqs 5 min later? After the connection has already terminated?
[0] https://github.com/michaelmior/locomotor
[1] https://github.com/michaelmior/locomotor/blob/master/bench/t...
local id = "user_1"
-- Get the number of tokens the user has left
local tokens = tonumber(redis.call("HGET", id, "tokens"))
if tokens == nil or tokens <= 0 then
-- User has no tokens left
return false
else
-- User has tokens left, decrement token count
redis.call("HINCRBY", id, "tokens", -1)
return true
end
Redis Lua scripts block while they are running, so the two Redis calls here cannot be interleaved with other reads, as in the token bucket example from the article. My apps all write the Lua into Redis and store the hash when they boot up. Duplicative, but it means everybody is on the same page and it's easy to keep the Lua in the main codebase.
Reading all the design and discussion, I was very curious how you structured things at a brass-tacks storage level.
This article is about the server deciding which requests to reject. Exponential backoff is a strategy clients use to decide when to retry after their request is rejected. (Plus, the article is about malicious clients; they're not going to follow your preferred backoff strategy.)
More concretely, how would exponential backoff ensure that you don't allow more than 10 requests/second per user?
So this should be a tool that every service at scale has access to.
IIRC the lack of rate limiting burned Apple relatively recently.
Is this yet another area where we all reinvent the wheel? I've yet to see a recommendation for an off-the-shelf solution.
https://engineering.classdojo.com/blog/2015/02/06/rolling-ra...
Obviously, in some cases such design is not possible. The classic case is HTTP, where the entire purpose is to supply some (arbitrarily large) volume of data in response to a small request, and therefore there is a bandwidth challenge.
Conventional defense strategies tend to utilize the fact that TCP requires a three-way handshake to instantiate, thus validating the peer's IP address (unlike UDP), and include:
(1) An authenticated session, eg. using a session key derived from a separate API call.
(2) Rate limiting per authenticated user, either based upon data over time or request frequency. (This alone is the subject of the article)
(3) Segregating read-only, cacheable data (even if it expires within seconds) on separate infrastructure such as CDNs or memory-based caches.
(4) Aggressive use of HTTP caching.
(5) Careful tuning of HTTP session length related configuration to suit the application profile.
A newer strategy is the use of CAPTCHAs; however, this is not viable in automated (i.e. API) use cases. Another relatively 'new' (for HTTP) strategy is the use of websockets or other mechanisms for real time 'push', to avoid the latency and processing overheads of the conventionally enforced HTTP request-response model.
Additional options would include segregating clients to speak to different servers (ideally in different data centers, on different links) such that overhead may be scaled across different infrastructure. Thus even if a single server is targeted by an attacker damage is limited in scope.
Another architectural defense would be the enforcement of a gateway/proxy layer (internal or third party) obscuring real datacenter netblocks from attackers, however this comes at the cost of latency where data cannot be cached.
Cloudflare basically provide all of the above (plus additional features) as a service.
Finally, in native mobile application development where an API is the primary interface with some central system, another simple step that can be taken (with appropriate care regarding peer authentication and cache invalidation) is the use of a cached set of IP addresses within the client as a means to identify servers. In this way, attacks against DNS infrastructure will also be nullified, and you can focus the demands of segments of your user base on different infrastructure. (Here in China, DNS is often hijacked or broken, though this is much less of a concern in typical western environments. It will also reduce startup latency on slow mobile links the world over.)