I guess to make this work well you have to do classification (regular request vs. malicious) on several protocol layers and then reroute or drop packets accordingly. But how does that prevent severe service degradation - you still have to do some kind of work (in computation and energy) on the listening side or can fat edge-servers just eat that up?
You can break down DDoS into roughly three categories:
1. Volumetric (brute force)
2. Application (targeting specific app endpoints)
3. Protocol (exploiting protocol vulnerabilities)
DDoS mitigation providers concentrate on 1 & 3.
The basic idea is: attempt to characterize the malicious traffic if you can, and/or divert all traffic for the target. Send the diverted traffic to a regional "scrubbing center"; dirty traffic in, clean traffic out.
The scrubbing centers buy or build mitigation boxes that take large volumes of traffic in and then do heuristic checks (liveness of sender, protocol anomalies, special queueing) before passing it to the target. There's some in-line layer 7 filtering happening, and there's continuous source characterization that pushes basic network-layer filters back towards ingress.
You can do pretty simple statistical anomaly models and get pretty far with attacker source classification, and with tracking targets so you can be selective about what needs to be diverted.
A lot of major volumetric attacks are, at the network layer, pretty unsophisticated; they're things like memcached or NTP floods. When you're special-casing traffic to a particular target through a scrubbing center, it's pretty easy to strip that kind of stuff off.
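To make that concrete, here's a minimal sketch of the kind of simple statistical source classification described above: track a smoothed per-source request rate and flag sources far above the median. The smoothing factor and threshold multiplier are invented for the demo, not anything a real scrubbing center uses.

```python
from collections import defaultdict

class SourceClassifier:
    """Toy per-source anomaly score: an exponentially weighted moving
    average (EWMA) of packets per interval, flagged when it exceeds a
    multiple of the median rate. All constants are illustrative."""

    def __init__(self, alpha=0.3, multiplier=10.0):
        self.alpha = alpha            # EWMA smoothing factor
        self.multiplier = multiplier  # how far above baseline is "anomalous"
        self.rates = defaultdict(float)

    def observe(self, counts):
        """counts: dict of source IP -> packets seen this interval."""
        for src, n in counts.items():
            old = self.rates[src]
            self.rates[src] = self.alpha * n + (1 - self.alpha) * old

    def anomalous(self):
        """Return sources whose smoothed rate dwarfs the median rate."""
        if not self.rates:
            return set()
        baseline = sorted(self.rates.values())[len(self.rates) // 2]
        return {src for src, r in self.rates.items()
                if r > self.multiplier * max(baseline, 1.0)}
```

A flooding source stands out after a single interval, which is roughly why simple models get you pretty far here.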
Were these heuristics done in hardware then? ASICs? FPGAs? Could you elaborate on what the "liveness of sender" and "special queueing" heuristics are?
It's definitely not the case that all DDoS attacks can be reliably cleaned up in an ISP scrubbing center.
https://patents.google.com/patent/US7992192
Once an attack was detected, its signature was sent to a second system: a series of hardware devices optimized for layer 7 packet inspection. The devices were updated with signatures of current attacks and then checked every incoming packet for those signatures. Any packet that matched was parsed for its source, and the router was updated to drop traffic from that source for a period of time.
As far as I know, today's techniques are fairly similar, along with just having a whole lot of computers that can absorb the traffic.
At least in the product I worked on, L7 processing was done purely in software. You could probably make hardware to do that but there's not a ton of benefit as you're pretty much constrained by memory bandwidth, not CPU power, once you start looking at anything past fixed headers.
(Our product also performed deep-packet inspection – in fact that was its original function – so the L7 processing was probably a bit more general than DDoS-only products.)
It also did layer 2 and 3 detection and looked for the stuff mentioned below, like IPs and ports and whether the 3-way handshake was "normal". Stuff like that.
Was this custom DPI hardware or something from a vendor?
From there, you can add layers of protection ranging from simple things like blocking traffic that is obviously malicious (TCP flags, port numbers, etc) to more complex things like pattern recognition in both the overall trends of the data and on a per-packet basis. After you've decided with a decent certainty that it's not malicious traffic, you pass it off to the actual backend service.
For systems that are designed to scale horizontally, that may be a neighboring machine (or even the same machine) in that data center. For single-homed backend systems that can't scale horizontally to multiple locations, that "clean" traffic is then sent via some mechanism (possibly a GRE tunnel, possibly just raw internet traffic to a secret IP) to the backend service. Depending on the methodology used, the filtering may be a true bidirectional proxy, in which case the reply goes back to the scrubber and then out to the original sender, or it may be a unidirectional proxy, in which case the reply goes directly back to the original sender.
All attack mitigation works in some way like this, whether it be by designing your application from the beginning to be multi-homed and able to run in multiple datacenters, or by installing a separate mitigation layer that scrubs attack traffic.
1. Static page caching (ideally in RAM) - dynamically generated content will kill you quicker than anything else, especially calls to a database. WordPress is very easy to kill in its default state.
2. Kill high-frequency requests from the same location as quickly as possible (make sure your response is smaller than the data they send you - ultimately you want their systems to be busier than yours). You want to free the port up as quickly as possible.
3. Move anybody you can identify as a legitimate user (credentials, low frequency requests) out to another server if possible.
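Point 2 above can be sketched as a per-IP token bucket; once a source exhausts its budget, you answer with a tiny error or drop, which is far cheaper than serving the page. The rate and burst numbers here are illustrative, not a recommendation:

```python
import time
from collections import defaultdict

class PerIPRateLimiter:
    """Token bucket per source IP: each source accrues `rate` tokens
    per second up to `burst`; a request spends one token."""

    def __init__(self, rate=5.0, burst=10.0):
        self.rate = rate    # tokens refilled per second
        self.burst = burst  # maximum bucket size
        self.buckets = defaultdict(lambda: [burst, 0.0])  # ip -> [tokens, last_ts]

    def allow(self, ip, now=None):
        now = time.monotonic() if now is None else now
        tokens, last = self.buckets[ip]
        tokens = min(self.burst, tokens + (now - last) * self.rate)
        if tokens >= 1.0:
            self.buckets[ip] = [tokens - 1.0, now]
            return True
        self.buckets[ip] = [tokens, now]  # exhausted: reject cheaply
        return False
```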
Firewall-wise, my system sits in the cloud, so high-frequency traffic is usually the only issue I have to deal with. Interested to hear any advice from other people here.
Or, if you want to be fancy, "tarpit" them (complete TCP handshake and then ignore, forcing attacker to actually commit resources), but apparently that's of questionable value these days. [1]
I find that using a combination of nginx's limit_req and fail2ban over nginx logs is an easy measure that already goes a long way in handling basic types of DoS, like clients producing an abnormally high volume of requests.
A distributed DoS attack has many sources, and when including botnets on infected consumer systems you have legitimate source addresses/devices as well. This defeats most "blackhole the source" options as the source is the same thing as legitimate visitors/customers.
So for a DDoS that simply tries to saturate your link(s) and where you can't blackhole the source, the only 'protection' is having more bandwidth than the attackers have.
After that a few other things come into play: attack traffic from legit sources may have a pattern, so while you can't blackhole upstream, you can prevent traffic matching that pattern from reaching the actual application/site. This is relevant in cases where you might suffer from application overload before link overload. If your link can handle the DDoS traffic but your application can't, you're still screwed. (And by application I include load balancers, databases, storage, etc.)
Anycast is the most important piece of the puzzle, allowing you to route traffic to a bunch of different locations.
Let's say you can handle 10 Gbps at a single location. If the traffic is evenly split between 100 destinations then you can have a single IP that can handle 1 Tbps of traffic.
Of course, the setup behind these IPs might vary a lot, and one might even use DNS load balancing in front of the IPs.
If you're going to find patterns to decide what to block then you first need to make sure you receive all the traffic. If a single entry point can't handle it, well, then you need to load balance the incoming traffic.
We try to publish most of what we do, the more obvious links:
https://blog.cloudflare.com/how-cloudflares-architecture-all...
https://blog.cloudflare.com/meet-gatebot-a-bot-that-allows-u...
https://blog.cloudflare.com/the-root-cause-of-large-ddos-ip-...
https://blog.cloudflare.com/memcrashed-major-amplification-a...
https://blog.cloudflare.com/syn-packet-handling-in-the-wild/
https://blog.cloudflare.com/reflections-on-reflections/
https://blog.cloudflare.com/say-cheese-a-snapshot-of-the-mas...
https://blog.cloudflare.com/the-new-ddos-landscape/
https://blog.cloudflare.com/unmetered-mitigation/
https://blog.cloudflare.com/introducing-the-p0f-bpf-compiler...
And many more.
Also two talks:
https://idea.popcount.org/2016-02-01-enigma---building-a-dos...
https://idea.popcount.org/2015-11-16-black-hat-eu---defendin...
> But how does that prevent severe service degradation
It doesn't. You DROP the most specific thing you can. To avoid collateral damage we are able to do "scattering" (move the client across IPs in the hope the attack won't follow), and, for example, apply the controversial limits only in certain geographical areas (the anycast network allows this).
> you still have to do some kind of work (in computation and energy) on the listening side
Yes. BPF for L3 works like a charm. Read up on XDP.
> or can fat edge-servers just eat that up?
Yes and no. You have to optimize specifically; whatever you do probably won't make Apache or IIS work under DDoS. Most vendors use "scrubbing centres", where they can have a small number of beefy dedicated servers. We didn't find this architecture sufficient though, so in our case the edge servers themselves handle the load. But we do spend time tuning the servers and our applications.
From https://en.wikipedia.org/wiki/DDoS_mitigation:
One technique is to pass network traffic addressed to a potential target network through high-capacity networks with "traffic scrubbing" filters.
No Scrubs: The Architecture That Made Unmetered Mitigation Possible - https://blog.cloudflare.com/no-scrubs-architecture-unmetered...
Meet Gatebot - a bot that allows us to sleep - https://blog.cloudflare.com/meet-gatebot-a-bot-that-allows-u...
How Cloudflare's Architecture Allows Us to Scale to Stop the Largest Attacks - https://blog.cloudflare.com/how-cloudflares-architecture-all...
Kernel bypass - https://blog.cloudflare.com/kernel-bypass/
SYN packet handling in the wild - https://blog.cloudflare.com/syn-packet-handling-in-the-wild/
How to achieve low latency with 10Gbps Ethernet - https://blog.cloudflare.com/how-to-achieve-low-latency/
How to receive a million packets per second - https://blog.cloudflare.com/how-to-receive-a-million-packets...
Introducing the BPF Tools - https://blog.cloudflare.com/introducing-the-bpf-tools/
BPF - The Forgotten Bytecode - https://blog.cloudflare.com/bpf-the-forgotten-bytecode/
Introducing the p0f BPF compiler - https://blog.cloudflare.com/introducing-the-p0f-bpf-compiler...
Single RX queue kernel bypass in Netmap for high packet rate networking - https://blog.cloudflare.com/single-rx-queue-kernel-bypass-wi...
They are also doing a webinar (apologies for the link) so you can see exactly how it's implemented: https://www.incapsula.com/blog/want-to-see-what-a-live-ddos-...
They’d rather sell yet another service rather than supporting open protocols.
PSA: this user's profile definitely deserves reading, everyone go look
I also wonder why attacks often last only a few hours.
2. Attack what? It's a distributed DoS, the calls are coming from all over. You mean going after every node sending traffic? What would "attacking them" even mean? It's not like you can shut them down.
3. All those nodes are innocent and being used unknowingly. Attacking them would be both illegal (see point 1) and pretty unethical: you're deliberately aiming at innocents and not the attacker (whom you have no chance of locating). Imagine if you took down a hospital attempting to stop an NTP flood on your dumb blog. Have fun explaining why that was necessary.
"Counter-hacking" sounds cool and sexy, but there are reasons why it is never done.
Using the botnets costs either money (if you're renting one) or opportunity (if you own one and could be renting it out).