What have you used that has been effective?
After that, you can throw a CAPTCHA on pages (particularly submission pages), but that will harm legitimate users as well as bots.
Make sure your origin server is only reachable from Cloudflare. If people can hit it directly, they bypass Cloudflare entirely. If you use firewalld, here is a snippet from my setup script that you can use:
# Allow HTTP/HTTPS only from Cloudflare's published IPv4 ranges.
for range in $(curl -s -X GET "https://api.cloudflare.com/client/v4/ips" | jq -r '.result.ipv4_cidrs[]'); do
    for port in 80 443; do
        echo "Inserting firewalld rule for address range '${range}' on port '${port}'"
        firewall-cmd --zone=public --permanent \
            --add-rich-rule="rule family=\"ipv4\" source address=\"${range}\" port protocol=\"tcp\" port=\"${port}\" accept"
    done
done
# Remove the blanket http/https services so only the rich rules above apply.
firewall-cmd --remove-service=http --permanent
firewall-cmd --remove-service=https --permanent
firewall-cmd --reload
Aren't you supposed to use Argo or certificate authentication for this?
You mentioned that allow-paths is not quite an option as the main page gets hit by the bots; how are you detecting this? Maybe some automation here is all that is needed. Note that lots of folks use ad blockers of various sorts, and some analytics sites classify them as 'bots' or 'grey visitors', which makes even landing on some home pages a very sad experience when a full-blown CAPTCHA shows up (at which point I for sure stay away).
It has the nice side effect of protecting you from run-of-the-mill DDoS attacks too.
(I realize half my comments here are about OpenResty, but I have no affiliation with them. I'm just a happy user.)
1. Block all unverified bots with a bot score of 1. This will still allow popular web crawlers but could be strict enough to block a curl request.
2. Use Managed Challenge for unverified bots with a bot score less than 30. This will silence most of the troublemaking bots and provide a JavaScript (not necessarily CAPTCHA) solution for users who are incorrectly scored.
3. Add rate limiting. Figure out a realistic access rate, double it and use that as a hard limit that will block traffic for an hour or day depending on your needs.
4. Add more sensitive rate limits and play with Managed Challenge rules. Use the simulate option for a few days before enabling any rate limit. You can add challenges here if you feel a limit might be affecting users too.
5. Review rate limits and firewall reports regularly and adjust. With any Managed Challenge rules, make sure to check the percentage completed to see if you’re trapping real users. This number should be as close to 0 as possible. Repeat step 4.
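To make the first two steps concrete, here is a minimal sketch of the decision they describe. It is not Cloudflare's rule syntax; the `Request` class, its field names, and the `decide` function are hypothetical, and it assumes the bot score is an integer where lower means more likely automated.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    """Hypothetical request summary; field names are illustrative only."""
    bot_score: int      # assumed: low = likely automated, high = likely human
    verified_bot: bool  # assumed: True for known good crawlers


def decide(req: Request) -> str:
    """Return 'allow', 'block', or 'managed_challenge' for a request."""
    if req.verified_bot:
        return "allow"              # popular web crawlers pass through
    if req.bot_score <= 1:
        return "block"              # step 1: block unverified bots scored 1
    if req.bot_score < 30:
        return "managed_challenge"  # step 2: challenge scores below 30
    return "allow"
```

The order of the checks matters: the verified-bot exemption has to come before the score checks, otherwise a legitimate crawler with a low score would be blocked.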
You’ll want to get around your own blocking rules with some complementary whitelisting rules.
Although it’s advised to lock down your origin server so that non-Cloudflare traffic can't reach it, you might not be able to do so easily if you’ve got load balancers and other infra in the way that can’t be touched. Just make sure your root domain isn’t leaking your www IP address. You can use CNAME flattening and you should be alright.
The difficulty in these solutions is managing all the rules you can make. Things can quickly become too complicated to make changes easily. Keep it simple, have a few basic but aggressive blocking rules and revise your whitelist and rate limits regularly. Good luck
- Set up a CAPTCHA or just block users from certain countries if you know where your traffic comes from. This can sometimes create issues for your users on VPNs, so you have to make the call depending on how many of your users may be using one. At the minimum, add a CAPTCHA.
- Create more Page Rules in Cloudflare and block anything that doesn't match them. For example, if your URLs start with a specific prefix, drop anything that doesn't match it.
- Make sure to return a 444 status from your server directly if bots are bypassing Cloudflare and hitting the IP directly. Sample code for nginx 1.19.4 or higher:
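server {
    listen 80 default_server;
    listen [::]:80 default_server;
    # "ssl" is needed on the 443 listeners for ssl_reject_handshake to apply
    listen 443 ssl default_server;
    listen [::]:443 ssl default_server;
    ssl_reject_handshake on;
    server_name _;
    # 444 closes the connection without sending a response
    return 444;
}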
If bots are getting too aggressive, I start with Block first, ask questions later. Depending on your traffic and users, it may be the right strategy.
OP even said they gave up on Cloudflare, but people are trying to upsell him on the higher tiers through this HN post!
Before them, companies like Imperva were charging way more for inferior blocking.
Cloudflare is awesome!
1. Why do you want to stop bots? Are they actually overloading your resources, or are they just noisy in the logs? If you can easily handle the traffic, maybe find a way to filter the logs better.
2. How do you know they're bots? If they're easy to identify, can you write a few simple rules to remove most of them?
2a. Are they mindless scans? Make sure your app doesn't even see requests to resources which don't exist.
2b. Are they scraping content? Set up per-resource-per-IP rate limits (token bucket style).
2c. Are they coming from a specific network, for example Tor, AWS, or similar? Put in an auto-updating list of sources that get dropped at firewall level.
3. As mentioned in other comments, if you're using some proxy in front of your service, ensure you drop any traffic which bypasses it.
Basically consider what's actually happening and respond to that. There's no setting that will improve things without side effects, or it would be already turned on.
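For 2b, a token bucket keyed on (IP, resource) could look roughly like the sketch below. The class name, parameters, and in-memory dict are my own assumptions for illustration; a real deployment would more likely do this in the proxy or a shared store, and would need to expire idle buckets.

```python
import time
from typing import Callable, Dict, Tuple


class TokenBucketLimiter:
    """Per-(IP, resource) token bucket; a sketch, not production code."""

    def __init__(self, rate: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate          # tokens refilled per second
        self.capacity = capacity  # maximum burst size
        self.clock = clock        # injectable so the limiter can be tested
        # (ip, resource) -> (tokens remaining, time of last update)
        self._buckets: Dict[Tuple[str, str], Tuple[float, float]] = {}

    def allow(self, ip: str, resource: str) -> bool:
        """Consume one token if available; return False when the bucket is empty."""
        now = self.clock()
        key = (ip, resource)
        tokens, last = self._buckets.get(key, (self.capacity, now))
        # Refill in proportion to elapsed time, capped at the bucket capacity.
        tokens = min(self.capacity, tokens + (now - last) * self.rate)
        allowed = tokens >= 1.0
        if allowed:
            tokens -= 1.0
        self._buckets[key] = (tokens, now)
        return allowed
```

Something like `TokenBucketLimiter(rate=1.0, capacity=10.0)` lets each IP burst 10 requests at a given resource and then sustain one per second, while the same IP hitting a different resource gets its own bucket.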
The purpose of this thread is to gather ideas and experiences. A silver bullet would be great, but since it's not realistic, all ideas are more than welcome.
https://github.com/brianhama/bad-asn-list
Unfortunately, you'll also alienate VPN users, so you'll have to decide if it's worth the cost.
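As a rough sketch of how an ASN list like that could be applied: the parser below assumes one ASN per line in the first comma-separated field (with or without an "AS" prefix), which is an assumption about the file format, and it assumes something upstream has already resolved the client's ASN. Both function names are hypothetical.

```python
from typing import Iterable, Set


def parse_asn_list(lines: Iterable[str]) -> Set[int]:
    """Collect ASNs from lines whose first comma-separated field is an ASN.

    Assumed format only; header rows and other non-numeric lines are skipped.
    """
    asns: Set[int] = set()
    for line in lines:
        field = line.split(",", 1)[0].strip().strip('"').upper()
        if field.startswith("AS"):
            field = field[2:]
        if field.isdigit():
            asns.add(int(field))
    return asns


def is_blocked(client_asn: int, bad_asns: Set[int]) -> bool:
    """True when the client's ASN (resolved elsewhere) is on the list."""
    return client_asn in bad_asns
```

Keeping the check as a plain set lookup makes it easy to swap "block" for "challenge" later if too many VPN users turn out to be affected.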
I block everything else
That kills most of it.
Essentially, Gatekeeper is a rules-engine with a fancy UI that allows you to craft policies specific to your site and the traffic that is visiting your site. For example, you can say "Allow Googlebot" and "Show CAPTCHA to visitors from AWS on every fifth visit". If you'd like to communicate offline, you can find our email address in my profile.
Gatekeeper also allows you to look over previous traffic patterns and adjust your custom rules. Philosophically, our assumption is that website operators know their sites best and know what content they are sensitive about; Gatekeeper is a tool to help them review traffic and express custom rules. I think Cloudflare is a better fit for operators who don't want to spend time fiddling with very granular knobs to optimize for their specific site and their specific concerns.
Gatekeeper can also be used for other purposes. For example, the idea behind Gatekeeper originated because we have a site where we want to make information freely available to humans, but not to scrapers. However, Gatekeeper could also be used to upsell users who frequent your site (e.g. "We noticed you've viewed 10 articles in the past week, please subscribe!").
If you're considering custom logic like https://news.ycombinator.com/item?id=33160679 , Gatekeeper already has a rules engine, so you can specify such policies without having to write your own infrastructure.
I'd be happy to chat further to learn more about what you're looking for and to see if Gatekeeper might be able to help. Feel free to reach out via email (in profile) if you'd like to talk more.