Ask HN: Best way to stop bot traffic?

36 points · ethor · 3y ago · 29 comments

I have a site that gets bombarded with worthless robot traffic. I have previously used Cloudflare, but a lot of bots still get through.

What have you used that has been effective?

29 comments

freedomben · 3y ago

Make sure your Cloudflare settings are as aggressive as possible. You might need to upgrade to the first paid level (I think "Pro"?) to activate the most aggressive settings, but it does work very well.

After that, you can throw a CAPTCHA on pages (particularly submission pages), but that will harm legitimate users as well as bots.

Make sure your origin server is only reachable from Cloudflare. If people can hit it directly, then they bypass Cloudflare. If you use firewalld, I wrote this in my setup script that you can use:

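    # Allow HTTP/HTTPS only from Cloudflare's published IPv4 ranges.
    # Note: IPv6 ranges (.result.ipv6_cidrs) are not handled here and would need family="ipv6" rules.
    for range in $(curl -s -X GET "https://api.cloudflare.com/client/v4/ips" | jq -r '.result.ipv4_cidrs[]'); do
      for port in 80 443; do
        echo "Inserting firewalld rule for address range '${range}' on port '${port}'"
        firewall-cmd --zone=public --permanent \
          --add-rich-rule="rule family=\"ipv4\" source address=\"${range}\" port protocol=\"tcp\" port=\"${port}\" accept"
      done
    done

    # Remove the blanket http/https services so only the rules above admit web traffic, then reload.
    firewall-cmd --remove-service=http --permanent
    firewall-cmd --remove-service=https --permanent
    firewall-cmd --reload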

gruez · 3y ago

> If you use firewalld, I wrote this in my setup script that you can use:

Aren't you supposed to use argo or certificate authentication for this?

viraptor · 3y ago

Not supposed to. These are all valid options.

ethor (OP) · 3y ago

In my case I am concerned about false positives, since visitor experience is a higher priority than blocking all bots. Cloudflare, in my experience, does generate too many false positives when it's too aggressive. A very nice idea in other cases, though.

out-of-ideas · 3y ago

Define what visitor experience means to you. Lots of folks think a CAPTCHA is acceptable, or that loading an operating system's worth of JavaScript, like Reddit does, is acceptable. (If we define an OS as bloatware like MS ships nowadays, or any carrier-specific phone, LOL, one could say Reddit is a JavaScript OS minus a good website for those emacs/vi fans.)

You mentioned that allow-paths is not quite an option because the main page gets hit by the bots. How are you detecting this? Maybe some automation here is all that is needed. Note that lots of folks use ad blockers of varying sorts, which some analytics sites classify as 'bots' or 'grey visitors'. That makes even landing on some home pages a very sad experience when a full-blown CAPTCHA shows up (then I stay away for sure).

tothrowaway · 3y ago

Most bots don't bother setting cookies, or downloading CSS. Exploit this by including a dummy CSS file on your site that, on the backend, stores the visitor's IP in some kind of database, or sets a cookie. If you get multiple visits from an IP that never hit the CSS file, you can be reasonably confident the user is not legit. You need to be careful about not blocking good bots though. Do a reverse DNS lookup before actually blocking an IP to make sure it's not Googlebot, yandexbot, bingbot, slurp, etc. OpenResty is great for implementing this.

It has the nice side effect of protecting you from run-of-the-mill DDoS attacks too.

(I realize half my comments here are about OpenResty, but I have no affiliation with them. I'm just a happy user.)
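
A minimal sketch of the CSS-beacon idea above, written in Python rather than OpenResty purely for illustration. The names `CssBeaconTracker` and `is_verified_crawler`, the threshold of five page views, and the hostname suffix list are all assumptions, not part of the setup described; the suffixes in particular should be checked against each crawler's own documentation.

```python
import socket

# Assumed, illustrative suffixes for well-known crawlers; verify before relying on them.
GOOD_BOT_SUFFIXES = (
    ".googlebot.com", ".google.com",            # Googlebot
    ".search.msn.com",                          # bingbot
    ".yandex.ru", ".yandex.net", ".yandex.com", # yandexbot
    ".crawl.yahoo.net",                         # slurp
)


class CssBeaconTracker:
    """Tracks which IPs requested pages and which also fetched the dummy CSS file."""

    def __init__(self, max_pages_without_css=5):
        self.max_pages_without_css = max_pages_without_css
        self.page_hits = {}       # ip -> number of page requests seen
        self.fetched_css = set()  # ips that requested the dummy CSS file

    def record_page(self, ip):
        self.page_hits[ip] = self.page_hits.get(ip, 0) + 1

    def record_css(self, ip):
        self.fetched_css.add(ip)

    def is_suspect(self, ip):
        # Suspect: several page views, but the CSS file was never requested.
        if ip in self.fetched_css:
            return False
        return self.page_hits.get(ip, 0) >= self.max_pages_without_css


def is_verified_crawler(ip, reverse=socket.gethostbyaddr,
                        forward=socket.gethostbyname_ex):
    """Forward-confirmed reverse DNS: the PTR name must belong to a known
    crawler domain AND resolve back to the same IP (IPv4 only here)."""
    try:
        host = reverse(ip)[0].lower()
        if not host.endswith(GOOD_BOT_SUFFIXES):
            return False
        return ip in forward(host)[2]
    except OSError:  # covers socket.herror and socket.gaierror
        return False


def should_block(tracker, ip):
    # Only block suspects that are not verified good bots.
    return tracker.is_suspect(ip) and not is_verified_crawler(ip)
```

The tracker here lives in process memory; a real deployment would probably want a shared store with expiry so that counts survive restarts and old IPs age out.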

out-of-ideas · 3y ago

Sounds like a great way for folks who run a simple curl against your site, just to see if things are working right, to end up on the block list lol

ethor (OP) · 3y ago

Excellent idea!

clafferty · 3y ago

You’re going to need to make a few aggressive WAF Rules, pepper in some whitelisting rules and if you can, add rate limiting.

1. Block all unverified bots with a bot score of 1. This will still allow popular web crawlers but could be strict enough to block a curl request.

2. Use Managed Challenge for unverified bots with a bot score less than 30. This will silence most of the troublemaking bots and provides a JavaScript (not necessarily CAPTCHA) solution for users who are incorrectly scored.

3. Add rate limiting. Figure out a realistic access rate, double it and use that as a hard limit that will block traffic for an hour or day depending on your needs.

4. Add more sensitive rate limits and play with Managed Challenge rules. Use the simulate option before enabling any rate limits, and simulate for a few days first. You can add challenges here if you feel a limit might be affecting users too.

5. Review rate limits and firewall reports regularly and adjust. With any Managed Challenge rules make sure to check the percentage completed to see if you’re trapping real users. This number should be as close to 0 as possible. Repeat step 4.

You’ll want to get around your own blocking rules with some complementary whitelisting rules.

Although it’s advised to lock down your origin server to prevent non-Cloudflare traffic from hitting it, you might not be able to do so easily if you’ve got load balancers and other infra in the way that can’t be touched. Just make sure your root domain isn’t leaking your www IP address. You can use CNAME flattening and you should be alright.

The difficulty in these solutions is managing all the rules you can make. Things can quickly become too complicated to make changes easily. Keep it simple: have a few basic but aggressive blocking rules, and revise your whitelist and rate limits regularly. Good luck.
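
To make steps 1 to 3 concrete, here is a rough Python sketch of the decision logic. It is not Cloudflare's rule language or API: `choose_action`, `HardLimiter`, the one-minute window, and the one-hour block are hypothetical stand-ins that only illustrate the thresholds named above (score of 1, score under 30, and a hard limit at double the realistic rate).

```python
import time
from collections import defaultdict, deque


def choose_action(bot_score, verified_bot):
    """Map a bot score (1 = almost certainly automated) to an action."""
    if verified_bot:
        return "allow"              # popular, verified crawlers pass
    if bot_score <= 1:
        return "block"              # step 1
    if bot_score < 30:
        return "managed_challenge"  # step 2
    return "allow"


class HardLimiter:
    """Step 3: hard limit at double the realistic per-minute rate,
    followed by a block period for any client that exceeds it."""

    def __init__(self, realistic_per_minute, block_seconds=3600,
                 clock=time.monotonic):
        self.limit = realistic_per_minute * 2
        self.block_seconds = block_seconds
        self.clock = clock
        self.hits = defaultdict(deque)  # ip -> timestamps within the last minute
        self.blocked_until = {}         # ip -> time at which the block expires

    def allow(self, ip):
        now = self.clock()
        if self.blocked_until.get(ip, float("-inf")) > now:
            return False
        window = self.hits[ip]
        # Drop timestamps older than the 60-second window.
        while window and now - window[0] >= 60:
            window.popleft()
        window.append(now)
        if len(window) > self.limit:
            self.blocked_until[ip] = now + self.block_seconds
            window.clear()
            return False
        return True
```

Logging what `allow` would have rejected for a few days, without enforcing it, is the in-code equivalent of the simulate option mentioned in step 4.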

ethor (OP) · 3y ago

Good suggestions. A priority is to not diminish the visitor experience in any way, so I am very wary of false positives.

codegeek · 3y ago

Unfortunately you have to get more aggressive, which may sometimes block real users, but do the following:

- Set up a CAPTCHA or just block users from certain countries if you know where your traffic comes from. This can sometimes create issues for your users on VPNs, but then you have to make the call depending on how many of your users may be using a VPN. At the minimum, add a CAPTCHA.

- Create more Page Rules in Cloudflare and block requests that don't match them. For example, if your URLs start with a specific prefix, drop anything that does not match.

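- Make sure to return 444 status from your server directly if bots are bypassing Cloudflare and hitting the IP directly. Sample code for nginx 1.19.4 or higher (the release that introduced ssl_reject_handshake):

    server {
      # Catch-all for requests that match no configured server_name,
      # e.g. bots hitting the bare IP address.
      listen 80 default_server;
      listen [::]:80 default_server;

      # The "ssl" parameter makes these TLS sockets so ssl_reject_handshake applies.
      listen 443 ssl default_server;
      listen [::]:443 ssl default_server;
      ssl_reject_handshake on;  # refuse the TLS handshake; no certificate needed

      server_name _;
      return 444;  # nginx-specific code: close the connection without a response
    }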

If bots are getting too aggressive, I start with Block first, ask questions later. Depending on your traffic and users, it may be the right strategy.

ethor (OP) · 3y ago

Blocking real users is not an option - but I can see that this would be a good setup if blocking bots were of the absolute highest priority.

andrewmcwatters · 3y ago

Man, if the state of the art is to suggest something Cloudflare related, that's a really sad state of affairs.

netsectoday · 3y ago

Totally agree. I’m thinking in terms and concepts like “iptables”, “ipset”, “fail2ban”, “cidr”, “asn”… while everyone else is thinking about how to change their Cloudflare settings.

OP even said they gave up on Cloudflare, but people are trying to upsell him on the higher tiers through this HN post!

solardev · 3y ago

They really are very good. The only real vendor in the IaaS space innovating once AWS and its ilk stagnated.

Before them, companies like Imperva were charging way more for inferior blocking.

Cloudflare is awesome!

ethor (OP) · 3y ago

They do cater to a need and their solution is effective.

viraptor · 3y ago

I'd go against the "just increase the CF strictness" advice. It's counting on CF basically doing something magic and hoping it won't affect real users, and that's not really possible.

1. Why do you want to stop bots? Are they actually overloading your resources, or are they just noisy in the logs? If you can easily handle the traffic, maybe find a way to filter the logs better.

2. How do you know they're bots? If they're easy to identify, can you write a few simple rules to remove most of them?

2a. Are they mindless scans? Make sure your app doesn't even see requests to resources which don't exist.

2b. Are they scraping content? Set up per-resource-per-IP rate limits (token bucket style)

2c. Are they coming from a specific network, for example tor, AWS, or similar? Put in an auto updating list of sources that get dropped at firewall level.

3. As mentioned in other comments, if you're using some proxy in front of your service, ensure you drop any traffic which bypasses it.

Basically consider what's actually happening and respond to that. There's no setting that will improve things without side effects, or it would be already turned on.
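
For point 2b, a per-resource-per-IP token bucket can be sketched in a few lines of Python. The class name, the refill rate, and the burst capacity below are illustrative assumptions, not values recommended anywhere in this thread.

```python
import time


class TokenBucketLimiter:
    """One token bucket per (ip, resource) pair.

    Each bucket holds up to `capacity` tokens and refills at `rate`
    tokens per second; every request spends one token.
    """

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.buckets = {}  # (ip, resource) -> (tokens, last_update_time)

    def allow(self, ip, resource):
        key = (ip, resource)
        now = self.clock()
        # A bucket that has never been seen starts full.
        tokens, last = self.buckets.get(key, (self.capacity, now))
        # Refill for the elapsed time, capped at the bucket capacity.
        tokens = min(self.capacity, tokens + (now - last) * self.rate)
        if tokens >= 1:
            self.buckets[key] = (tokens - 1, now)
            return True
        self.buckets[key] = (tokens, now)
        return False
```

The capacity sets how large a burst a real visitor can make (for example, opening several pages at once), while the rate caps sustained scraping of the same resource.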

ethor (OP) · 3y ago

1. Bots are more than half of my traffic right now, and they provide no benefit; they only use my bandwidth and distort my statistics.

2. Strange behavior + technology.

2a-c. Could be part of a solution.

3. Good idea.

The purpose of this thread is to gather ideas and experiences. A silver bullet would be great, but since it's not realistic, all ideas are more than welcome.

trinovantes · 3y ago

Blocking IP addresses from cloud providers will eliminate most bots.

https://github.com/brianhama/bad-asn-list

Unfortunately you'll also alienate VPN users so you'll have to decide if it's worth the cost
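
As a rough Python illustration of the lookup involved: the linked list identifies networks by ASN, so this sketch assumes the ASNs have already been expanded into CIDR prefixes by some other means. The prefixes below are reserved documentation ranges used as placeholders, and the function names are hypothetical.

```python
import ipaddress

# Placeholder prefixes (documentation ranges), standing in for real
# cloud-provider prefixes derived from an ASN list.
EXAMPLE_CLOUD_PREFIXES = ["192.0.2.0/24", "198.51.100.0/24", "2001:db8::/32"]


def load_networks(cidrs):
    """Parse CIDR strings into IPv4Network / IPv6Network objects."""
    return [ipaddress.ip_network(cidr) for cidr in cidrs]


def is_blocked(ip, networks):
    """Return True if the address falls inside any listed network."""
    addr = ipaddress.ip_address(ip)
    return any(
        addr.version == net.version and addr in net
        for net in networks
    )
```

A linear scan like this is fine for a few hundred prefixes; at firewall scale, tools such as ipset (mentioned elsewhere in the thread) are the more usual choice.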

ianpurton · 3y ago

I use Cloudflare and only allow paths I know, e.g. /blog* and /app/*

I block everything else.

That kills most of it.
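
The same allow-list idea can be sketched at the application level in Python. This is not Cloudflare rule syntax; `ALLOWED_PATTERNS` reuses the two example patterns above and `is_allowed_path` is a hypothetical helper.

```python
from fnmatch import fnmatchcase

# Patterns taken from the example above; anything else is rejected.
ALLOWED_PATTERNS = ("/blog*", "/app/*")


def is_allowed_path(request_target, patterns=ALLOWED_PATTERNS):
    """Return True if the request path matches one of the allowed patterns."""
    # Ignore any query string; only the path is matched.
    path = request_target.split("?", 1)[0]
    # fnmatchcase is case-sensitive and lets "*" match "/" as well.
    return any(fnmatchcase(path, pattern) for pattern in patterns)
```

With these two patterns the front page ("/") is rejected too, so a site that serves real visitors there would need to add it explicitly.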

ethor (OP) · 3y ago

Most of the bot traffic hits the front page.

JimWestergren · 3y ago

Try increasing the security level in CF. You could also activate Firewall rules for certain countries that are not relevant for your site. Make sure not to apply them to known bots (Googlebot etc.).

ethor (OP) · 3y ago

I am concerned about false positives, but I will weigh the pros and cons. Thank you for your suggestion.

NetToolKit · 3y ago

If you're interested in a non-Cloudflare solution, we have developed a service called Gatekeeper, and we'd really like to get your thoughts on whether it might suit your needs: https://www.nettoolkit.com/gatekeeper/about

Essentially, Gatekeeper is a rules-engine with a fancy UI that allows you to craft policies specific to your site and the traffic that is visiting your site. For example, you can say "Allow Googlebot" and "Show CAPTCHA to visitors from AWS on every fifth visit". If you'd like to communicate offline, you can find our email address in my profile.

ethor (OP) · 3y ago

How do you differ from Cloudflare?

NetToolKit · 3y ago

In terms of implementation, one key difference is that Cloudflare sits between your servers and your users, whereas with Gatekeeper, your servers query Gatekeeper before deciding what response to send to users. One of the ramifications is that with Gatekeeper, your users never see a third-party page about verifying that they are not bots. It also means that when using Gatekeeper, you can actually debug if/when something unexpected happens.

Gatekeeper also allows you to look over previous traffic patterns and adjust your custom rules. Philosophically, our assumption is that website operators know their sites best and know what content they are sensitive about, and Gatekeeper is a tool to help review traffic and express custom rules. I think Cloudflare is a better fit for operators who don't want to spend time fiddling with various very granular knobs to optimize for their specific site and their specific concerns.

Gatekeeper can also be used for other purposes. For example, the idea behind Gatekeeper originated because we have a site where we want to make information freely available to humans, but not to scrapers. However, Gatekeeper could also be used to upsell users who frequent your site (e.g. "We noticed you've viewed 10 articles in the past week, please subscribe!").

If you're considering custom logic like https://news.ycombinator.com/item?id=33160679 , Gatekeeper already has a rules engine, so you can specify such policies without having to write your own infrastructure.

I'd be happy to chat further to learn more about what you're looking for and to see if Gatekeeper might be able to help. Feel free to reach out via email (in profile) if you'd like to talk more.

joekok33 · 3y ago

If you are in a cPanel environment, go into your hosting panel and look at the IPs that are coming in. Apply a filter to those and that should stop the bot traffic.

_humancompiler · 3y ago

https://tlstoy.com/ detects fraudulent requests

rpigab · 3y ago

Advertise your website to as many real people as possible. Worthless robot traffic will then seem less important in comparison to actual human traffic. You could use billboards, hand out flyers, etc.
