ByteDance's crawler (Bytespider) is another one that disregards robots.txt but still identifies itself, and you should probably block it because it's very aggressive.
It's going to get annoying fast when they inevitably go full blackhat and start masquerading as normal browser traffic.
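For now, while it still sends an honest User-Agent, a server-side block is enough. A minimal nginx sketch (standard if/return directives; only the UA string from above is assumed):

    # In the server block: refuse requests that identify as Bytespider.
    # Case-insensitive match on the User-Agent header; this stops working
    # the moment the crawler starts spoofing a browser UA.
    if ($http_user_agent ~* "bytespider") {
        return 403;
    }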
That doesn't seem supported by the citation: https://www.404media.co/websites-are-blocking-the-wrong-ai-s...
Some more discussion: https://news.ycombinator.com/item?id=41060559
(disclaimer: I wrote this blog post)
They even respect extended robots.txt features like:

    User-agent: *
    Disallow: /library/*.pdf$
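Those wildcard and end-anchor extensions (popularized by Googlebot's robots.txt handling) map straightforwardly onto regexes. A quick Python sketch of the matching semantics, not any particular crawler's actual implementation:

    import re

    def rule_to_regex(rule: str) -> re.Pattern:
        # In the extended syntax, '*' matches any run of characters and a
        # trailing '$' anchors the match at the end of the URL path.
        anchored = rule.endswith("$")
        body = rule[:-1] if anchored else rule
        pattern = "".join(".*" if ch == "*" else re.escape(ch) for ch in body)
        return re.compile("^" + pattern + ("$" if anchored else ""))

    rule = rule_to_regex("/library/*.pdf$")
    assert rule.match("/library/annual-report.pdf")
    assert not rule.match("/library/annual-report.pdf?download=1")  # '$' excludes this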
I make my websites for other people to see. They are not secrets I hoard whose value goes away when copied; the more copies and derivations, the better. I guess ideas like Creative Commons and sharing go away when the smell of money enters the water. Better lock all your text behind paywalls so the evil corporations won't get it.

Just be aware: for every incorporated entity you block, you're blocking just as many humans with false positives, if not more. This anti-"scraping" hysteria is mostly profit-motivated.
That seems overly reductive.
First, it sounds like you're insinuating that the people claiming the bots are causing actual disruption to their operations are lying. If that's your intent, some amount of evidence for that would be welcome.
Second, lots of people don't want their content to be used to train these models for reasons that have nothing whatsoever to do with money. Trying to avoid contributing to the training of these models is not the equivalent of rejecting the idea of the free exchange of information.
I qualified my statement, but you've chosen to ignore that. I've been paying close attention to the Anthropic bots for a (relatively) long time, and this Mastodon group's problems come as a surprise to me based on that lived experience. I don't doubt the truth of their claims; I looked at https://cdn.fosstodon.org/media_attachments/files/112/877/47... and I see the bandwidth used. But like I said,
>I'm not saying my dozens of run-ins with the anthropic bots (there have been 3 variations I've seen so far) are totally representative,
My take here is that their one limited experience also isn't representative, and others are projecting it onto the entire project due to a shifting cultural perception that "scraping" is something weird and bad to be stopped. But it's not. If it were me, I'd be checking my webserver config to be sure robots.txt is actually being violated, and I'd check the per-user-agent bandwidth limits I've set in nginx to make sure they match. That'd solve it. I'm sure the Mastodon software has better solutions, even if they haven't solved their own DDoS-generating problem since 2017 (ref: https://github.com/mastodon/mastodon/issues/4486)
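For reference, per-user-agent throttling in nginx can be done by mapping the User-Agent onto a rate and feeding it into limit_rate (variables in limit_rate need nginx 1.17.0+; the UA patterns, rates, and hostname here are just examples):

    # Map known crawler user agents to a bandwidth cap; 0 means unlimited.
    map $http_user_agent $crawler_rate {
        default          0;
        ~*claudebot      50k;   # Anthropic's crawler
        ~*bytespider     50k;   # ByteDance's crawler
    }

    server {
        listen 80;
        server_name example.com;   # placeholder

        location / {
            root /var/www/html;
            limit_rate $crawler_rate;
        }
    }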
Funny thing - with WASM, the web won't be scrapable.
https://fosstodon.org/@readthedocs/112877477202118215
A bunch of the traffic hit the origin.