undefined | Better HN

0 pointsrockskon8mo ago0 comments

LLM scrapers have dramatically been increasing the cost of hosting various small websites.

Without something being done, the data that these scrapers rely on would eventually no longer exist.

0 comments

I think the correct term is, that unrestricted LLM scrapers have dramatically been increasing the cost of hosting various small websites.

Its not a issue when somebody does "ethical" scraping, with for instance, a 250ms delay between requests, and a active cache that checks specific pages (like news article links) to rescrape at 12 or 24h intervals. This type of scraping results in almost no pressure on the websites.

The issue that i have seen, is that the more unscrupulous parties, just let their scrapers go wild, constantly rescraping again and again because the cost of scraping is extreme low. A small VM can easily push 1000's of scraps per second, let alone somebody with more dedicated resources.

Actually building a "ethical" scraper involves more time, as you need to fine tune it per website. Unfortunately, this behavior is going to cost the more ethical scraper a ton, as anti-scraping efforts will increase the cost on our side.

Tmpod8mo ago

The biggest issue for me is clearly masquerading their User-Agent strings. Regardless of whether they are slow and respectful crawlers, they should clearly identify themselves, provide a documentation URL and obey robots.txt. Without that, I have to play a frankly tiring game of cat and mouse, wasting my time and the time of my users (they have to put up with some form of captcha or PoW thing).

I've been an active lurker in the self-hosting community and I'm definitely not alone. Nearly everyone hosting public facing websites, particularly those whose form is rather juicy for LLMs, have been facing these issues. It costs more time and money to deal with this, when applying a simple User-Agent block would be much cheaper and trivial to do and maintain.

sigh

trollbridge8mo ago

I use Cloudflare and edge caching, so it doesn’t really affect me, but the amount of LLM scraping of various static assets for apps I host is ridiculous.

We’re talking a JavaScript file of strings to respond like “login failed”, “reset your password” just over and over again. Hundreds of fetches a day, often from what appears to be the same system.

SchemaLoad8mo ago

Turn on the the Cloudflare tarpit. When it detects LLM scrapers it starts generating infinite AI slop pages to feed the scrapers. Ruining their dataset and keeping them off your actual site.

j / k navigate · click thread line to collapse