If nothing is done, the data that these scrapers rely on will eventually no longer exist.
It's not an issue when somebody does "ethical" scraping with, for instance, a 250ms delay between requests and an active cache that checks specific pages (like news article links) to rescrape at 12 or 24h intervals. This type of scraping puts almost no pressure on the websites.
The issue that I have seen is that the more unscrupulous parties just let their scrapers go wild, constantly rescraping again and again because the cost of scraping is extremely low. A small VM can easily push 1000s of scrapes per second, let alone somebody with more dedicated resources.
Actually building an "ethical" scraper takes more time, as you need to fine-tune it per website. Unfortunately, the unscrupulous behavior is going to cost the more ethical scrapers a ton, as anti-scraping efforts will increase the cost on our side.
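To make the "ethical" setup concrete, here is a rough sketch of a fetcher that waits 250ms between requests and only rescrapes a page once its cached copy is 12h old. Everything in it (the `PoliteFetcher` name, the injected `fetch` callable, the in-memory cache) is made up for illustration, and a real scraper would also want per-site tuning and a persistent cache.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class PoliteFetcher:
    """Rate-limited fetcher with a time-based cache (hypothetical sketch)."""

    def __init__(
        self,
        fetch: Callable[[str], str],       # assumed: any function that downloads a URL
        min_delay: float = 0.25,           # 250 ms between requests
        ttl: float = 12 * 3600,            # rescrape a page at most every 12 h
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self._fetch = fetch
        self._min_delay = min_delay
        self._ttl = ttl
        self._clock = clock
        self._sleep = sleep
        self._last_request: Optional[float] = None
        self._cache: Dict[str, Tuple[float, str]] = {}

    def get(self, url: str) -> str:
        now = self._clock()
        cached = self._cache.get(url)
        if cached is not None and now - cached[0] < self._ttl:
            return cached[1]  # still fresh: no request is sent at all

        # Space requests out so the site never sees a burst from us.
        if self._last_request is not None:
            wait = self._min_delay - (now - self._last_request)
            if wait > 0:
                self._sleep(wait)

        body = self._fetch(url)
        self._last_request = self._clock()
        self._cache[url] = (self._last_request, body)
        return body
```

With this shape, repeated calls for the same article link inside the 12h window never touch the site, and new links go out at no more than four requests per second.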
I've been an active lurker in the self-hosting community and I'm definitely not alone. Nearly everyone hosting public-facing websites, particularly those whose form is rather juicy for LLMs, has been facing these issues. Dealing with this costs more time and money, when a simple User-Agent block would be much cheaper and trivial to apply and maintain.
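For reference, this is roughly how small a User-Agent block can be, sketched here as a WSGI middleware. The `block_user_agents` name and the token list are just examples, and it only helps against crawlers that identify themselves honestly.

```python
# Example tokens only; pick whatever crawlers you want to turn away.
BLOCKED_AGENT_TOKENS = ("GPTBot", "CCBot")


def block_user_agents(app, blocked=BLOCKED_AGENT_TOKENS):
    """Wrap a WSGI app and answer 403 to any listed User-Agent (sketch)."""

    def middleware(environ, start_response):
        agent = environ.get("HTTP_USER_AGENT", "").lower()
        if any(token.lower() in agent for token in blocked):
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Forbidden\n"]
        # Everyone else goes through to the real application.
        return app(environ, start_response)

    return middleware
```

In practice most people would put the same check in their reverse proxy config rather than in application code, but the logic is the same substring match.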
sigh
We’re talking about a JavaScript file of response strings like “login failed” and “reset your password”, fetched over and over again. Hundreds of fetches a day, often from what appears to be the same system.