> For example, blocking content from future AI models could decrease a site's or a brand's cultural footprint if AI chatbots become a primary user interface in the future.
I would rather leave the internet entirely if AI chatbots become a primary user interface.
I know what you're saying, and totally agree. Unfortunately the term "AI" is now meaningless.
Sugary cereals and desserts have taken over much of snacking today; that doesn't mean it's a good thing.
This is why I'm not reassured. robots.txt isn't sufficient to stop all web crawlers, so there's every reason to think it isn't sufficient to stop AI scrapers.
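For context, the opt-out being discussed is just a robots.txt rule keyed on OpenAI's documented `GPTBot` user-agent token. A minimal example looks like this; note that compliance is entirely voluntary, which is the whole problem:

```
# Ask OpenAI's crawler (UA token "GPTBot") to stay out entirely.
# A scraper that chooses to ignore robots.txt sees this file and
# crawls anyway -- it's a request, not an access control.
User-agent: GPTBot
Disallow: /
```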
I'm still hoping to find a good solution to this problem so that I can open my sites up to the public again.
I would think filtering based on user agent will be the sweet spot for effort and performance. You could do some awful JavaScript monstrosity to detect the tiny fraction of bots who are sneaky, but if they're determined to be sneaky they will succeed at scraping.
> if they're determined to be sneaky they will succeed at scraping.
Yes, which is why I suspect I will never be able to open my websites up to the general public again. I live in hope anyway.
Really just encourages phones to be even more locked down
`if ($http_user_agent ~* "(GPTBot|AI)") { return 410; }`
It's not perfect, but it should filter them indefinitely; I'll probably have to add more terms over time.
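A slightly tidier sketch of the same idea uses nginx's `map` directive to compute a flag from the User-Agent once, at the `http` level, instead of matching inside an `if`. The extra bot tokens below (`CCBot`, `anthropic-ai`) and the server name are examples/assumptions, not a complete or authoritative list:

```
# Map the User-Agent header to a 0/1 flag (http{} context).
# Tokens are case-insensitive regexes; extend the list over time.
map $http_user_agent $deny_scraper {
    default          0;
    "~*GPTBot"       1;
    "~*CCBot"        1;
    "~*anthropic-ai" 1;
}

server {
    listen 80;
    server_name example.com;  # hypothetical

    # 410 Gone, same response as the one-liner above
    if ($deny_scraper) { return 410; }
}
```

The `map` variant also sidesteps the usual caveats about `if` inside `location` blocks.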
It's more pragmatic to expect that any data that can be accessed one way or another will be scraped because interests aren't aligned between content authors and scrapers.
On the other hand, robots.txt benefited both search engines and content authors: it signaled data that wasn't useful to show in search results, so search engines had an incentive to follow its rules.
Or perhaps all crawlers, regardless of whether they respect robots.txt. Honestly, I'm not interested in improving some FAANGish algorithm with blog posts intended for my friends.
there is zero benefit to me in allowing OpenAI to absorb my content
it is a parasite, plain and simple (as is GitHub Copilot)
and I'll be hooking in the procedurally generated garbage pages for it soon!
In this particular case, if enough people block ChatGPT scraping then it cannot become the next Google. Most notably, I imagine all commercial news organizations will block it, because they need readers to visit their actual sites to pay for the news they publish. And it will remain that way until it can be demonstrated that ChatGPT drives more traffic to a website than it redirects away from it. The Microsoft chat in Edge is much closer to that, in the way its summaries include clickable quotes from articles.
Instead, use a redirect or return a response code by doing a user agent check in your server config. I posted elsewhere in this thread on the way I did it with nginx.
If they won't respect robots.txt, they aren't interested in your consent.
Respecting robots.txt has nothing to do with what the UA is set to. Yes, you can say in robots.txt that a given UA may do X, but if the crawler doesn't respect it, that's moot.
The method I put in place does not use robots.txt, so there's no need to worry about them not respecting it anymore.
As someone else mentioned, like the world of spam, it's an arms race. The solution may not be perfect, but it's functional.