> For example, blocking content from future AI models could decrease a site's or a brand's cultural footprint if AI chatbots become a primary user interface in the future.
I would rather leave the internet entirely if AI chatbots become a primary user interface.
I know what you're saying, and totally agree. Unfortunately the term "AI" is now meaningless.
Sugary cereals and desserts have taken over much of snacking today; that doesn't mean it's a good thing.
This is why I'm not reassured. robots.txt isn't sufficient to stop all web crawlers, so there's every reason to think it isn't sufficient to stop AI scrapers.
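For context, the opt-out being discussed is just a robots.txt rule keyed on OpenAI's documented `GPTBot` user-agent token. A minimal example looks like this; note that compliance is entirely voluntary, which is the whole problem:

```
# Ask OpenAI's crawler (UA token "GPTBot") to stay out entirely.
# A scraper that chooses to ignore robots.txt sees this file and
# crawls anyway -- it's a request, not an access control.
User-agent: GPTBot
Disallow: /
```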
I'm still hoping to find a good solution to this problem so that I can open my sites up to the public again.
I would think filtering based on user agent will be the sweet spot for effort and performance. You could do some awful JavaScript monstrosity to detect the tiny fraction of bots who are sneaky, but if they're determined to be sneaky they will succeed at scraping.
> if they're determined to be sneaky they will succeed at scraping.
Yes, which is why I suspect I will never be able to open my websites up to the general public again. I live in hope anyway.
Really just encourages phones to be even more locked down
`if ($http_user_agent ~* "(GPTBot|AI)") { return 410; }`
It's not perfect, but it should filter them indefinitely; I'll probably have to add more terms over time.
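A slightly tidier sketch of the same idea uses nginx's `map` directive to compute a flag from the User-Agent once, at the `http` level, instead of matching inside an `if`. The extra bot tokens below (`CCBot`, `anthropic-ai`) and the server name are examples/assumptions, not a complete or authoritative list:

```
# Map the User-Agent header to a 0/1 flag (http{} context).
# Tokens are case-insensitive regexes; extend the list over time.
map $http_user_agent $deny_scraper {
    default          0;
    "~*GPTBot"       1;
    "~*CCBot"        1;
    "~*anthropic-ai" 1;
}

server {
    listen 80;
    server_name example.com;  # hypothetical

    # 410 Gone, same response as the one-liner above
    if ($deny_scraper) { return 410; }
}
```

The `map` variant also sidesteps the usual caveats about `if` inside `location` blocks.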
It's more pragmatic to expect that any data that can be accessed one way or another will be scraped because interests aren't aligned between content authors and scrapers.
On the other hand, robots.txt benefited both search engines and content authors: it signaled data that wasn't useful to show in search results, so search engines had an incentive to follow its rules.
Or perhaps all crawlers, regardless of whether they respect robots.txt. Honestly, I'm not interested in improving some FAANGish algorithm with blog posts intended for my friends.
there is zero benefit to me in allowing OpenAI to absorb my content
it is a parasite, plain and simple (as is GitHub Copilot)
and I'll be hooking in the procedurally generated garbage pages for it soon!
In this particular case, if enough people block ChatGPT scraping then it cannot become the next Google. Most notably, I imagine all commercial news organizations will block it, because they need readers to visit their actual sites to pay for the news they publish. And it will remain that way until it can be demonstrated that ChatGPT drives more traffic to a website than it redirects away from it. The Microsoft chat in Edge is much closer to that, in the way its summaries include clickable quotes from articles.
Instead, use a redirect or return a response code by doing a user agent check in your server config. I posted elsewhere in this thread on the way I did it with nginx.
If they won't respect robots.txt, they aren't interested in your consent.
Respecting robots.txt has nothing to do with what the UA is set to. Yes, you can say in robots.txt that a given UA may do X, but if the crawler doesn't respect it, that's moot.
The method I put in place does not use robots.txt, so there's no need to worry about them not respecting it anymore.
As someone else mentioned, like the world of spam, it's an arms race. The solution may not be perfect, but it's functional.