If the LLM were doing this sort of thing at the user's explicit request, that would be fine. The problem is training. Every AI startup on the planet right now is aggressively crawling every site that will let them. The server isn't seeing occasional summaries from interested users; it's seeing thousands upon thousands of bots repeatedly requesting every link they can find, as fast as they can.
Then what if I ask the LLM 10 questions about the same domain and ask it to research further? Any human would then click through 50-100 articles to make sure they know what that domain contains. If that part is automated with an LLM, does that change anything legally? How many page URLs do you think one should be allowed to access per LLM prompt?
All of them; that's at the explicit request of the user. I'm not sure where the downvotes are coming from, since I agree with all of these points. The training crawlers have already pissed off plenty of server operators, so they quite reasonably tend to block first and ask questions later. I think that's important context.
For the most part I would assume they pay for access to Google or Bing's index. I also assume they don't really train models. So all their "crawling" is on behalf of users.
But that's not what this article is about. From what I understand, this article is about a user requesting information from a specific domain, not general scraping.