If the LLM were doing this sort of thing at the user's explicit request, that would be fine. The problem is training. Every AI startup on the planet right now is aggressively crawling every site that will let them. The server isn't seeing occasional summaries from interested users; it's seeing thousands upon thousands of bots repeatedly requesting every link they can find, as fast as they can.
Then what if I ask the LLM 10 questions about the same domain and ask it to research further? Any human would then click through 50-100 articles to make sure they know what that domain contains. If that part is automated with an LLM, does that change anything legally? How many page URLs do you think one should be allowed to access per LLM prompt?
All of them; that's at the explicit request of the user. I'm not sure where the downvotes are coming from, since I agree with all of these points. The training crawlers have already pissed off plenty of server operators, so they quite reasonably tend to block first and ask questions later. I think that's important context.
For the most part I would assume they pay for access to Google or Bing's index. I also assume they don't really train models. So all their "crawling" is on behalf of users.
But that's not what this article is about. From what I understand, this article is about a user requesting information from a specific domain, not general scraping.