threethirtytwo 4mo ago

Models don’t reproduce books though. It’s impossible for a model to reproduce something word for word because the model never copied the book.

Most of the best fit curve runs along a path that doesn’t even touch an actual data point.

empath75 4mo ago

They do memorize some books. You can test this trivially by asking ChatGPT to produce the first chapter of something in the public domain -- for example, A Tale of Two Cities. It may not be word-for-word exact, but it'll be very close.
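One rough way to put a number on "very close" is a word-level similarity ratio. This is only a sketch using Python's standard `difflib`: the `similarity` helper is a made-up name, and the `candidate` string is a hand-edited stand-in rather than real model output, so swap in whatever text the model actually returns.

```python
from difflib import SequenceMatcher


def similarity(reference: str, candidate: str) -> float:
    """Return a 0..1 ratio of how closely candidate matches reference, word by word."""
    ref_words = reference.split()
    cand_words = candidate.split()
    # autojunk=False turns off difflib's heuristic that ignores very frequent
    # items in long sequences, which would otherwise distort long passages.
    return SequenceMatcher(None, ref_words, cand_words, autojunk=False).ratio()


# Opening words of A Tale of Two Cities (public domain).
reference = "It was the best of times, it was the worst of times, it was the age of wisdom"
# Hypothetical output with one word changed, for illustration only.
candidate = "It was the best of times, it was the worst of times, it was an age of wisdom"

print(f"similarity: {similarity(reference, candidate):.2f}")
```

A ratio of 1.0 means the two texts are identical word for word; unrelated text scores near 0. It says nothing about whether a given amount of overlap matters legally.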

These academics were able to get multiple LLMs to produce large amounts of text from Harry Potter:

https://arxiv.org/abs/2601.02671

threethirtytwo (OP) 4mo ago

In that case I would say it is the act of reproducing the books that is illegal. Training the AI on said books is not.

So the illegality rests at the point of output and not at the point of input.

I’m just speaking in terms of the technical interpretation of what’s in place. My personal views on what it should be are another topic.

ckastner 4mo ago

> So the illegality rests at the point of output and not at the point of input.

It's not as simple as that, as this settlement shows [1].

Also, generating output is what these models are primarily trained for.

[1]: https://www.bbc.com/news/articles/c5y4jpg922qo

kalap_ur 4mo ago

If even one exact sentence is taken from the book without quotation marks and an exact source, that triggers copyright law. So the model doesn't have to reproduce the entire book; it only has to reproduce one specific sentence (which may be characteristic of that author or that book).

CamperBob2 4mo ago

> If there is one exact sentence taken out of the book and not referenced in quotes and exact source, that triggers copyright laws.

Yes, and that's stupid, and will need to be changed.

kelnos 4mo ago

Sure, but that use would easily pass a fair use test, at least in the US.

NicuCalcea 4mo ago

Models absolutely do reproduce books.

> With a simple two-phase procedure, we show that it is possible to extract large amounts of in-copyright text from four production LLMs. While we needed to jailbreak Claude 3.7 Sonnet and GPT-4.1 to facilitate extraction, Gemini 2.5 Pro and Grok 3 directly complied with text continuation requests. For Claude 3.7 Sonnet, we were able to extract four whole books near-verbatim, including two books under copyright in the U.S.: Harry Potter and the Sorcerer’s Stone and 1984.

https://arxiv.org/abs/2601.02671

thedailymail 4mo ago

The supplementary files in that paper—verbatim reproductions of the full texts of Frankenstein and The Great Gatsby—are pretty instructive. The research group highlighted all additions and omissions, but on most pages the differences are difficult to spot because they are only missing spaces, extra hyphens, and other typographical minutiae.
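To check whether two passages differ only in that kind of minutiae, one option is to strip spaces and hyphens before comparing, then list whatever character-level differences remain. This is a minimal sketch and not the paper's method: `normalize`, `typographic_only`, and `differences` are hypothetical helper names, and the `extracted` string is invented for illustration.

```python
import difflib
import re


def normalize(text: str) -> str:
    """Drop hyphens and all whitespace so purely typographical differences vanish."""
    return re.sub(r"\s+", "", text.replace("-", ""))


def typographic_only(original: str, extracted: str) -> bool:
    """True when the texts are identical once spaces and hyphens are ignored."""
    return normalize(original) == normalize(extracted)


def differences(original: str, extracted: str) -> list[tuple[str, str, str]]:
    """List (operation, original_piece, extracted_piece) for every non-matching span."""
    matcher = difflib.SequenceMatcher(None, original, extracted, autojunk=False)
    return [
        (tag, original[i1:i2], extracted[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


# Opening words of The Great Gatsby.
original = "In my younger and more vulnerable years"
# Invented example with a missing space and an extra hyphen.
extracted = "In my younger andmore vul-nerable years"

print(typographic_only(original, extracted))
for operation, before, after in differences(original, extracted):
    print(operation, repr(before), repr(after))
```

Stripping every hyphen also erases legitimate ones (for example in "well-known"), so this deliberately errs toward treating texts as the same.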
