Perhaps, but reproducing the book from this memory could very well be illegal.
And these models are all about production.
Most of a best-fit curve runs along a path that doesn’t even touch an actual data point.
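For what it’s worth, that analogy is easy to demonstrate: an ordinary least-squares line fit to noisy points typically passes through none of them exactly. A minimal sketch (the data points here are made up purely for illustration):

```python
# Toy illustration: a least-squares line fit to noisy data
# generally hits zero of the actual data points exactly.

def fit_line(points):
    """Ordinary least-squares fit of y = a*x + b over (x, y) pairs."""
    n = len(points)
    sx = sum(x for x, _ in points)
    sy = sum(y for _, y in points)
    sxx = sum(x * x for x, _ in points)
    sxy = sum(x * y for x, y in points)
    a = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    b = (sy - a * sx) / n
    return a, b

# Hypothetical noisy observations around y ≈ x.
points = [(0, 0.1), (1, 0.9), (2, 2.2), (3, 2.8), (4, 4.3)]
a, b = fit_line(points)

# Residuals: distance between the fitted curve and each actual point.
residuals = [y - (a * x + b) for x, y in points]
hits = sum(1 for r in residuals if abs(r) < 1e-9)
print(f"slope={a:.3f}, intercept={b:.3f}, points hit exactly: {hits}")
```

The fit compresses the data into two parameters, yet (barring exact collinearity) reproduces none of the original observations verbatim, which is the sense in which a model can "learn from" inputs without storing them.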
These academics were able to get multiple LLMs to produce large amounts of text from Harry Potter:
So the illegality rests at the point of output and not at the point of input.
I’m just speaking in terms of the technical interpretation of what’s in place. My personal views on what it should be are another topic.
Yes, and that's stupid, and will need to be changed.
> With a simple two-phase procedure, we show that it is possible to extract large amounts of in-copyright text from four production LLMs. While we needed to jailbreak Claude 3.7 Sonnet and GPT-4.1 to facilitate extraction, Gemini 2.5 Pro and Grok 3 directly complied with text continuation requests. For Claude 3.7 Sonnet, we were able to extract four whole books near-verbatim, including two books under copyright in the U.S.: Harry Potter and the Sorcerer’s Stone and 1984.