https://www.wizardemporium.com/blog/complete-list-of-harry-p...
Why is this impressive?
Do you think it's actually ingesting the books and only using those as a reference? Is that how LLMs work at all? It seems more likely it's predicting these spell names from all the other references it has found on the internet, including lists of spells.
No need for surprises! It is publicly known that the corpora of 'shadow libraries' such as Library Genesis and Anna's Archive were specifically and manually requested by at least NVIDIA for their training data [1], used by Google in their training [2], downloaded by Meta employees [3], etc.
[1] https://news.ycombinator.com/item?id=46572846
[2] https://www.theguardian.com/technology/2023/apr/20/fresh-con...
[3] https://www.theverge.com/2023/7/9/23788741/sarah-silverman-o...
"Researchers Extract Nearly Entire Harry Potter Book From Commercial LLMs"
https://www.aitechsuite.com/ai-news/ai-shock-researchers-ext...
This definitely raises an interesting question. It seems like a good chunk of popular literature (especially from the 2000s) exists online in big HTML files. Immediately to mind was House of Leaves, Infinite Jest, Harry Potter, basically any Stephen King book - they've all been posted at some point.
Do LLMs have a good way of inferring where knowledge from the context begins and knowledge from the training data ends?
Anna's Archive alone claims to currently publicly host 61,654,285 books, more than 1PB in total.
https://www.washingtonpost.com/technology/2026/01/27/anthrop...
Anthropic, specifically, ingested libraries of books by scanning and then disposing of them.
The plot of Good Will Hunting would like a word.
If you ask a model to discuss an obscure work it'll have no clue what it's about.
This is very different from asking about Harry Potter.
There are many academic domains where the research portion of a PhD is essentially what the model just did. For example, PhD students in some of the humanities will spend years combing ancient sources for specific combinations of prepositions and objects, only to write a paper showing that the previous scholars were wrong (and that a particular preposition has examples of being used with people rather than places).
This sort of experiment shows that Opus would be good at that. I'm assuming it's trivial for the OP to extend their experiment to determine how many times "wingardium leviosa" was used on an object rather than a person.
(It's worth noting that other models are decent at this, and you would need to find a way to benchmark between them.)
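As a rough illustration of the non-LLM baseline for that kind of counting task: a naive corpus search is easy to write, but it only finds the literal phrase and can't classify whether the spell was cast on an object or a person, which is exactly the part that needs either manual reading or a model. A minimal sketch (the corpus dict here is a hypothetical stand-in for the actual book texts):

```python
import re

# Hypothetical corpus: book title -> full text. In practice these would be
# loaded from the actual source files.
corpus = {
    "book1": "He pointed his wand at the feather. 'Wingardium Leviosa!' The feather rose.",
    "book2": "She cried 'Wingardium Leviosa' and the club flew upward.",
}

# Case-insensitive match, tolerating arbitrary whitespace between the words.
spell = re.compile(r"wingardium\s+leviosa", re.IGNORECASE)

# Count literal occurrences per book; note this says nothing about what
# the spell was cast on -- that classification is the hard part.
counts = {title: len(spell.findall(text)) for title, text in corpus.items()}
print(counts)
```

Grep-style counting like this is trivial; the interesting claim is that a model can do the *semantic* half (object vs. person) at scale.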
In your example, it might be the case that the model simply spits out the consensus view, rather than actually finding/constructing this information on its own.
The poster knows all of that; this is plain marketing.
Do you have a citation for this?