Yes, of course it can, because the books fit in the context window. But this is an awful test of the model's capabilities, because it was almost certainly trained on these books and on websites discussing them and the HP universe.
I know, for instance, that GPT-4 does much better with the Python manual when we quote relevant context, even though it was trained on the Python manual. This suggests pretraining recall is less than perfect.
Likewise, in the Harry Potter case I expect a significant difference between its background knowledge and the context-enhanced trial. But I don't have any intuition about the effect size we should expect! That makes it a fun experiment.
The linked X post shows that the user asked the model to generate a graph of the characters, which was presumably a novel question. This is a legitimate test of the model's ability to understand and answer questions about the training data. Repeating the books in the prompt for emphasis makes sense, since the model probably didn't memorize all the relevant details.
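For a sense of what a "graph of the characters" involves, here is a minimal sketch of the naive non-LLM approach: count pairwise co-occurrences of character names per sentence and use the counts as edge weights. The character list and sample text below are made up for illustration; a real run would use the full book text and a proper entity recognizer.

```python
from collections import Counter
from itertools import combinations

# Hypothetical character list and toy text, standing in for the real books.
characters = ["Harry", "Ron", "Hermione", "Snape"]
text = (
    "Harry and Ron walked to class. "
    "Hermione found Harry in the library. "
    "Snape glared at Harry and Ron."
)

# Each sentence where two characters appear together adds 1 to that edge.
edges = Counter()
for sentence in text.split(". "):
    present = [c for c in characters if c in sentence]
    for pair in combinations(sorted(present), 2):
        edges[pair] += 1

print(edges.most_common())
# ("Harry", "Ron") ends up with the heaviest edge: they share two sentences.
```

The point of asking an LLM instead is that it can weight edges by actual relationships (ally, rival, family) rather than raw proximity, which is where memorized vs. in-context knowledge starts to matter.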
I have the same problem with benchmarks that use real world tests (like SAT/LSAT/GRE or whatever else). The model got a good score, sure, but how many thousands of variations of this exact test was it trained on? How many questions did it encounter that were similar or the same?
There is zero chance this is anywhere in the model's dataset, and we were able to perform basic translation to and from English.
I’m always suspicious of these kinds of tests. It needs to be run with an unpublished book, not one of the most popular series of the 21st century.
If you want a real test, go test it on some Japanese light novel, or some Harry Potter fanfiction, and see if the model actually understands the plot details.
For reference, Opus/GPT-4 know the rough story of moderately popular light novels/manga without any context given. However, they do not precisely understand the fine-grained details of the story, like which character would win in a fight.
By way of reference, mine is currently around 7 MB.
What I’m not sure about is whether Gemini 1.5 is truncating it.
An LLM could read all of the books with Infini-attention (2024-04): https://news.ycombinator.com/item?id=40001626#40020560
Having used GPTs to do creative writing I can report that they are good for solving the tyranny of the blank page, but then you have to read and edit hundreds of pages of dank AI prose, which never quite aligns with your creative vision, to harvest a few nuggets of creativity. Does it end up saving any time?
At a surface level, LLMs wow, but when you dig into the details there are often still huge gaps in output quality for many tasks.
Why fan-fiction? Well, fan-fiction is not famous enough to be included in any training corpus, I believe. But there is enough Harry Potter fan-fiction to test the context limit. It also has both similarities to and differences from the originals, and telling the two apart requires correct recall. That would be a good test, wouldn't it?
It’s cheap to gather, unlikely to have any recourse, and has a huge range of quality.
I refuse to even read it because clickbait makes me sad, but something like "Gemini 1.5 can read all the HP books at once" would be a more appropriate title for this forum, imo.
People created a map of all the Star Wars characters manually years ago. Being able to see all the characters mapped out from a story you’re interested in is pretty fun and helpful.