Yes, of course it can, because the books fit in the context window. But this is an awful test of the model's capabilities, because it was almost certainly trained on these books and on websites discussing them and the HP universe.
I know, for instance, that GPT-4 does much better with the Python manual when we quote relevant context, even though it was trained on the Python manual. This suggests pretraining recall is less than perfect.
Likewise, in the Harry Potter case I expect a significant difference between its background knowledge and the context-enhanced trial. But I don't have any intuition about the effect size we should expect! That makes it a fun experiment.
The linked X post shows that the user asked the model to generate a graph of the characters, which was presumably a novel question. This is a legitimate test of the model's ability to understand and answer questions about the training data. Repeating the books in the prompt for emphasis makes sense, since the model probably didn't memorize all the relevant details.
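For a sense of what a "graph of the characters" involves, here is a minimal sketch of the naive non-LLM approach: count pairwise co-occurrences of character names per sentence and use the counts as edge weights. The character list and sample text below are made up for illustration; a real run would use the full book text and a proper entity recognizer.

```python
from collections import Counter
from itertools import combinations

# Hypothetical character list and toy text, standing in for the real books.
characters = ["Harry", "Ron", "Hermione", "Snape"]
text = (
    "Harry and Ron walked to class. "
    "Hermione found Harry in the library. "
    "Snape glared at Harry and Ron."
)

# Each sentence where two characters appear together adds 1 to that edge.
edges = Counter()
for sentence in text.split(". "):
    present = [c for c in characters if c in sentence]
    for pair in combinations(sorted(present), 2):
        edges[pair] += 1

print(edges.most_common())
# ("Harry", "Ron") ends up with the heaviest edge: they share two sentences.
```

The point of asking an LLM instead is that it can weight edges by actual relationships (ally, rival, family) rather than raw proximity, which is where memorized vs. in-context knowledge starts to matter.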
I have the same problem with benchmarks that use real world tests (like SAT/LSAT/GRE or whatever else). The model got a good score, sure, but how many thousands of variations of this exact test was it trained on? How many questions did it encounter that were similar or the same?
There is zero chance this is anywhere in the model's dataset, and we were able to perform basic translation to and from English.
I’m always suspicious of these kinds of tests. It needs to be run with an unpublished book, not one of the most popular series of the 21st century.
If you want a real test, go test it on some Japanese light novel, or some Harry Potter fanfiction, and see if the model actually understands the plot details.
For reference, Opus/GPT-4 know the rough story of moderately popular light novels/manga without any context given. However, they do not precisely understand the fine-grained details of the story, like which character would win in a fight.
By way of reference, mine is currently around 7 MB.
What I’m not sure about is whether Gemini 1.5 is truncating it.
An LLM could read all of the books with Infini-attention (2024-04): https://news.ycombinator.com/item?id=40001626#40020560
Having used GPTs to do creative writing I can report that they are good for solving the tyranny of the blank page, but then you have to read and edit hundreds of pages of dank AI prose, which never quite aligns with your creative vision, to harvest a few nuggets of creativity. Does it end up saving any time?
At a surface level, LLMs wow, but when you dig into the details there are often still huge gaps in output quality for many tasks.
Why fan-fiction? Well, fan-fiction is not famous enough to be included in any training corpus, I believe. But there is enough Harry Potter fan-fiction to test the context limit. It also has both similarities to and differences from the originals, and telling the two apart requires correct recall. That would be a good test, wouldn't it?
It’s cheap to gather, unlikely to have any recourse, and has a huge range of quality.
I refuse to even read it because clickbait makes me sad, but something like "Gemini 1.5 can read all the HP books at once" would be a more appropriate title for this forum, imo.
People created a map of all the Star Wars characters manually years ago. Being able to see all the characters mapped out from a story you’re interested in is pretty fun and helpful.