NovelQA is a great one! I also like GSM-Symbolic -- a benchmark built by turning quite easy questions into _symbolic templates_ and sampling them repeatedly, varying things like which proper nouns are used, the order in which relevant details appear, and how many irrelevant details are injected and where (the GSM-NoOp variant), things like that.
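To make the idea concrete, here's a toy sketch of what template-based sampling looks like -- this is my own illustration, not the paper's actual code or templates, and all the names and numbers are made up:

```python
import random

# Toy symbolic template in the spirit of GSM-Symbolic: slots for a proper
# noun and two numbers, plus an optional GSM-NoOp-style distractor clause
# that is numerically irrelevant to the answer.
TEMPLATE = ("{name} picks {a} apples on Monday and {b} apples on Tuesday. "
            "{noop}How many apples does {name} have in total?")

NAMES = ["Sophie", "Omar", "Mei", "Lucas"]
NOOPS = ["", "Five of the apples are slightly smaller than average. "]

def sample_instance(rng):
    """Sample one concrete question and its ground-truth answer."""
    a, b = rng.randint(2, 20), rng.randint(2, 20)
    question = TEMPLATE.format(name=rng.choice(NAMES), a=a, b=b,
                               noop=rng.choice(NOOPS))
    return question, a + b

rng = random.Random(0)
q, ans = sample_instance(rng)
print(q)
print("answer:", ans)
```

The point is that every sampled instance is the _same_ problem to a human, so accuracy that swings with the choice of name or the presence of the distractor is measuring something other than reasoning.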
LLMs are far, _far_ below human performance on elementary problems once you allow any variation and stop spoonfeeding them perfectly phrased word problems. :)
https://machinelearning.apple.com/research/gsm-symbolic
https://arxiv.org/pdf/2410.05229
The paper came out in October 2024; I don't think many have fully absorbed the implications.
It's hard to take any claim that "LLMs can do reasoning!" seriously once you understand that simply changing the names used in an 8th-grade math word problem can have a dramatic impact on accuracy.