
brendoelfrendo 9mo ago

Bad news: it doesn't seem to work as well as you might think: https://arxiv.org/pdf/2508.01191

As one might expect: the AI isn't actually thinking, it's just spending more tokens on the problem. This sometimes leads to the desired outcome, but the phenomenon is very brittle and disappears when the AI is pushed outside the bounds of its training.

To quote their discussion, "CoT is not a mechanism for genuine logical inference but rather a sophisticated form of structured pattern matching, fundamentally bounded by the data distribution seen during training. When pushed even slightly beyond this distribution, its performance degrades significantly, exposing the superficial nature of the “reasoning” it produces."


hodgehog11 9mo ago

I keep wondering whether people have actually examined how this work draws its conclusions before citing it.

This is science at its worst, where you start at an inflammatory conclusion and work backwards. There is nothing particularly novel presented here, especially not in the mathematics; obviously performance will degrade on out-of-distribution tasks (and will do so for humans under the same formulation), but the real question is how out-of-distribution a lot of tasks actually are if they can still be solved with CoT. Yes, if you restrict the dataset, then it will perform poorly. But humans already have a pretty large visual dataset to pull from, so what are we comparing to here? How do tiny language models trained on small amounts of data demonstrate fundamental limitations?

I'm eager to see more works showing the limitations of LLM reasoning, both at small and large scale, but this ain't it. Others have already supplied similar critiques, so let's please stop sharing this one around without the grain of salt.

ipaddr 9mo ago

"This is science at its worst, where you start at an inflammatory conclusion and work backwards"

Science starts with a guess and you run experiments to test.

hodgehog11 9mo ago

True, but the experiments are engineered to give the results they want. It's a mathematical certainty that the performance will drop off here, but that is not an accurate assessment of what is going on at scale. If you present an appropriately large and well-trained model with in-context patterns, it often does a decent job, even when it isn't trained on them. Once the model is nerfed (4 layers), the conclusion is foregone.

I honestly wish this paper actually showed what it claims, since it is a significant open problem to understand CoT reasoning relative to the underlying training set.

lossolo 9mo ago

Without a provable holdout, the claim that "large models do fine on unseen patterns" is unfalsifiable. In controlled from-scratch training, CoT performance collapses under modest distribution shift, even with plausible chains. If you have results where the transformation family is provably excluded from training and a large model still shows robust CoT, please share them. Otherwise this paper’s claim stands for the regime it tests.
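To make "provably excluded from training" concrete, here is a minimal sketch of a split where one transformation family is withheld by construction. The two families (`rot`, `cyclic_shift`), the `make_split` helper, and all sizes are hypothetical toy choices for illustration, not the paper's actual setup.

```python
import random
import string

ALPHABET = string.ascii_uppercase


def rot(text: str, k: int) -> str:
    """Shift every letter k places forward in the alphabet (ROT-k)."""
    return "".join(ALPHABET[(ALPHABET.index(c) + k) % 26] for c in text)


def cyclic_shift(text: str, k: int) -> str:
    """Rotate the characters of the string left by k positions."""
    k %= len(text)
    return text[k:] + text[:k]


# Hypothetical transformation families, keyed by name.
FAMILIES = {"rot": rot, "shift": cyclic_shift}


def make_split(held_out, n_train=1000, n_test=200, length=4, seed=0):
    """Build a train set that never uses `held_out` and a test set that only uses it."""
    rng = random.Random(seed)

    def sample(names, n):
        rows = []
        for _ in range(n):
            name = rng.choice(names)
            k = rng.randint(1, 3)
            src = "".join(rng.choice(ALPHABET) for _ in range(length))
            rows.append({"family": name, "k": k, "input": src,
                         "target": FAMILIES[name](src, k)})
        return rows

    train = sample([f for f in FAMILIES if f != held_out], n_train)
    test = sample([held_out], n_test)
    # The exclusion is checked at the family-label level, not assumed.
    assert all(row["family"] != held_out for row in train)
    return train, test


train, test = make_split("shift")
```

This kind of guarantee is only available when you control the training data. For a pretrained large model the training set is not inspectable, which is why the holdout cannot be verified there.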


razodactyl 9mo ago

A viable consideration is that the models will home in on and reinforce an incorrect answer - a natural side effect of the LLM technology wanting to push certain answers higher in probability and repeat anything in context.

Regardless of whether it is in a conversation or a thinking context, this doesn't prevent the model from speaking the wrong answer, so the paper on the illusion of thinking makes sense.

What actually seems to be happening is a form of conversational prompting. Of course with the right conversation back and forth with an LLM you can inject knowledge in a way that causes the natural distribution to shift (again - side effect of the LLM tech.) but by itself it won't naturally get the answer perfect every time.

If this extended thinking were actually working, you would expect the LLM to be able to logically conclude an answer with very high accuracy 100% of the time, which it does not.

dcre 9mo ago

The other commenter is more articulate, but you simply cannot draw the conclusion from this paper that reasoning models don't work well. They trained tiny little models and showed they don't work. Big surprise! Meanwhile every other piece of evidence available shows that reasoning models are more reliable at sophisticated problems. Just a few examples:

- https://arcprize.org/leaderboard

- https://aider.chat/docs/leaderboards/

- https://arstechnica.com/ai/2025/07/google-deepmind-earns-gol...

Surely the IMO problems weren't "within the bounds" of Gemini's training data.

robrenaud 9mo ago

The Gemini IMO result used a specifically fine tuned model for math.

Certainly they weren't training on the unreleased problems. Defining out of distribution gets tricky.

simianwords 9mo ago

>The Gemini IMO result used a specifically fine tuned model for math.

This is false.

https://x.com/YiTayML/status/1947350087941951596

This is false even for the OpenAI model

https://x.com/polynoamial/status/1946478250974200272

"Typically for these AI results, like in Go/Dota/Poker/Diplomacy, researchers spend years making an AI that masters one narrow domain and does little else. But this isn’t an IMO-specific model. It’s a reasoning LLM that incorporates new experimental general-purpose techniques."

Workaccount2 9mo ago

Every human taking that exam has fine tuned for math, specifically on IMO problems.

simianwords 9mo ago

This is not the slam dunk you think it is. Thinking longer genuinely provides better accuracy. Sure, there are diminishing returns to increasing thinking tokens.

GPT 5 fast gets many things wrong but switching to the thinking model fixes the issues very often.

p1esk 9mo ago

They experimented with gpt-2 scale models. Hard to draw any meaningful conclusions in the gpt-5 era.
