Skimming the conclusions and results, the authors find that LLMs exhibit failures across many axes we'd consider demonstrative of AGI: moral reasoning, simple things a toddler can do like counting, and so on. They're just not human, and you can reasonably hypothesize that most of these failures stem from their nature as next-token predictors that happen to usually do what you want.
So. If you've got OpenClaw running and think you've got Jarvis from Iron Man, this is probably a good read to ground yourself.
Note there's a GitHub repo compiling these failures from the authors: https://github.com/Peiyang-Song/Awesome-LLM-Reasoning-Failur...
An LLM is more akin to a quirky human with anterograde amnesia: it can't form long-term memories anymore, and can only follow you through a long-ish conversation.
I'm not arguing that LLMs are human here, just that your reasoning doesn't make sense.
They're sold as AGI by the cloud providers and the whole stock market scam will collapse if normies are allowed to peek behind the curtain.
Which LLMs? There's tons of them and more powerful ones appear every month.
Specifically, the idea that LLMs fail to solve some tasks correctly due to fundamental limitations, where humans also fail periodically, may well be an instance of the fundamental attribution error.
LOL. Finally the Techbro-CEOs succeeded in creating an AI in their own image.
I don't see any solution longer term other than more personalized models.
Which models? The last ones came out this week.
>Basic Arithmetic. Another fundamental failure is that LLMs quickly fail in arithmetic as operands increase (Yuan et al., 2023; Testolin, 2024), especially in multiplication. Research shows models rely on superficial pattern-matching rather than arithmetic algorithms, thus struggling notably in middle-digits (Deng et al., 2024). Surprisingly, LLMs fail at simpler tasks (determining the last digit) but succeed in harder ones (first digit identification) (Gambardella et al., 2024). Those fundamental inconsistencies lead to failures for practical tasks like temporal reasoning (Su et al., 2024).
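As a rough way to probe the first/middle/last-digit claim from the quoted paragraph, here's a minimal sketch of a scoring harness. The digit positions and scoring scheme are my own simplification, not the cited papers' exact protocol, and the "model answer" here is just a placeholder string you'd swap for a real LLM reply:

```python
import random

def make_problem(n_digits, rng):
    # Sample two n-digit operands and return the prompt plus ground truth.
    lo, hi = 10 ** (n_digits - 1), 10 ** n_digits - 1
    a, b = rng.randint(lo, hi), rng.randint(lo, hi)
    return f"What is {a} * {b}?", str(a * b)

def digit_accuracy(answer, truth):
    # Score the first, middle, and last digit separately, mirroring the
    # per-position breakdown described in the quoted paragraph.
    if len(answer) != len(truth):
        return {"first": 0, "middle": 0, "last": 0}
    mid = len(truth) // 2
    return {
        "first": int(answer[0] == truth[0]),
        "middle": int(answer[mid] == truth[mid]),
        "last": int(answer[-1] == truth[-1]),
    }

rng = random.Random(0)
prompt, truth = make_problem(8, rng)
# A real experiment would send `prompt` to a model and score its reply:
print(digit_accuracy(truth, truth))  # a fully correct answer scores 1 everywhere
```

Averaging these per-position scores over many sampled problems is what lets you say "worse on middle digits than on the first digit" rather than just "wrong overall".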
This is very misleading and I think flat out wrong. What's the best way to falsify this claim?
Edit: I tried falsifying it.
https://chatgpt.com/share/6999b72a-3a18-800b-856a-0d5da45b94...
https://chatgpt.com/share/6999b755-62f4-800b-912e-d015f9afc8...
I provided really hard 20-digit multiplications without tools. If you look at the reasoning trace, it does what is normally expected and gets it right. I think this is enough to suggest that the claims made in the paper are not valid and that LLMs do reason well.
To anyone who would disagree: can you provide a counterexample that can't be solved by GPT-5 Pro but that a normal student could do without mistakes?
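For what it's worth, verifying a transcript like that is trivial: Python integers are arbitrary precision, so you can check any claimed 20-digit product exactly. The operands below are made up (the actual numbers are in the truncated share links, not in this thread):

```python
# Hypothetical 20-digit operands standing in for the ones in the transcripts.
a = 73914628503917462850
b = 19283746501928374650

def check(claimed_answer, x, y):
    # Python ints never overflow, so x * y is the exact product.
    return claimed_answer == x * y

print(check(a * b, a, b))  # paste the model's claimed product in place of a * b
```

This only tells you whether the final answer is right, of course; it says nothing about whether the reasoning trace honestly reflects how the answer was produced, which is the point raised in the next reply.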
This is not a valid experiment, because GPT models always have access to certain tools and will use them even if you tell them not to. They will fib the chain of thought after the fact to make it look like they didn't use a tool.
https://www.anthropic.com/research/alignment-faking
It's also well established that all the frontier models use Python for math problems, not just the GPT family.
Is that enough to falsify?
I'm willing to bet this is no longer true as well. We have models that are doing better than humans at IMO.
Not really. From my brief experience, they can guess the final answer, but the intermediate justifications and proofs are completely hallucinated bullshit.
(Possibly because the final answer is usually some sort of neat and beautiful expression, and human evaluators don't care much about the final answer anyway: in any olympiad you're graded on the soundness of your reasoning.)
IMO, symbolic AI is way too brittle and case-by-case to drive useful AI, but as a memory and reasoning system for more dynamic and flexible LLMs to call out to, it's a good idea.
I asked GPT to compute some hard multiplications and the reasoning trace seems valid and gets the answer right.
https://chatgpt.com/share/6999b72a-3a18-800b-856a-0d5da45b94...