Skimming the conclusions and results, the authors find that LLMs exhibit failures across many axes we'd consider demonstrative of AGI: moral reasoning, simple things a toddler can do like counting, and so on. They're just not human, and you can reasonably hypothesize that most of these failures stem from their nature as next-token predictors that happen to usually do what you want.
So. If you've got OpenClaw running and think you've got Jarvis from Iron Man, this is probably a good read to ground yourself.
Note there's a GitHub repo compiling these failures from the authors: https://github.com/Peiyang-Song/Awesome-LLM-Reasoning-Failur...
An LLM is more akin to a quirky human with anterograde amnesia: it can't form long-term memories anymore, and can only follow you through a long-ish conversation.
I'm not arguing that LLMs are human here, just that your reasoning doesn't make sense.
They're sold as AGI by the cloud providers and the whole stock market scam will collapse if normies are allowed to peek behind the curtain.
Which LLMs? There's tons of them and more powerful ones appear every month.
Specifically, the idea that LLMs fail to solve some tasks correctly due to fundamental limitations, where humans also fail periodically, may well be an instance of the fundamental attribution error.
LOL. Finally the Techbro-CEOs succeeded in creating an AI in their own image.
I don't see any solution longer term other than more personalized models.
Which models? The last ones came out this week.
>Basic Arithmetic. Another fundamental failure is that LLMs quickly fail in arithmetic as operands increase (Yuan et al., 2023; Testolin, 2024), especially in multiplication. Research shows models rely on superficial pattern-matching rather than arithmetic algorithms, thus struggling notably in middle-digits (Deng et al., 2024). Surprisingly, LLMs fail at simpler tasks (determining the last digit) but succeed in harder ones (first digit identification) (Gambardella et al., 2024). Those fundamental inconsistencies lead to failures for practical tasks like temporal reasoning (Su et al., 2024).
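As a rough way to probe the first/middle/last-digit claim from the quoted paragraph, here's a minimal sketch of a scoring harness. The digit positions and scoring scheme are my own simplification, not the cited papers' exact protocol, and the "model answer" here is just a placeholder string you'd swap for a real LLM reply:

```python
import random

def make_problem(n_digits, rng):
    # Sample two n-digit operands and return the prompt plus ground truth.
    lo, hi = 10 ** (n_digits - 1), 10 ** n_digits - 1
    a, b = rng.randint(lo, hi), rng.randint(lo, hi)
    return f"What is {a} * {b}?", str(a * b)

def digit_accuracy(answer, truth):
    # Score the first, middle, and last digit separately, mirroring the
    # per-position breakdown described in the quoted paragraph.
    if len(answer) != len(truth):
        return {"first": 0, "middle": 0, "last": 0}
    mid = len(truth) // 2
    return {
        "first": int(answer[0] == truth[0]),
        "middle": int(answer[mid] == truth[mid]),
        "last": int(answer[-1] == truth[-1]),
    }

rng = random.Random(0)
prompt, truth = make_problem(8, rng)
# A real experiment would send `prompt` to a model and score its reply:
print(digit_accuracy(truth, truth))  # a fully correct answer scores 1 everywhere
```

Averaging these per-position scores over many sampled problems is what lets you say "worse on middle digits than on the first digit" rather than just "wrong overall".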
This is very misleading and I think flat out wrong. What's the best way to falsify this claim?
Edit: I tried falsifying it.
https://chatgpt.com/share/6999b72a-3a18-800b-856a-0d5da45b94...
https://chatgpt.com/share/6999b755-62f4-800b-912e-d015f9afc8...
I provided really hard 20-digit multiplications without tools. If you look at the reasoning trace, it does what is normally expected and gets it right. I think this is enough to suggest that the claims made in the paper are not valid and that LLMs do reason well.
To anyone who would disagree: can you provide a counterexample that can't be solved by GPT-5 Pro but that a normal student could do without mistakes?
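For what it's worth, verifying a transcript like that is trivial: Python integers are arbitrary precision, so you can check any claimed 20-digit product exactly. The operands below are made up (the actual numbers are in the truncated share links, not in this thread):

```python
# Hypothetical 20-digit operands standing in for the ones in the transcripts.
a = 73914628503917462850
b = 19283746501928374650

def check(claimed_answer, x, y):
    # Python ints never overflow, so x * y is the exact product.
    return claimed_answer == x * y

print(check(a * b, a, b))  # paste the model's claimed product in place of a * b
```

This only tells you whether the final answer is right, of course; it says nothing about whether the reasoning trace honestly reflects how the answer was produced, which is the point raised in the next reply.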
This is not a valid experiment, because GPT models always have access to certain tools and will use them even if you tell them not to. They will fib the chain of thought after the fact to make it look like they didn't use a tool.
https://www.anthropic.com/research/alignment-faking
It's also well established that all the frontier models use Python for math problems, not just the GPT family.
Is that enough to falsify?
I'm willing to bet this is no longer true as well. We have models that are doing better than humans at IMO.
Not really. From my brief experience, they can guess the final answer, but the intermediate justifications and proofs are completely hallucinated bullshit.
(Possibly because the final answer is usually some sort of neat and beautiful expression, and human evaluators don't care much about the final answer anyway: in any olympiad you're graded on the soundness of your reasoning.)
IMO, symbolic AI is way too brittle and case-by-case to drive useful AI, but as a memory and reasoning system for more dynamic and flexible LLMs to call out to, it's a good idea.
I asked GPT to compute some hard multiplications and the reasoning trace seems valid and gets the answer right.
https://chatgpt.com/share/6999b72a-3a18-800b-856a-0d5da45b94...