So it's only been about a year since the last big breakthrough.
I think we got a second big breakthrough with Google's results on the IMO problems.
For this reason I think we're very far from hitting a wall. Maybe 'LLM parameter scaling is hitting a wall'. That might be true.
Yes, it was a breakthrough, but it saturated quickly. We'll have to wait for the next one. If they can build adaptive weights into LLMs, then we can talk about different things, but test-time scaling is coming to an end as hallucination rates increase. No sign of AGI.
I don't believe your assessment though. IMO is hard, and Google has said that they use search and some way of combining different reasoning traces, so while I haven't read that paper yet, and of course it may support your view, I just don't believe it.
We are not close to solving IMO with publicly known methods.
Of course, people regarded things like GSM8k with trained reasoning traces as reasoning too, but it's pretty obviously not quite the same thing.
I think the actual effect of releasing more models every month has been to give people the impression that progress is happening. Despite claims of exponentially improved performance and the ability to replace PhDs, doctors, and lawyers, it still routinely can't be trusted any more than the original ChatGPT could, despite years of effort.
To the point on hallucination - that's just the nature of LLMs (and humans, to some extent). Without new architectures or fact-checking world models in place, I don't think that problem will be solved anytime soon. But it seems GPT-5's main selling point is that they somehow reduced the hallucination rate by a lot, plus search helps with grounding.
A whole 8 months ago.
The crypto-level hype claims are all BS, and we all knew that, but I do use an LLM more than Google now, which is the "there there," so to speak.
This does feel like a flatlining of the hype though, which is great, because I don't know if I could take the AI hype train for much longer.
It is easier to get from 0% accurate to 99% accurate than it is to get from 99% accurate to 99.9% accurate.
This is like the classic 9s problem in SRE. Each nine is exponentially more difficult.
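To make the nines arithmetic concrete, here's a quick back-of-the-envelope sketch (plain Python, illustrative numbers only, `errors_to_fix` is just a name I made up): each added nine cuts the remaining error rate by a factor of ten, so every nine requires eliminating 90% of whatever errors are left, and those residual errors are by definition the hardest ones.

```python
# Illustrative "nines" arithmetic: each extra nine of accuracy
# means eliminating 90% of whatever errors remain.
def errors_to_fix(current: float, target: float) -> float:
    """Fraction of the remaining errors that must be eliminated
    to move from `current` accuracy to `target` accuracy."""
    return 1 - (1 - target) / (1 - current)

# 0% -> 99%: fix 99% of all errors.
print(errors_to_fix(0.0, 0.99))    # 0.99
# 99% -> 99.9%: still need to fix 90% of the (rare, hard) remaining errors.
print(errors_to_fix(0.99, 0.999))  # ~0.9
```

Same relative effort per nine, but applied to an ever-shrinking, ever-harder tail of failures.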
How easy do we really think it will be for an LLM to get 100% accurate at physics, when we don't even know what 100% right is, and when 100% accuracy may not even be physically possible?