0 points · solenoid0937 · 1mo ago

People complain about a lot of things. Claude has been fine:

https://marginlab.ai/trackers/claude-code-historical-perform...

10 comments · 5 top-level

addisonj · 1mo ago

I will be the first to acknowledge that humans are a bad judge of performance and that some of the allegations are likely just hallucinations...

But... Are you really going to completely rely on benchmarks that have time and time again been shown to be gamed as the complete story?

My take: It is pretty clear that the capacity crunch is real and the changes they made to effort are in part to reduce that. It likely changed the experience for users.

Majromax · 1mo ago · 2 in thread

While that's a nice effort, the inter-run variability is too high to diagnose anything short of catastrophic model degradation. The typical 95% confidence interval runs from 35% to 65% pass rates, a full factor of two performance difference.

Moreover, on the companion codex graphs (https://marginlab.ai/trackers/codex-historical-performance/), you can see a few different GPT model releases marked yet none correspond to a visual break in the series. Either GPT 5.4-xhigh is no more powerful than GPT 5.2, or the benchmarking apparatus is not sensitive enough to detect such changes.

yorwba · 1mo ago

Yes, MarginLab only tests 50 tasks a day, which is too few to give a narrower confidence interval. On the other hand, this really calls into question claims of performance degradation that are based on less intensive use than that. Variance is just so high that long streaks of bad luck are to be expected and plausibly the main source of such complaints. Similarly, it's unlikely you can measure a significant performance difference between models like GPT 5.4-xhigh and GPT 5.2 unless you have a task where one of them almost always fails or one almost always succeeds (thus guaranteeing low variance), or you make a lot of calls (i.e. probably through the API and not in interactive mode.)
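The arithmetic behind the 50-task point can be checked with a rough sketch. It assumes each task is an independent pass/fail trial with the same pass probability, which real benchmark tasks only approximate. The 5-point gap between two models in `tasks_needed` is a hypothetical figure chosen for illustration, not something reported in this thread.

```python
import math

def wilson_interval(passes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial pass rate (z=1.96 gives ~95%)."""
    p = passes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half

def tasks_needed(p1: float, p2: float,
                 z_alpha: float = 1.96, z_beta: float = 0.84) -> int:
    """Approximate runs per model needed to tell pass rates p1 and p2 apart
    with an unpaired two-proportion z-test (two-sided 5%, roughly 80% power)."""
    p_bar = (p1 + p2) / 2
    numerator = (z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return math.ceil(numerator / (p1 - p2) ** 2)

# One day of 50 tasks with half of them passing.
low, high = wilson_interval(25, 50)
print(f"95% interval at 25/50: {low:.1%} to {high:.1%}")

# Hypothetical: one model passes 50% of tasks, the other 55%.
print("runs per model to detect 50% vs 55%:", tasks_needed(0.50, 0.55))
```

Under these assumptions a single day at a 50% pass rate gives an interval of roughly 37% to 63%, in line with the 35% to 65% range quoted earlier in the thread, and separating two models a few points apart takes on the order of a thousand or more runs each.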

Majromax · 1mo ago

> Similarly, it's unlikely you can measure a significant performance difference between models like GPT 5.4-xhigh and GPT 5.2 unless you have a task where one of them almost always fails or one almost always succeeds

That feels like a concession to the limited benchmarking framework. 5.4-xhigh is supposed to be (and is widely believed to be) a better model than 5.2, so if that's invisible in the benchmarking scores then the protocol has problems. The test probably should include cases that should be 'easy passes' or 'near always failures', and then paired testing could offer greater precision on improvements or degradations.

Conversely, if model providers also don't do this then they could be accidentally 'benchmaxxing' if they use protocols like this to set dynamic quantization levels for inference. All you really need for a credible observation of problems from 'less intensive use' is a problem domain that isn't well-covered by the measured and monitored benchmark.
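One way to read the paired-testing suggestion above is an exact McNemar test: run both models on the same tasks and look only at the tasks where they disagree. This is a minimal sketch of that idea, not a description of how MarginLab or any provider actually evaluates models, and the per-task outcomes in the example are made up for illustration.

```python
from math import comb

def mcnemar_exact(old: list[bool], new: list[bool]) -> float:
    """Two-sided exact McNemar p-value for paired pass/fail results.

    old[i] and new[i] are the outcomes of two models on the same task i.
    Tasks where both pass or both fail carry no information about which
    model is better, so only the discordant pairs are counted.
    """
    if len(old) != len(new):
        raise ValueError("both models must be run on the same task list")
    only_old = sum(o and not n for o, n in zip(old, new))
    only_new = sum(n and not o for o, n in zip(old, new))
    discordant = only_old + only_new
    if discordant == 0:
        return 1.0
    k = min(only_old, only_new)
    # Under the null, each discordant task favors either model with p = 0.5.
    tail = sum(comb(discordant, i) for i in range(k + 1)) / 2 ** discordant
    return min(1.0, 2 * tail)

# Made-up example with 50 shared tasks:
# 20 both pass, 18 both fail, 2 only the old model passes, 10 only the new one.
old = [True] * 20 + [False] * 18 + [True] * 2 + [False] * 10
new = [True] * 20 + [False] * 18 + [False] * 2 + [True] * 10
print(f"old: {sum(old)}/50, new: {sum(new)}/50, p = {mcnemar_exact(old, new):.4f}")
```

In this made-up case the headline pass rates are 22/50 versus 30/50, which overlap heavily as independent samples, while the paired comparison rests on the 12 tasks where the models disagree. Tasks that are easy passes or near-certain failures for both models fall out of the comparison entirely, which is why pairing can tighten the estimate.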

jofzar · 1mo ago

Matrix also found that Claude was AB testing 4.6 vs 4.7 in production for the last 12 days.

https://matrix.dev/blog-2026-04-16

cbg0 · 1mo ago · 3 in thread

That performance monitor is super easy to game if you cache responses to all the SWE bench questions.

solenoid0937 (OP) · 1mo ago

You dramatically overestimate how much time engineers at hypergrowth startups have on their hands

dns_snek · 1mo ago

There's a direct business incentive to game/cheat benchmarks, it wouldn't even be difficult to do, and besides, they have workforce-replacing AI to do it for them.

cbg0 · 1mo ago

Caching some data is time consuming? They can just ask Claude to do it.

sumedh · 1mo ago

Your link shows there have been huge drops.

How is it fine?
