https://marginlab.ai/trackers/claude-code-historical-perform...
But... Are you really going to completely rely on benchmarks that have time and time again be shown to be gamed as the complete story?
My take: It is pretty clear that the capacity crunch is real and the changes they made to effort are in part to reduce that. It likely changed the experience for users.
Moreover, on the companion codex graphs (https://marginlab.ai/trackers/codex-historical-performance/), you can see a few different GPT model releases marked yet none correspond to a visual break in the series. Either GPT 5.4-xhigh is no more powerful than GPT 5.2, or the benchmarking apparatus is not sensitive enough to detect such changes.
That feels like a concession to the limited benchmarking framework. 5.4-xhigh is supposed to be (and is widely believe to be) a better model than 5.2, so if that's invisible in the benchmarking scores then the protocol has problems. The test probably should include cases that should be 'easy passes' or 'near always failures', and then paired testing could offer greater precision on improvements or degradations.
Conversely, if model providers also don't do this then they could be accidentally 'benchmaxxing' if they use protocols like this to set dynamic quantization levels for inference. All you really need for a credible observation of problems from 'less intensive use' is a problem domain that isn't well-covered by the measured and monitored benchmark.
How is it fine?