So there's no ground truth; they're just benchmarking how impressive an LLM's code review sounds to a different LLM. Hard to tell what to make of that.
It's supposed to leverage a "generate vs. critique" gap in skill level as a form of self-improvement. It's easier to judge how good food is vs. make it.
But here's the thing. When it comes to code review, you need to be effectively as skilled as the person who wrote it. There isn't really a gap.
And then the real clincher is this: LLMs already have a skill gap between their judgement and generation abilities. The reason is that they have superhuman pattern matching and memorization. They can use those memorized patterns as a massive crutch for their actual generation skills, but they can't do the same for the judgement calls code review requires.
It's not hard. You are visiting a website with an .ai domain. You already know what the conclusions will be.
I had some fun trying to answer it, setting aside whether or not the premise is true, for argument's sake.
My answer is:
I would think "attempting to assess reality in a way that is not grounded in reality" is hard to ignore due to a combination of factors: it's what is available, it's easy to understand, and it seems useful (decoupled from whether it really is). In short, it's hard to ignore because it's most of what is available to us for consumption and is easy to make "consumable."
I think there is a LARGE overlap in this topic with my pet peeve and hatred of mock tests in development. They are not completely useless, but their obvious flaws and vulnerabilities seem to me to be in the same area: "Not grounded in reality."
Said another way: it's what's easy to make, so there is a lot of it, creating a positive feedback loop via the mere-exposure effect. It then becomes hard to ignore because it's what's shoved in our faces.
This is key.
Public benchmarks are essentially trust-based and the trust just isn't there.
With public benchmarks we're trusting the labs not to cheat. And it's easy to "cheat" accidentally - they actually need to make a serious effort to not contaminate the training data.
And there are massive incentives for the labs to cheat in order to get the hype going around their launch and justify their massive investments in training. It doesn't have to be the CEO who's directing it. It can even be one or a few researchers who are responsible for a specific area of model performance and are under tremendous pressure to deliver.
The image shows it with a score of 62.7, not 58.5.
Which is right? Mistakes like this undermine the legitimacy of a closed benchmark, especially one judged by an LLM.
IME the highest value (at the moment) is having an LLM integrated into the PR page, that reads your code + CI log, and effectively operates as a sanity check / semantic linter.
A common workflow for us is: Draft PR -> Passes CI (inclusive of an LLM 'review') -> Published -> Passes human review -> Scheduled to merge
The goal is to get a higher margin of confidence that your code (1) will not blow up in production (2) faithfully does what it's trying to do.
The value of the LLM reviewer is maybe 80% in the first bucket and 20% in the second bucket, IME. It often catches bugs like "off by one" and "you meant this to be `if not x`, based on the flag name and behavior, not `if x`".
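A minimal sketch of that second bug class, with hypothetical function and flag names invented purely for illustration (the flag's name contradicts the condition, which is exactly the kind of semantic mismatch an LLM reviewer tends to flag):

```python
def validate(records):
    # Minimal stand-in validator for the sketch.
    for r in records:
        if not r:
            raise ValueError("empty record")

def process(records, skip_validation=False):
    """Process records, optionally skipping validation."""
    # BUG: the condition is inverted relative to the flag's name, so
    # validation runs exactly when the caller asked to skip it.
    # Based on the flag name and intended behavior, this should be:
    #     if not skip_validation:
    if skip_validation:
        validate(records)
    return [r.strip() for r in records]
```

A human reviewer skims right past this because the code is syntactically fine; it only breaks when you compare the identifier's meaning against the control flow.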
I haven't found it to be really useful so far, but it's also very little added work, so for now I keep on using it. If it saves my ass even just once, it will probably be worth it overall.
That's a common fallacy of safety by the way :)
It could very well "save your ass" just once (whatever that means) while costing you more in time, opportunity, effort, or a false sense of safety, generating more harm overall than it ultimately saves you.
Resulted in the following answers:
- Gemini 2.5 Flash: Gemini 2.5 Flash
- Claude Sonnet 4: Claude Sonnet 4
- ChatGPT: GPT-5
To me it's conceivable that GPT-4o would be biased toward output generated by other OpenAI models.
The self-preference is almost certainly coming from post-processing, or more likely because the model name is inserted into the system prompt.
1) Gemini 2.5 Pro ranks only non-Google models 2) Claude 4.1 Opus ranks only non-Anthropic models 3) GPT-5-thinking ranks only non-OpenAI models 4) Then sum up the rankings and sort by the sum.
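The aggregation step of that scheme could be sketched as follows; the vendor mapping, judge names, and sample rankings are made-up placeholders, and a real version would need to normalize for the fact that each model is scored by only some of the judges:

```python
# Cross-vendor judging sketch: each judge ranks only models from other
# vendors, then summed ranks are sorted ascending (lower total = better).

VENDOR = {
    "gemini-2.5-pro": "google",
    "claude-4.1-opus": "anthropic",
    "gpt-5-thinking": "openai",
    "model-x": "other",
}

def aggregate(judge_rankings):
    """judge_rankings: {judge_model: [models ordered best -> worst]}.

    Each judge's list must exclude its own vendor's models, which is
    what removes the self-preference channel entirely.
    """
    totals = {}
    for judge, ranking in judge_rankings.items():
        for rank, model in enumerate(ranking, start=1):
            assert VENDOR[model] != VENDOR[judge], "judge scored its own vendor"
            totals[model] = totals.get(model, 0) + rank
    # Caveat: a model skipped by its own vendor's judge accumulates fewer
    # rank terms, so raw sums favor it slightly; normalize in practice.
    return sorted(totals.items(), key=lambda kv: kv[1])
```

The point isn't the arithmetic, it's the constraint: no judge ever sees its own vendor's output, so the self-preference bias discussed above can't enter the totals.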
thanks, but no thanks, I don't buy such marketing propaganda.
It’d be harder to juice benchmarks if output tokens from a random sample of ~100 top models were drawn in this manner while evaluating the target model’s output.
On second thought, I’m slapping AGPL on this idea. Please hire me and give me one single family house in a California metro as a bonus. Thanks.
Example: Given a long travel journal, "How many cities does the author mention?"
GPT-5: 12. Expected: 17.
I've only seen it go above 5000 for very difficult style transfer problems where it has to wrangle with the micro-placement of lots of text. Or difficult math problems.
The sentence is too obviously LLM-generated, but whatever.
> Weaknesses:
>
> False positives: A few reviews include incorrect or harmful fixes.
> Inconsistent labeling: Occasionally misclassifies the severity of findings or touches forbidden lines.
> Redundancy: Some repetition or trivial suggestions that dilute review utility.
wtf are "forbidden lines"?