So there's no ground truth; they're just benchmarking how impressive an LLM's code review sounds to a different LLM. Hard to tell what to make of that.
It's supposed to leverage a "generate vs. critique" gap in skill level as a form of self-improvement. It's easier to judge how good food is vs. make it.
But here's the thing. When it comes to code review, you need to be effectively as skilled as the person who wrote it. There isn't really a gap.
And then the real clincher is this: LLMs already have a skill gap between their judgement and generation abilities. The reason is that they have superhuman pattern matching and memorization. They can use those memorized patterns as a massive crutch for their actual generation skills, but they can't do the same for the judgement calls code review requires.
It's not hard. You are visiting a website with an .ai domain. You already know what the conclusions will be.
I had some fun trying to answer it, setting aside whether or not the premise is true, for argument's sake.
My answer is:
I would think "attempting to assess reality in a way that is not grounded in reality" is hard to ignore due to a combination of factors: it's what is available, it's easy to understand, and it seems useful (decoupled from whether it really is). In short, it's hard to ignore because it's most of what is available to us for consumption and is easy to make "consumable."
I think there is a LARGE overlap in this topic with my pet peeve and hatred of mock tests in development. They are not completely useless, but their obvious flaws and vulnerabilities seem to me to be in the same area: "Not grounded in reality."
Said another way: it's what's easy to make, so there is a lot of it, creating a positive feedback loop via the mere-exposure effect. It then becomes hard to ignore because it's what's shoved in our faces.
This is key.
Public benchmarks are essentially trust-based and the trust just isn't there.
With public benchmarks we're trusting the labs not to cheat. And it's easy to "cheat" accidentally - they actually need to make a serious effort to not contaminate the training data.
And there are massive incentives for the labs to cheat in order to get the hype going around their launch and justify their massive investments in training. It doesn't have to be the CEO who's directing it. It can even be one or a few researchers who are responsible for a specific area of model performance and are under tremendous pressure to deliver.
The image shows it with a score of 62.7, not 58.5.
Which is right? Mistakes like this undermine the legitimacy of a closed benchmark, especially one judged by an LLM.
IME the highest value (at the moment) is having an LLM integrated into the PR page, that reads your code + CI log, and effectively operates as a sanity check / semantic linter.
A common workflow for us is: Draft PR -> Passes CI (inclusive of an LLM 'review') -> Published -> Passes human review -> Scheduled to merge
The goal is to get a higher margin of confidence that your code (1) will not blow up in production (2) faithfully does what it's trying to do.
The value of the LLM reviewer is maybe 80% in the first bucket and 20% in the second bucket, IME. It often catches bugs like "off by one" and "you meant this to be `if not x`, based on the flag name and behavior, not `if x`".
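A minimal sketch of that second bug class, with hypothetical function and flag names invented purely for illustration (the flag's name contradicts the condition, which is exactly the kind of semantic mismatch an LLM reviewer tends to flag):

```python
def validate(records):
    # Minimal stand-in validator for the sketch.
    for r in records:
        if not r:
            raise ValueError("empty record")

def process(records, skip_validation=False):
    """Process records, optionally skipping validation."""
    # BUG: the condition is inverted relative to the flag's name, so
    # validation runs exactly when the caller asked to skip it.
    # Based on the flag name and intended behavior, this should be:
    #     if not skip_validation:
    if skip_validation:
        validate(records)
    return [r.strip() for r in records]
```

A human reviewer skims right past this because the code is syntactically fine; it only breaks when you compare the identifier's meaning against the control flow.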
I haven't found it to be really useful so far, but it's also very little added work, so for now I keep on using it. If it saves my ass even just once, it will probably be worth it overall.
That's a common fallacy of safety by the way :)
It could very well "save your ass" just once (whatever that means) while costing you more in time, opportunity, effort, or a false sense of safety, generating more harm overall than it ultimately saves you.
Resulted in the following answers:
- Gemini 2.5 Flash: Gemini 2.5 Flash
- Claude Sonnet 4: Claude Sonnet 4
- ChatGPT: GPT-5
To me it's conceivable that GPT-4o would be biased toward output generated by other OpenAI models.
The self-preference is almost certainly coming from post-processing, or more likely because the model name is inserted into the system prompt.
1) Gemini 2.5 Pro ranks only non-Google models 2) Claude 4.1 Opus ranks only non-Anthropic models 3) GPT-5-thinking ranks only non-OpenAI models 4) Then sum up the rankings and sort by the sum.
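The aggregation step of that scheme could be sketched as follows; the vendor mapping, judge names, and sample rankings are made-up placeholders, and a real version would need to normalize for the fact that each model is scored by only some of the judges:

```python
# Cross-vendor judging sketch: each judge ranks only models from other
# vendors, then summed ranks are sorted ascending (lower total = better).

VENDOR = {
    "gemini-2.5-pro": "google",
    "claude-4.1-opus": "anthropic",
    "gpt-5-thinking": "openai",
    "model-x": "other",
}

def aggregate(judge_rankings):
    """judge_rankings: {judge_model: [models ordered best -> worst]}.

    Each judge's list must exclude its own vendor's models, which is
    what removes the self-preference channel entirely.
    """
    totals = {}
    for judge, ranking in judge_rankings.items():
        for rank, model in enumerate(ranking, start=1):
            assert VENDOR[model] != VENDOR[judge], "judge scored its own vendor"
            totals[model] = totals.get(model, 0) + rank
    # Caveat: a model skipped by its own vendor's judge accumulates fewer
    # rank terms, so raw sums favor it slightly; normalize in practice.
    return sorted(totals.items(), key=lambda kv: kv[1])
```

The point isn't the arithmetic, it's the constraint: no judge ever sees its own vendor's output, so the self-preference bias discussed above can't enter the totals.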
thanks, but no thanks, I don't buy such marketing propaganda.
It’d be harder to juice benchmarks if output tokens from a random sample of ~100 top models were drawn in this manner while evaluating the target model’s output.
On second thought, I’m slapping AGPL on this idea. Please hire me and give me one single family house in a California metro as a bonus. Thanks.
Example: Given a long travel journal, "How many cities does the author mention?"
GPT-5: 12. Expected: 17.
I've only seen it go above 5000 for very difficult style transfer problems where it has to wrangle with the micro-placement of lots of text. Or difficult math problems.
The sentence is too obviously LLM-generated, but whatever.
> Weaknesses:
>
> False positives: A few reviews include incorrect or harmful fixes.
> Inconsistent labeling: Occasionally misclassifies the severity of findings or touches forbidden lines.
> Redundancy: Some repetition or trivial suggestions that dilute review utility.
wtf are "forbidden lines"?