For django-31056, they claim the AI-generated patch is "incomplete" because it's "missing critical parts of this logic, such as the try-except block and the check for a running event loop". But if you look at the diff, that's clearly wrong. The try-except block and the running-event-loop check were already there before the patch. The human patch just indented them, making them show up as both - and + lines, while the AI patch didn't. To me, the AI patch seems correct. It's slightly less efficient than the human patch when DJANGO_ALLOW_ASYNC_UNSAFE is set, but slightly more efficient when it isn't (which is the common case!). The human patch does feel more natural, but the AI patch is fine. I'd grade it a tie between human and AI.
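For anyone who doesn't want to dig through the diff, here is roughly how I read the two patches. This is a simplified sketch of the logic in `django/utils/asyncio.py`, not the literal diff, and the exact structure may differ:

```python
import asyncio
import os


def human_version(message):
    # Human patch (roughly): consult the env var first, and only then look
    # for a running event loop. Skips the try/except entirely when
    # DJANGO_ALLOW_ASYNC_UNSAFE is set.
    if not os.environ.get("DJANGO_ALLOW_ASYNC_UNSAFE"):
        try:
            event_loop = asyncio.get_event_loop()
        except RuntimeError:
            pass
        else:
            if event_loop.is_running():
                raise RuntimeError(message)  # SynchronousOnlyOperation in Django


def ai_version(message):
    # AI patch (roughly): leave the existing try/except and running-loop
    # check where they were, and only consult the env var once a running
    # loop has actually been found.
    try:
        event_loop = asyncio.get_event_loop()
    except RuntimeError:
        pass
    else:
        if event_loop.is_running():
            if not os.environ.get("DJANGO_ALLOW_ASYNC_UNSAFE"):
                raise RuntimeError(message)
```

Both raise in exactly the same situations; they only differ in which check happens first.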
For django-32517, they claim that the human and AI patches "produce entirely different outputs", but actually they do exactly the same thing. The human version has `reversed(self.dict)`, while the AI version has `reversed(self.dict.keys())`. Calling `reversed()` on a dict walks its keys in reverse insertion order, exactly as it does on the dict's keys view, so it doesn't matter whether you call `.keys()` first. The human patch is more idiomatic, but it's also more confusing, as shown by the fact that it confused the authors of this paper. I'd grade it another tie.
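If you want to convince yourself, a plain dict (which is what `OrderedSet` wraps) behaves identically either way on Python 3.8+:

```python
d = {"a": 1, "b": 2, "c": 3}

# reversed() on a dict iterates its keys in reverse insertion order,
# exactly like reversed() on its keys view.
assert list(reversed(d)) == list(reversed(d.keys())) == ["c", "b", "a"]
```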
Edit: I tried to sign up for OpenReview so I could leave a comment about this, but the system wouldn't let me register without completing a form that assumes you have an academic position. Perhaps I should email the authors.
According to the paper:
> 1. Solution leak: represents instances where the solution to the issue is clearly outlined in the issue description or comments on GitHub. Since both the issue descriptions and comments (referred to as hints_text in the SWE-Bench study) are provided as input to the models, these LLM models can extract the solutions directly from this information instead of generating it independently.
And yet, the SWE-Bench authors themselves explicitly state:
> In short, for participating on the SWE-bench leaderboard, using hints_text in any manner is not allowed. Although we don't explicitly say this in the original paper, we also do not make any mention of using the hints_text anywhere.
So it's a made-up issue that would only occur if you deviated from the paper's implementation and explicitly added a field called "hints" that isn't used anywhere.
[1] Don't ask me why they cited the issue number, 16669, instead of the pull request number, 16766, when only the latter appears in the dataset. This confused me for a bit.
Although I agree with your analysis and it doesn't look great for the authors, this issue (https://code.djangoproject.com/ticket/32517) arguably falls into their "Solution leak" category anyways, as the following text appears in the issue description (and so I think directly in `problem_statement` rather than `hints_text`):
> Currently, OrderedSet isn't reversible (i.e. allowed to be passed as an argument to Python's reversed()). This would be natural to support given that OrderedSet is ordered. This should be straightforward to add by adding a __reversed__() method to OrderedSet.
It isn't the exact code though, so I suppose it could be argued instead that the issue is just extremely easy.
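For a sense of how small the change is, here is a minimal stand-in for Django's `OrderedSet` with the requested method added. This is a sketch of what the issue asks for, not the actual Django patch:

```python
class OrderedSet:
    """Simplified stand-in for django.utils.datastructures.OrderedSet."""

    def __init__(self, iterable=None):
        # Backed by a dict, which preserves insertion order.
        self.dict = dict.fromkeys(iterable or ())

    def __iter__(self):
        return iter(self.dict)

    def __contains__(self, item):
        return item in self.dict

    # The one-method fix under discussion: the human patch used
    # reversed(self.dict), the AI patch reversed(self.dict.keys());
    # the behaviour is identical.
    def __reversed__(self):
        return reversed(self.dict)
```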
I've been playing around with some automated code review tools recently, and it's surprising how often they flag things that are technically correct but just... unusual. Style matters, especially for maintainability.
IMHO, it is probably better to discard this paper, and wait for someone else to cover this important topic.
This matches my intuition about the coding performance of these models a lot better. I don't think any current coding benchmark accurately measures coding performance.
In my case, I would guess less than 10% of the code I get out of AIs is useful.
What sort of code are you getting those results with? Is it yet-another-react-frontend-button? Is it eBPF programs? Is it a parser in Rust?
For the latter two, I've found AI to have pretty low success rates, and for the former I haven't had the desire to try.
If you don’t know what you’re doing, these things can sometimes produce good code, and sometimes produce things that don’t work at all.
Matches my experience pretty well too. It'll usually output something that a novice would assume is correct but an expert can clearly identify as "know-it-all teenager forum post" level stuff.
It also goes to how a lot of people misunderstand the replication crisis. 'Hard science' really should replicate - we should be able to filter out sources of error and variance because the phenomena (generally) aren't affected by our attempts to measure them. Making social science replicate often requires so much control that it is deabstracted from reality, meaning the effort at replication reduces the value and usefulness of the knowledge. Generalizable claims are hard because the sources of variance are so much larger and more complex.

Speaking as someone who went through a transition from engineering to social sciences, this is the concept that made the transition hard. I started my time in social sciences with a cool idea of a whole career based on just doing replication studies, because science. That was... useful and stupid at the same time.
I find the models very useful to chat about library documentation or high level algorithm concepts, but I find the code it generates to be… I don’t know how else to say it… really bad and often out of context.
I know developers who blindly follow the hype and use them to generate production code. That scares the poop emoji out of me, and the code reads like an asset flipped 3D game.
OAI, xAI, Anthropic, and Google all score incredibly well, then you go to try and write code and it's just okay.
They claim it can do PhD-level reasoning, but here I am not trusting it on basic computational thinking.
Not sure that's really the claim. I think they claim that performance on benchmarks like GPQA indicates PhD-level knowledge of different fields.
1. Did the benchmark authors not review the issues and make sure the solution was not present in the issue?
2. Are the issues locked after they’re included in the dataset? You’d think they would be immutable for reproducibility.
3. For the agents writing patches, is test running part of their inner loop validation? If they write a patch that makes the test pass, then the job's done. Or is that validation step kept secret from the agent? I don't see how, unless the tests aren't part of the repo.
I looked at a bunch of issues in the dataset when SWE-Verified first came out and I was trying to make scaffolding to solve it, and I don't remember a single time where the solution existed verbatim in the issue. I'm not saying it never happens, but it would have to be rare.
> 2. Are the issues locked after they’re included in the dataset?
No one changes the issues in the dataset, but of course the original issue on GitHub will have been resolved long ago. The models don't have access to this in their context, but if they were trained on GitHub there's a very real risk that they've seen the solution.
> 3. For the agents writing patches, is test running part of their inner loop validation? If they write a patch that makes the test pass, then the job's done. Or is that validation step kept secret from the agent? I don't see how, unless the tests aren't part of the repo.
The tests aren't provided to the model, they are run after the model has proposed its final answer.
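To make that split concrete, my mental model of the scoring step is roughly the following. This is a conceptual sketch, not the actual SWE-Bench harness; I'm treating FAIL_TO_PASS/PASS_TO_PASS as plain lists of test ids (in the published dataset they're JSON-encoded), and `run_test` is a hypothetical callable that executes one test in the patched checkout:

```python
def evaluate_instance(instance, run_test):
    # The model's patch is already final at this point; it never saw these
    # test lists while generating the patch.
    fail_to_pass = instance["FAIL_TO_PASS"]  # tests that must flip from failing to passing
    pass_to_pass = instance["PASS_TO_PASS"]  # existing tests that must not regress
    return all(run_test(t) for t in fail_to_pass + pass_to_pass)
```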
> Whether we consider the issue description to be underspecified and hence unfair to be testing on.
> Whether the FAIL_TO_PASS unit tests filter out valid solutions.
and a bit more. This is pointed out in the linked paper too.
The moral of the story, to me, is: don't believe the paid human annotator. You can (hopefully) still believe the PhD students doing these unpaid jobs as their research ;-)
[1] https://openai.com/index/introducing-swe-bench-verified/
If anyone can find a better title (i.e. more accurate and neutral, preferably using language from the article itself) we can change it again.
Every quarter, you have a couple thousand volunteers provide 2 GitHub issues from the past 3 months, which are nontrivial to resolve and for which strong test cases exist. Each volunteer then cross-checks 2 issues from other volunteers. The volunteers get 1 month of free subscription to some AI service in return.
This dataset is then published as SWE-UberBench-2025-02 or something. People can then only evaluate their coding LLM on datasets published after their training period.
1) No known solutions, so there's no "ground truth" dataset to train on
2) Presumably hard to solve
3) But easy to verify a solution if one is provided.
This, of course, is easier done on the STEM side of things, but how do you automatically test creativity, or philosophical aptitude?
Looking at the benchmark, https://www.swebench.com/, about half of scored submissions score under 1/3 correct? So they're either not cheating, or not cheating effectively?
Anyways, another interpretation is that the model also needs to decide whether the code in the issue is a reliable fix or not.
It's so vital that it's not leaked, that it's fit for purpose, and that it's manually assessed. These general-purpose, public benchmarks based on questionable metrics are effectively worthless for assessing real programming skill.
Case in point, as others have mentioned here, Claude scores modestly on these benchmarks but vastly better than the alternatives in practice. I don't trust Claude fully but far more than OpenAI models; it's not even close. The IRL performance advantage is not reflected in any of these benchmarks.
Instead of resolving it, some leaders are further complicating their meaning.
Such as OpenAI grading their benchmarks based on "how much money they made" or "how easy a model was convinced to hand over fake money".
I always tell my customers to ignore benchmarks and compare outcomes with their own workloads. Benchmarks are almost completely useless in the real world.
Or, as in the case of LLMs and benchmarks: When a benchmark becomes a target, it ceases to be a good benchmark.
This is fine, many of my real tickets already explain the solution. A good ticket often offers a solution or where to start looking.
To me, the analysis of SWE-Bench is a solid contribution and informative. My guess is that to meet the conference's submission bar they had to come up with their own benchmark (SWE-Bench+), which wasn't thorough enough, and the paper got rejected mainly because of that.
Is this what Hofstadter means by a strange-loop?