For django-31056, they claim the AI-generated patch is "incomplete" because it's "missing critical parts of this logic, such as the try-except block and the check for a running event loop". But if you look at the diff, that's clearly wrong. The try-except block and the running-event-loop check were already there before the patch. The human patch just indented them, making them show up as both - and + lines, while the AI patch didn't. To me, the AI patch seems correct. It's slightly less efficient than the human patch when DJANGO_ALLOW_ASYNC_UNSAFE is set, but slightly more efficient when it isn't (which is the common case!). The human patch does feel more natural, but the AI patch is fine. I'd grade it a tie between human and AI.
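For anyone who doesn't want to dig through the diff, here is roughly how I read the two patches. This is a simplified sketch of the logic in `django/utils/asyncio.py`, not the literal diff, and the exact structure may differ:

```python
import asyncio
import os


def human_version(message):
    # Human patch (roughly): consult the env var first, and only then look
    # for a running event loop. Skips the try/except entirely when
    # DJANGO_ALLOW_ASYNC_UNSAFE is set.
    if not os.environ.get("DJANGO_ALLOW_ASYNC_UNSAFE"):
        try:
            event_loop = asyncio.get_event_loop()
        except RuntimeError:
            pass
        else:
            if event_loop.is_running():
                raise RuntimeError(message)  # SynchronousOnlyOperation in Django


def ai_version(message):
    # AI patch (roughly): leave the existing try/except and running-loop
    # check where they were, and only consult the env var once a running
    # loop has actually been found.
    try:
        event_loop = asyncio.get_event_loop()
    except RuntimeError:
        pass
    else:
        if event_loop.is_running():
            if not os.environ.get("DJANGO_ALLOW_ASYNC_UNSAFE"):
                raise RuntimeError(message)
```

Both raise in exactly the same situations; they only differ in which check happens first.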
For django-32517, they claim that the human and AI patches "produce entirely different outputs", but actually they do exactly the same thing. The human version has `reversed(self.dict)`, while the AI version has `reversed(self.dict.keys())`. Calling `reversed()` on a dict walks its keys in reverse insertion order, exactly as it does on the dict's keys view, so it doesn't matter whether you call `.keys()` first. The human patch is more idiomatic, but it's also more confusing, as shown by the fact that it confused the authors of this paper. I'd grade it another tie.
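If you want to convince yourself, a plain dict (which is what `OrderedSet` wraps) behaves identically either way on Python 3.8+:

```python
d = {"a": 1, "b": 2, "c": 3}

# reversed() on a dict iterates its keys in reverse insertion order,
# exactly like reversed() on its keys view.
assert list(reversed(d)) == list(reversed(d.keys())) == ["c", "b", "a"]
```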
Edit: I tried to sign up for OpenReview so I could leave a comment about this, but the system wouldn't let me register without completing a form that assumes you have an academic position. Perhaps I should email the authors.
According to the paper:
> 1. Solution leak: represents instances where the solution to the issue is clearly outlined in the issue description or comments on GitHub. Since both the issue descriptions and comments (referred to as hints_text in the SWE-Bench study) are provided as input to the models, these LLM models can extract the solutions directly from this information instead of generating it independently.
And yet, the SWE-Bench authors themselves explicitly state:
> In short, for participating on the SWE-bench leaderboard, using hints_text in any manner is not allowed. Although we don't explicitly say this in the original paper, we also do not make any mention of using the hints_text anywhere.
So it's a made-up issue that would only occur if you deviated from the paper's implementation and explicitly added a field called "hints" that isn't used anywhere.
[1] Don't ask me why they cited the issue number, 16669, instead of the pull request number, 16766, when only the latter appears in the dataset. This confused me for a bit.
Although I agree with your analysis and it doesn't look great for the authors, this issue (https://code.djangoproject.com/ticket/32517) arguably falls into their "Solution leak" category anyways, as the following text appears in the issue description (and so I think directly in `problem_statement` rather than `hints_text`):
> Currently, OrderedSet isn't reversible (i.e. allowed to be passed as an argument to Python's reversed()). This would be natural to support given that OrderedSet is ordered. This should be straightforward to add by adding a __reversed__() method to OrderedSet.
It isn't the exact code though, so I suppose it could be argued instead that the issue is just extremely easy.
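For a sense of how small the change is, here is a minimal stand-in for Django's `OrderedSet` with the requested method added. This is a sketch of what the issue asks for, not the actual Django patch:

```python
class OrderedSet:
    """Simplified stand-in for django.utils.datastructures.OrderedSet."""

    def __init__(self, iterable=None):
        # Backed by a dict, which preserves insertion order.
        self.dict = dict.fromkeys(iterable or ())

    def __iter__(self):
        return iter(self.dict)

    def __contains__(self, item):
        return item in self.dict

    # The one-method fix under discussion: the human patch used
    # reversed(self.dict), the AI patch reversed(self.dict.keys());
    # the behaviour is identical.
    def __reversed__(self):
        return reversed(self.dict)
```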
I've been playing around with some automated code review tools recently, and it's surprising how often they flag things that are technically correct but just... unusual. Style matters, especially for maintainability.
IMHO, it is probably better to discard this paper, and wait for someone else to cover this important topic.
This matches my intuition about the coding performance of these models a lot better. I don't think any current coding benchmark accurately measures coding performance.
In my case, I would guess less than 10% of the code I get out of AIs is useful.
What sort of code are you getting those results with? Is it yet-another-react-frontend-button? Is it eBPF programs? Is it a parser in Rust?
For the latter two, I've found AI to have pretty low success rates, and for the former I haven't had the desire to try.
If you don’t know what you’re doing, these things can sometimes produce good code, and sometimes produce things that don’t work at all.
Matches my experience pretty well too. It'll usually output something that a novice would assume is correct but an expert can clearly identify as "know-it-all teenager forum post" level stuff.
It also goes to how a lot of people misunderstand the replication crisis. 'Hard science' really should replicate - we should be able to filter out sources of error and variance because the phenomena (generally) aren't affected by our attempts to measure them. Making social science replicate often requires so much control that it is deabstracted from reality, meaning the effort at replication reduces the value and usefulness of the knowledge. Generalizable claims are hard because the sources of variance are so much larger and more complex.

Speaking as someone who went through a transition from engineering to social sciences, this is the concept that made the transition hard. I started my time in social sciences with a cool idea of a whole career based on just doing replication studies, because science. That was... useful and stupid at the same time.
I find the models very useful to chat about library documentation or high level algorithm concepts, but I find the code it generates to be… I don’t know how else to say it… really bad and often out of context.
I know developers who blindly follow the hype and use them to generate production code. That scares the poop emoji out of me, and the code reads like an asset flipped 3D game.
OAI, xAI, Anthropic, and Google all score incredibly well, then you go to try and write code and it's just okay.
They claim it can do PhD-level reasoning, but here I am not trusting it on basic computational thinking.
Not sure that's really the claim. I think they claim that performance on benchmarks like GPQA indicates PhD-level knowledge of different fields.
1. Did the benchmark authors not review the issues and make sure the solution was not present in the issue?
2. Are the issues locked after they’re included in the dataset? You’d think they would be immutable for reproducibility.
3. For the agents writing patches, is test running part of their inner loop validation? If they write a patch that makes the test pass, then the job's done. Or is that validation step kept secret from the agent? I don't see how, unless the tests aren't part of the repo.
I looked at a bunch of issues in the dataset when SWE-Verified first came out and I was trying to make scaffolding to solve it, and I don't remember a single time where the solution existed verbatim in the issue. I'm not saying it never happens, but it would have to be rare.
> 2. Are the issues locked after they’re included in the dataset?
No one changes the issues in the dataset, but of course the original issue on GitHub will have been resolved long ago. The models don't have access to this in their context, but if they were trained on GitHub there's a very real risk that they've seen the solution.
> 3. For the agents writing patches, is test running part of their inner loop validation? If they write a patch that makes the test pass, then the job's done. Or is that validation step kept secret from the agent? I don't see how, unless the tests aren't part of the repo.
The tests aren't provided to the model, they are run after the model has proposed its final answer.
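To make that split concrete, my mental model of the scoring step is roughly the following. This is a conceptual sketch, not the actual SWE-Bench harness; I'm treating FAIL_TO_PASS/PASS_TO_PASS as plain lists of test ids (in the published dataset they're JSON-encoded), and `run_test` is a hypothetical callable that executes one test in the patched checkout:

```python
def evaluate_instance(instance, run_test):
    # The model's patch is already final at this point; it never saw these
    # test lists while generating the patch.
    fail_to_pass = instance["FAIL_TO_PASS"]  # tests that must flip from failing to passing
    pass_to_pass = instance["PASS_TO_PASS"]  # existing tests that must not regress
    return all(run_test(t) for t in fail_to_pass + pass_to_pass)
```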
> Whether we consider the issue description to be underspecified and hence unfair to be testing on.
> Whether the FAIL_TO_PASS unit tests filter out valid solutions.
and a bit more. This is pointed out in the linked paper too.
The moral of the story, to me, is: don't believe the paid human annotator. You can (hopefully) still believe the PhD students doing these unpaid jobs as their research ;-)
[1] https://openai.com/index/introducing-swe-bench-verified/
If anyone can find a better title (i.e. more accurate and neutral, preferably using language from the article itself) we can change it again.
Every quarter, you have a couple thousand volunteers provide 2 GitHub issues from the past 3 months, which are nontrivial to resolve and for which strong test cases exist. Each volunteer then cross-checks 2 issues from other volunteers. The volunteers get 1 month of free subscription to some AI service in return.
This dataset is then published as SWE-UberBench-2025-02 or something. People can then only evaluate their coding LLM on datasets published after their training period.
1) No known solutions, so there's no "ground truth" dataset to train on
2) Presumably hard to solve
3) But easy to verify a solution if one is provided.
This, of course, is easier done on the STEM side of things, but how do you automatically test creativity, or philosophical aptitude?
Looking at the benchmark, https://www.swebench.com/, about half of scored submissions score under 1/3 correct? So they're either not cheating, or not cheating effectively?
Anyways, another interpretation is that the model also needs to decide whether the code in the issue is a reliable fix or not.
It's so vital that it's not leaked, that it's fit for purpose, and that it's manually assessed. These general-purpose, public benchmarks based on questionable metrics are effectively worthless for assessing real programming skill.
Case in point, as others have mentioned here, Claude scores modestly on these benchmarks but vastly better than the alternatives in practice. I don't trust Claude fully but far more than OpenAI models; it's not even close. The IRL performance advantage is not reflected in any of these benchmarks.
Instead of resolving it, some leaders are further complicating their meaning.
Such as OpenAI grading their benchmarks based on "how much money they made" or "how easy a model was convinced to hand over fake money".
I always tell my customers to ignore benchmarks and compare outcomes with their own workloads. Benchmarks are almost completely useless in the real world.
Or, as in the case of LLMs and benchmarks: When a benchmark becomes a target, it ceases to be a good benchmark.
This is fine, many of my real tickets already explain the solution. A good ticket often offers a solution or where to start looking.
To me, the analysis of SWE-Bench is a solid contribution and informative. My guess is that to meet the conference's submission bar they had to come up with their own benchmark (SWE-Bench+), which wasn't thorough enough, and the paper got rejected mainly because of that.
Is this what Hofstadter means by a strange-loop?