71.2% puts it at 5th, which is 4 points below the leader (four points is a lot) and just over 1% lower than Anthropic’s own submission for Claude Sonnet 4 - the same model these guys are running.
But the top rated submissions aren’t running production products. They generally have extensive scaffolding or harnesses that were built *specifically for SWE bench*, which kind of defeats the whole purpose of the benchmark.
Take Refact, for example, which sits at #2 with 74.4%: they built a ~2k-line framework around their agent specifically for SWE bench (https://github.com/smallcloudai/refact-bench/). It's pretty elaborate, orchestrating multiple agents, with a debug agent that kicks in if the main agent fails. The debug agent analyzes the failure and feeds insights back to the main agent, which tries again, so it's effectively multiple attempts per problem.
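The retry pattern they describe could be sketched roughly like this (purely illustrative names, not Refact's actual API):

```python
from dataclasses import dataclass

@dataclass
class Result:
    passed: bool
    transcript: str

def solve_with_debug_loop(task, main_agent, debug_agent, max_attempts=3):
    """Main agent attempts the task; on failure, a debug agent analyzes
    the transcript and its insights seed the next attempt."""
    insights = None
    result = None
    for _ in range(max_attempts):
        result = main_agent(task, insights)
        if result.passed:
            return result
        # Debug agent inspects the failed run and suggests a new approach
        insights = debug_agent(task, result.transcript)
    return result
```

This is why it amounts to multiple attempts per problem: each loop iteration is a fresh try, informed by an analysis of the previous failure.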
If the results can be reproduced “out-of-the-box” with their coding agent like they claim, it puts it up there as one of the top 2-3 CLI agents available right now.
https://huggingface.co/datasets/princeton-nlp/SWE-bench_Veri...
It's up to your retrieval system/model to selectively hunt for relevant context. Here are a few critiques of the benchy:
Building multiple attempts into your agent is stretching the rules, even if it's technically acceptable.
I.e. the agent cannot even know which tests are failing.
It has to fix the issue based solely on the issue text, and fix it in the specific way that the hidden unit tests (which it cannot see) expect.
For this reason I find the benchmark a little disconnected from the reality of software engineering.
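To make that critique concrete, the evaluation setup roughly works like this (an illustrative sketch, not the actual SWE-bench harness code):

```python
def evaluate(agent_patch_fn, issue_text, apply_patch, hidden_tests):
    """The agent only ever sees the issue text; grading then runs
    hidden tests that the agent never had access to."""
    patch = agent_patch_fn(issue_text)   # no test visibility here
    patched_code = apply_patch(patch)
    # Pass only if every hidden test passes against the patched code
    return all(test(patched_code) for test in hidden_tests)
```

In real work you'd see the failing tests (or write them), which is the disconnect: the agent is graded against expectations it was never shown.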
Another option might be the LiveBench approach, where new tests are released on a regular basis.
I could understand focusing on a niche business use case, but coding is a main focus of the foundation models themselves.
I think the next step is getting an official "checked" mark from the SWE bench team.
I do not want to pay API charges or be limited to a fixed number of "credits" per month.
I updated to the latest version last night. Enjoyed seeing the process permission toggle (rwx). It was a refreshing change that should leave the security-minded folks less panicked amid all the agentic coding adoption :-)
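I imagine the toggle gates agent actions something like this (a minimal sketch under my own assumptions; I haven't read their code, and the names are made up):

```python
# Hypothetical rwx-style permission table a user could toggle per session.
PERMISSIONS = {"read": True, "write": False, "execute": False}

def gate(action: str) -> None:
    """Raise unless the user has toggled this permission on."""
    if not PERMISSIONS.get(action, False):
        raise PermissionError(f"agent action '{action}' not permitted")
```

With defaults like these the agent can inspect files but has to ask before writing or running anything, which is exactly the kind of guardrail that calms people down.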
The best submission on swe-bench-multilingual is Claude 3.7 Sonnet, which solves ~43% of the issues in the dataset.
https://news.ycombinator.com/item?id=44833929, my comment https://news.ycombinator.com/item?id=44835939