I think one of the things this confirms, for me at least, is that it's better to think of "the AI" as not just the LLM itself, but the whole cybernetic system of feedback loops joining the LLM and its harness. If improving the harness can make as much of a difference as improving the model itself, if not more, then the two have to be considered equally important. Not to mention that models are specifically reinforcement-learned to use harnesses, and harnesses are adapted to the needs of models in general or of specific models, so they necessarily develop together in a feedback loop. And in practice, as they operate, it is a deeply intertwined feedback loop where the entity that actually performs the useful work, and which you interact with, is really the complete system of the two together.
I think thinking like this could not only unlock quantitative performance improvements like the ones discussed in this blog post, but also help us conceive of the generative AI project as actually a project of neurosymbolic AI, even if the most capital-intensive and novel aspect is a neural network. Once we begin to think like that, it unlocks a lot of new options and more holistic thinking, and might increase research in the harness area.
I can say unironically that we haven't even tapped the full potential of GPT-4. The original one, from 2023. With no reasoning, no RL, no tool calling, no structured outputs, etc. (No MCP, ye gods!) Yes, it's possible to build coding agents with it!
I say this because I did!
Forcing yourself to make things work with older models forces you to keep things simple. You don't need 50KB of prompts. You can make a coding agent with GPT-4 and half a page of prompt.
Now, why would we do this? Well, these constraints force you to think differently about the problem. Context management becomes non-optional. Semantic compression (for Python it's as simple as `grep -r def .`) becomes non-optional. Bloating the prompt with infinite detail and noise... you couldn't if you wanted to!
Well, surely none of this is relevant today? It turns out all of it still is! E.g., as a small fix, the "grep def" trick (or your language's equivalent) can be trivially added as a startup hook to Claude Code, and suddenly it doesn't have to spend half your token budget poking around the codebase, because -- get this -- it can just see where everything is... (What a concept, right?)
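A minimal sketch of that "code map" idea in Python (the function name and output format are my own, not anything Claude Code ships):

```python
import re
import pathlib

def code_map(root: str) -> str:
    """Return a compact file:line outline of defs/classes in Python sources.

    Dumping this once at startup lets the model "see where everything is"
    instead of grepping around the codebase.
    """
    entries = []
    for path in sorted(pathlib.Path(root).rglob("*.py")):
        for n, line in enumerate(path.read_text(errors="ignore").splitlines(), 1):
            if re.match(r"\s*(def|class)\s+\w+", line):
                entries.append(f"{path}:{n}: {line.strip()}")
    return "\n".join(entries)
```

For other languages you'd swap the regex for the equivalent of "grep def" (or use a real parser), but the half-page-of-prompt spirit is the same.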
-- We can also get into "If you let the LLM design the API then you don't need a prompt because it already knows how it should work", but... we can talk about that later ;)
> grep def
Once you get to a codebase beyond a certain size, that no longer works.
I for one have found Serena https://github.com/oraios/serena , which you can install from right within Claude, to be a fairly fantastic code-interaction tool for LLMs. Both semantic search and editing, and with way less token churn.
If I do things for the love of it, the rules are different, of course. But otherwise I will simply always accept that there are many things improving around me that I have no intimate knowledge of and probably never will; I let other people work them out and happily lean on their work to do the next thing I care about that is not already solved.
I would run it by calling the AWS Bedrock API through the AWS CLI. Self-iterating and simple. All execution history directly embedded within.
Soon after, I added a help switch/command to each script, so that they act like MCP. To this day, they outperform any prompts one can make.
Absolutely. I always advocate that our developers have to test on older / slower machines. That gives them direct (painful) feedback when things run slow. Optimizing whatever you build for an older "something" (LLM model, hardware) will make it excel on more modern somethings.
Hahaha yeah. This is very true. I find myself making ad hoc versions of this in static markdown files to get around it. Just another example of the kind of low hanging fruit harnesses are leaving on the table. A version of this that uses tree sitter grammars to map a codebase, and does it on every startup of an agent, would be awesome.
> My Weird Hill is that we should be building things with GPT-4.
I disagree, IMO using the best models we have is a good way to avoid wasting time, but that doesn't mean we shouldn't also be frugal and clever with our harnesses!
"Suggesting that a comment was generated by an LLM without evidence adds little to a discussion and in fact deflects from the point being made. Please refrain from this."
OpenAI used early versions of GPT-5.3-Codex to: debug its own training process, manage its deployment and scaling and diagnose test results and evaluation data.
The Claude Code team has shipped 22 PRs in a single day and 27 the day before, with 100% of the code in each PR generated entirely by Claude Code.
You can tell...
That’s when the future really starts hitting you.
> Here is why that is backwards. I just showed that a different edit format improves their own models by 5 to 14 points while cutting output tokens by ~20%. That’s not a threat. It’s free R&D.
He makes it sound like he got a 5-14% boost on a top-level benchmark, not a 5% improvement on a narrow find-and-replace metric. Anecdotally, I don't usually have a lot of issues with editing in Claude Code or Cursor, and if there is an issue the model corrects it.
Assuming that it costs double the tokens when it has to correct itself, and find and replace errors are as prominent in actual day to day use as his benchmark, we're talking a 5% efficiency gain in editing token use (not reasoning or tool use). Given that editing must be less than 1/3 of the token use (I assume much less?), we're talking an overall efficiency gain of less than 1%.
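The estimate above, run as a quick back-of-envelope in Python (every input is one of this comment's assumed numbers, not a measured figure):

```python
# Back-of-envelope token-efficiency estimate (all inputs are assumptions).
edit_failure_rate = 0.05       # ~5% of edits need a retry
retry_cost_multiplier = 2.0    # a failed edit costs double the tokens
edit_token_share = 1 / 3       # editing's share of total tokens (upper bound)

# Tokens saved within editing if retries disappear entirely:
editing_savings = edit_failure_rate * (retry_cost_multiplier - 1.0)
overall_savings = editing_savings * edit_token_share
print(round(overall_savings, 4))  # prints 0.0167
```

So even with a generous 1/3 editing share, the overall gain is under 2%; with a more realistic share it drops below 1%.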
This seems like a promising technique, but maybe not a high priority among efficiency gains for these tools. The messianic tone (like assuming that Google cut off his access to suppress his genius editing technique, rather than just because he was hammering their API) also leaves a bad taste, along with the rampant and blatant ChatGPTisms in the blog post.
Not sure what they're calculating, but this seems to me like it could be many times more efficient than 20%.
Most harnesses already have rather thorough solutions for this problem but new insights are still worth understanding.
That's not a human. It's AI slop.
> Often the model isn’t flaky at understanding the task. It’s flaky at expressing itself. You’re blaming the pilot for the landing gear.
> The model is the moat. The harness is the bridge. Burning bridges just means fewer people bother to cross. Treating harnesses as solved, or even inconsequential, is very short-sighted.
> The gap between “cool demo” and “reliable tool” isn’t model magic. It’s careful, rather boring, empirical engineering at the tool boundary.
Codex does in fact use a schema for constrained sampling, it's here: https://github.com/openai/codex/blob/main/codex-rs/core/src/...
The model still has to produce an exact match, or at least I didn't read the code closely enough to see whether any fuzzy matching is used.
Note that the two Codex models were the only ones doing worse with the author's proposed format. The author found them doing better with replace than with apply patch, but since the author appears to be unaware that they use a schema for constrained sampling, I think a more realistic benchmark would enable constrained sampling for the apply test.
But this article hints at deeper wins to be had. Consider that these models are operating on source code, which is a verbose, noisy, textual serialization of the intended syntax / semantic trees. TFA improves accuracy by retro-fitting some structure onto the text. But what if models could operate directly on these underlying structures themselves?
As a data point, there are projects like OpenRewrite, which encode a ton of information, from formatting to types with globally resolved dependencies for each symbol in what they call a "Lossless Semantic Tree", so that there is ~0 ambiguity about the code. When I worked with OpenRewrite (in the era before LLMs, how quaint!) compared to other tools, it produced the best results for code transformations with the highest fidelity to the surrounding code.
Now imagine if the agent has access to such detailed information. It would not have to waste tokens figuring incidental things out like formatting. Although I haven't tested it out myself, I believe Moderne (the maintainers of OpenRewrite) when they say that agents armed with LST-based tools make extremely accurate changes.
This is essentially the same reason why the answer to "Which is better, Vim or Emacs?" is "IntelliJ."
Now consider that these models are STILL operating on text as an input and output mode! What if they were multi-modally trained on source code and docs and their syntax / semantic trees? I don't even know what this would look like, but I'd bet this would produce the most accurate coding models ever -- probably neurosymbolic in the truest sense.
When I was reading the Opus 4.6 launch post, they mentioned the same thing and their TerminalBench score was based on using Terminus 2 and not CC.
0. https://mariozechner.at/posts/2025-11-30-pi-coding-agent/
Ultimately the market is going to force them to open up and let people flex their subs.
I’ll probably get downvoted for this, but am I the only one who thinks it’s kind of wild how much anger is generated by these companies offering discounted plans for use with their tools?
At this point, there would be less anger and outrage on HN if they all just charged us the same high per-token rate and offered no discounts or flat rate plans.
Like most things - assume the "20/100/200" dollar deals that are great now are going to go down the enshitification route very rapidly.
Even if the "limits" on them stay generous, the product will start shifting to prioritize things the user doesn't want.
Tool recommendations are my immediate and near term fear - paid placement for dev tools both at the model level and the harness level seem inevitable.
---
The right route is open models and open harnesses, ideally on local hardware.
https://github.com/jahala/tilth
It's on npm and cargo:
- cargo install tilth
- npx tilth
then tilth install claude-code/windsurf/cursor --edit
(--edit flag is needed)
I made "tilth" a few days ago, since I'm consistently trying to get the LLMs to use tools more efficiently and spend less tokens doing it -- original tilth post from Monday: https://news.ycombinator.com/item?id=46952321
(Already published on cargo, on npm in a few mins).
Instead of cat + grep + manual line counting, one tool call returns a structural outline of a large file, lets you drill into sections, and since this last update also returns hashline-anchored output that an edit tool can target.
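For readers wondering what hashline anchoring looks like, here is a rough sketch of the idea in Python; the names and format are illustrative, not tilth's actual implementation:

```python
import hashlib

def hashline_view(text: str, width: int = 6) -> str:
    """Render a file with a short per-line content hash the model can cite."""
    return "\n".join(
        f"{hashlib.sha1(line.encode()).hexdigest()[:width]}|{line}"
        for line in text.splitlines()
    )

def apply_hash_edit(text: str, anchor: str, replacement: str) -> str:
    """Replace the unique line whose content hash starts with `anchor`.

    Because the anchor is derived from content, stale line numbers or
    shifted lines can't silently target the wrong location.
    """
    lines = text.splitlines()
    hits = [i for i, line in enumerate(lines)
            if hashlib.sha1(line.encode()).hexdigest().startswith(anchor)]
    if len(hits) != 1:
        raise ValueError("Content mismatch. Reread the file.")
    lines[hits[0]] = replacement
    return "\n".join(lines)
```

The edit tool only accepts anchors that still match the file's current content, so a drifted context fails loudly instead of corrupting the file.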
As Emacs has a built-in tree-sitter package, I implemented this same idea. I created gptel tools like tree_sitter_list_nodes, tree_sitter_get_nodes, tree_sitter_update_nodes, tree_sitter_insert_before_node and tree_sitter_insert_after_node. The "list" tool returns a list of AST nodes with first line number, first line content and node hash. The LLM can then use "get" to collect interesting nodes in their entirety and "update" to update a list of nodes identified by hash with new content (var/function bodies).
Worked like a charm.
Implementation: https://github.com/cellux/dotfiles/blob/master/.emacs.d/rb-t...
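For those without Emacs handy, the same node-listing idea can be sketched in Python, using the stdlib ast module as a stand-in for tree-sitter (the naming mimics the tools above, but everything else is my own):

```python
import ast
import hashlib

def list_nodes(source: str):
    """tree_sitter_list_nodes-style tool: for each top-level def/class,
    return (node_hash, first_line_number, first_line_content).

    The model collects hashes from this listing, then fetches or updates
    whole nodes by hash instead of line-by-line editing.
    """
    lines = source.splitlines()
    nodes = []
    for node in ast.parse(source).body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            segment = ast.get_source_segment(source, node) or ""
            digest = hashlib.sha1(segment.encode()).hexdigest()[:8]
            nodes.append((digest, node.lineno, lines[node.lineno - 1]))
    return nodes
```

A real tree-sitter version would work for any language with a grammar, but the tool contract (list nodes, address them by hash, rewrite whole bodies) is the same.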
Agents waste a lot of tokens on editing, sandboxes, passing info back and forth from tool calls and subagents.
Love the pragmatic mix of content based addressing + line numbers. Beautiful.
With CC you can do a /cost to see how much your session cost in dollar terms, that's a good benchmark IMO for plugins, .md files for agents, and so on. Minimize the LLM cost in the way you'd minimize typical resource usage on a computer like cpu, ram, storage etc.
Are they portable bit by bit back to pi, or are there enough differences that they can't be? How about normal pi extensions, can they be used in omp?
Some of the stuff definitely looks interesting.
* Subscriptions are oversubscribed. They know how much an “average” Claude Code user actually consumes to perform common tasks and price accordingly. This is how almost all subscription products work.
* There is some speculation that there is cooperative optimization between the harness and backend (cache related etc).
* Subscriptions are subsidized to build market share; to some extent the harnesses are “loss leader” halo products which drive the sales of tokens, which are much more profitable.
I don’t believe it’s unique or new that companies will revoke access if you are using an unpublished API that their apps use. I don’t see anything wrong with it myself. If you want, pay for normal token use on the published APIs. There is no expectation that you can use an application’s APIs, even as a paid user, when they are not published explicitly for usage.
It's truly disgusting.
It’s because they want to study you.
They want the data!
Underscores the importance of sovereign models you can run on the edge, finetune yourself, and run offline. At State of Utopia, we're working on it!
Also, nice clever optimization here. Lots of low hanging fruit in harness land.
Back when I was maintaining a coding harness, around the time of Claude 3.5, we tried hash prefixes, we tried line-number prefixes, we tried a lot of different approaches to making the model better at selecting edit blocks, and ultimately, at least back then, fuzzy string matching won out.
We got lines-with-anchors working fine as a replacement strategy, the problem was that when you don't make the model echo what it's replacing, it's literally dumber at writing the replacement; we lost more in test failures + retries than we gained in faster outputs.
Makes sense when you think about how powerful the "think before answering" principle is for LLMs, but it's still frustrating.
I keep asking myself “could my friends and family be handed this and be expected to build what I’m building on them” and the answer is an immediate “absolutely not”. Could a non technical manager use these tools do build what I’m building? Absolutely not. And when I think about it, it’s for the exact same reason it’s always been… they just aren’t a developer. They just don’t “think” in the way required to effectively control a computer.
LLMs are just another way to talk to a machine. They aren’t magic. All the same fundamental principles that apply to properly telling a machine what to do still apply. It’s just a wildly different mechanism.
That all being said, I think these things will dramatically speed up the pace at which software eats the world. Put LLMs into a good harness and holy shit, it’s like a superpower… but to get those superpowers unlocked you still have to know the basics, same as before. I think this applies to all other trades too. If you are a designer you still have to know what good design is and how to articulate it. Data scientists still need to understand the basics of their trade… these tools just give them superpowers.
Whether or not this assertion remains true in two or three years remains to be seen, but look at the most popular tool: Claude Code is a command-line tool! Their GUI version is pretty terrible in comparison. Cursor is an IDE fork of VS Code.
These are highly technical tools requiring somebody that knows file systems, command lines, basic development like compilers, etc. they require you to know a lot of stuff most people simply don’t. The direction I think these tools will head is far closer to highly sophisticated dev tooling than general purpose “magic box” stuff that your parents can use to… I dunno… vibe code the next hit todo app.
It’s disheartening that programmers are using this advanced, cutting-edge technology with such a backwards, old-fashioned approach.[1]
Code generation isn’t a higher level abstraction. It’s the same level but with automation.
See [1]. I’m open to LLMs or humans+LLMs creating new abstractions. Real abstractions that hide implementation details and don’t “leak”. Why isn’t this happening?
Truly “vibe coding” might also get the same job done. In the sense of: you only have to look at the generated code for reasons like how a C++ programmer looks at the assembly. Not to check if it is even correct. But because there are concerns beyond just the correctness like code gen size. (Do you care about compiler output size? Sometimes. So sometimes you have to look.)
I will still opt for a scriptable shell. A few scripts, and I have a custom interface that can be easily composed. And could be run on a $100 used laptop from ebay.
Instead I now use Damerau-Levenshtein distance to match the text to be replaced, and if the similarity is over some threshold the edit goes through.
Really works well because it's explicit. Forcing the model to emit the source tokens to be replaced seems to improve things.
https://github.com/day50-dev/sidechat/blob/db9c8f9d834967442...
It will often chomp whitespace differently, but the main problems are:
1. Positional alignment, with the lines being tracked (hashes fix that)
2. Content alignment, keeping the model from losing focus (Hamming/Levenshtein or other similarity scores fix that)
If we demand exact matches we're simply not going to get them.
(Combining both methods might be good, I hadn't thought of that)
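A rough Python sketch of the similarity-threshold approach, using the stdlib difflib ratio as a stand-in for Damerau-Levenshtein (the names and threshold are illustrative, not the linked implementation):

```python
import difflib

def fuzzy_replace(source: str, find: str, replacement: str,
                  threshold: float = 0.9) -> str:
    """Apply a search-replace edit, tolerating small mismatches.

    Slides a window the size of `find` over the file, scores each window
    with difflib, and replaces the best match if it clears the threshold.
    """
    src_lines = source.splitlines()
    n = len(find.splitlines())
    best_score, best_at = 0.0, -1
    for i in range(len(src_lines) - n + 1):
        window = "\n".join(src_lines[i:i + n])
        score = difflib.SequenceMatcher(None, window, find).ratio()
        if score > best_score:
            best_score, best_at = score, i
    if best_score < threshold:
        # Descriptive remediation beats a bare failure.
        raise ValueError("Content mismatch. Reread the file.")
    return "\n".join(src_lines[:best_at] + replacement.splitlines()
                     + src_lines[best_at + n:])
```

Whitespace-chomping (extra spaces, trailing blanks) now scores ~0.9 instead of failing outright, while genuinely wrong targets still get rejected.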
Another crucial point: the error line "Content mismatch. Reread the file" matters. Errors should suggest descriptive remediation actions.
So even with crappy models it does this automatically and will tool loop accordingly.
Asking it to do smaller edits is no good. Many smaller models will go down to single-line edits, look around for blank lines, and just inject garbage. So don't suggest it. Larger models, which can pull this off, already know to do it; smaller models, which can't, will only attempt it if you suggest it.
Seriously this thing works with 4B models
I also combine it with a toolcall hack for models that don't support tool calling
https://github.com/day50-dev/sidechat/blob/db9c8f9d834967442...
It injects the tool description in the system prompt after probing the capabilities and then does a simple response router.
I haven't found a model within reason that this doesn't work with (I'm sure if you intentionally throw some fine tune botch up that's emitting garbage it'll break - that's not the claim)
YMMV, works for me™
Problem is, replace has been around for so long, most LLMs are tuned for it now
In this era, we should build these kinds of tools for problems we know are straightforward ones you can’t get smarter than, even as intelligence continues to advance. Using tools like "bash" or command-line interfaces originally designed for humans is a good initial approach, since we can essentially reuse much of what was built for human use. Later, we can optimize specifically for machines, either accounting for their different cognitive structures (e.g., the ability to memorize extremely long contexts compared to humans) or adapting to the stream-based input/output patterns of current autoregressive token generators.
Eventually, I believe machine intelligence will build their own tools based on these foundations, likely a similar kind of milestone to when humans first began using tools.
For them I think it would be optimal to provide a tag per function and trust the LLM to rewrite the whole function. As the article notes, full reproduction is generally more reliable than editing for short code.
The token and attention overhead from a per-line hash, I suspect, limits this approach for smaller models.
[1]: the README.md describes the Context Bonsai features in my fork here: https://github.com/Vibecodelicious/opencode
I just saw a [paper](https://arxiv.org/pdf/2602.05447) that investigated similar aspects of TOON (which aims to reduce JSON tokens), and they found that even though TOON itself reduced the number of tokens, LLMs were less familiar with it, and thus spent even more tokens trying to decipher it, or making mistakes (see section 4.5, figures 6 and 7).
From the paper:

> Unlike Markdown, where each grep hit simply returned more text, TOON's overhead was driven by a combination of output density and additional tool calls from pattern unfamiliarity
----

There's a strangeness tax with LLMs, and it can be substantial.
I would not be surprised at all if this technique turned out to be only a local minimum, with detrimental global effects.
The post’s framing is right but undersells what the harness actually does in production. It’s your trust layer: what can the model touch, what can’t it, how cheaply do you recover when it gets something wrong. We spend something like 70% of engineering time on the recovery path, not the inference. Whether that ratio is right I’m not sure, but it’s where we’ve ended up.
On MCP overhead downthread: real, yes. In regulated environments you need the audit trail and the kill switch, and a tool boundary is how you get those. The unsolved part is keeping the protocol thin enough that you’re not burning tokens on ceremony.
I remember some papers about earlier models having around 15% prompt variability, and with different tool use there are sometimes even more significant jumps. If I remember correctly, the reasoning models improve on some of this because a lot of the early prompting tricks are baked into them, like "thinking step-by-step", "think carefully" and other "magic" methods. Another trick is to ask the model to rephrase the prompt in its own words, because that may produce a prompt that better aligns with its training data. For sure the big model developers are aware of these and are constantly improving, I just don't see much discussion or numbers about it.
Even a sub-par LLM, put into a context where it has access to unix tools and network and files etc, is vastly more capable than the best LLM chatbot.
If smaller labs (Zai, Moonshot, deepseek, mistral..) get together and embrace a harness, like opencode for example, as a consortium just by the power of "evolution across different environments" they might hit jackpot earlier than bigger labs.
Someone has to do the baseline training, development, and innovation. It can't be clones all the way down.
I see a lot of evidence to the contrary though. Anyone know what the underlying issue here is?
Like a good programming language, a good harness offers a better affordance for getting stuff done.
Even if we put correctness aside, tooling that saves time and tokens is going to be very valuable.
It's completely understandable that prompting in better/more efficient means would produce different results.
> it can use code-centred tools like find_symbol, find_referencing_symbols and insert_after_symbol.
I feel I want to write my own, and maybe in the future a lot of developers will have custom, highly personalized harnesses, as each user of these models wants to use them in a way that's unique to their brain. Much like how Emacs is so great for customization, but one person's Emacs config is often not what another wants, or they only want a subset and then write their own features.
As an aside, what is the feeling on all the various AI coding tools? Does aider suck, are aider-ce/cecli better, or are the bespoke tools for each model, like Claude Code, better?
2) AFAIK the $20/month plan allows use of more tokens per month than if you bought $20 of tokens. my understanding is it assumes most users will only use a fraction of that each month, and they rake in profit (like a gym membership)
With search-replace you can work on separate parts of a file independently with the LLM. Not to mention that with line-based edits, each edit shifts all the lines below, so you then need to provide the LLM with the whole content again.
Have you tested followup edits on the same files?
You probably don't want to use the line number though unless you need to disambiguate
But your write tool implementation can take care of that
Edit: checking ohmypi, the model has access to str_replace too, so this is just an edit tool.
Would also be worth having special tokens for this kind of navigation.
Seeing how bad the results are when you're casually approaching something makes it very evident that it's a topic that can be optimized.
So, the challenge is actually to find a map from "problem" to "author", then from "author" to "related code", and from there to a solution.
Companies are still stuck in this mindset conflating software engineering with puzzle-solving. This is evident from their job interviews and also these LLM benchmarks.
"You're absolutely right!"
At this point I'd take a contract with Anthropic to have Claude code pick better tooling.
Is it possible that burning extra tokens is the point, since they get paid more?
I'd love to use a different harness-- ideally an OSS one-- and hook it up to whichever LLM provides the best bang for the buck rather than being tied to Claude.
Is anyone else worried at how easily Anthropic/Google/OpenAI can basically cut you off if you do something they don't like?
Yeah, had that thought here a few weeks ago on HN after reading about someone getting cut off from Claude:
https://news.ycombinator.com/item?id=46723384#46728649
Though tbh I'm far more worried about the societal impacts of large scale job displacement across so many professional industries at the same time.
I think it is likely to be very, very ugly for society in the near term. Not because the problems are unsolvable, but because everyone is choosing to ignore the threat of them.
And I realize a lot of people will handwave my concerns away with stories of Luddites and Jevons paradox, but we've never had a tidal wave this big hit all at once, and I think the scale (combined with the speed of change) fundamentally changes things this time.
If that 40% is automated away in one go, there's no economy as we know it anymore. Either it acts as a negative void coefficient and moderates it into something sustainable, or it blows up.
It's less token heavy than the proposed hash approach, and I don't think frontier LLMs hallucinate line numbers if each line in the context is prefixed with them.
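The prefixing itself is trivial; a sketch, with an assumed format:

```python
def with_line_numbers(text: str) -> str:
    """Prefix each context line with its 1-based number so the model can
    target edits by line rather than by content hash."""
    return "\n".join(f"{i:4}| {line}"
                     for i, line in enumerate(text.splitlines(), 1))
```

The trade-off is the one mentioned upthread: line numbers shift after every edit, so the view must be refreshed between edits.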
Over a year ago I had a lot of issues, and adding a description and example was the difference between a 30-50% failure rate and 1%!
So I'm surprised a bit about the point. May be I'm missing it.
It took me some time to realise what people mean by it, originally confusing it with harvest.
If you google an image of it, maybe it’ll make sense
To my thinking, to orchestrate or steer suggests a conductor or driver, an outside entity providing direction. A master agent creating and directing subagents could reasonably be called an orchestrator.
A harness is what the horse wears to pull a cart, or what connects a pilot to a parachute and provides the controls to tug on and steer. It might provide guidance or capability, but not active direction. It's also a fairly common use in hardware (a wire harness) and software (a testing harness) already.
How about Kimi tho how can I play with it?
It had some bugs around whitespace replacements but the model seems happy with it now.
Thanks and keep it up! Shoutout to @jahala for tilth as well.
Context: I created hypertokens, an even more robust hashing mechanism for context-addressable memory (CAM). One cheat code is to make them prefix-free; there are lots of others that get deep into why models work the way they do, etc.
Models have improved dramatically even with the same harness
The way edits happen is that the agent (local) first tells the model (typically remote) that it has an edit tool (e.g. taking parameters file name, find string and replace string). If the model decides it wants to edit a file then it'll invoke this edit tool, which just results in a blob of JSON being put in the model's response specifying the edit (filename, etc). The agent then receives the response, intercepts this JSON blob, sees that it is an edit request and does what is asked.
The problem the article is describing is that the edit request (tool invocation) generated by the model isn't always 100% accurate. Even if the agent told the model it had a tool to invoke an actual editor, say sed, assuming the model knew how to use sed, this is still going to fail if the edit request cannot be interpreted literally by the editor (due to being inaccurate).
The trouble is though, because it's all indeterminate slop, every model will break in small ways, so you're back to indeterminacy and building a harness on top of the harness.
Still, <nerd snipe>, there's probably a way to get the local model and an arbitrary remote model to agree on how to make a method call. But that will only be fruitful if you find a highly reproducible set of tuples within the models' shared space.
Part of the problem though is that tools like Claude Code don't want to assume too much of the environment - that a specific editor is available, or even that it is running on a particular OS. The way it remains platform agnostic and not reliant on specific tools is by only having a dependency on Node.js, which provides file read/write support, so to implement an edit request the agent uses Node.js to read the file, itself implements the edit, then again uses Node.js to create the new updated file.
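A toy version of the agent-side loop described above (the tool name, parameters, and error strings are all illustrative, not any specific agent's schema):

```python
import json

def handle_tool_call(call_json: str) -> str:
    """The agent intercepts the model's JSON blob and performs the edit
    itself, using only plain file read/write, no external editor."""
    call = json.loads(call_json)
    if call.get("name") != "edit_file":
        return "error: unknown tool"
    path, find, replace = call["path"], call["find"], call["replace"]
    with open(path) as f:
        text = f.read()
    if find not in text:
        # The inaccurate-request failure mode: the model's find string
        # doesn't literally match the file.
        return "error: find string not present; reread the file"
    with open(path, "w") as f:
        f.write(text.replace(find, replace, 1))
    return "ok"
```

Everything OS-specific is hidden behind the runtime's file I/O, which is exactly how the agent stays platform-agnostic.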
I'll point out that if you want permission prompts for certain behavior, you have to add that yourself. There's at least one example.
Edit: Just noticed the article's author is using a fork of Pi.
[1]: https://shittycodingagent.ai/
[2]: https://github.com/badlogic/pi-mono/tree/main/packages/codin...
> Why bother, you ask? Opus may be a great model, but Claude Code to this day leaks raw JSONL from sub-agent outputs, wasting hundreds of thousands of tokens. I get to say, “fuck it, subagents output structured data now”.
This is why I find the banning of using Claude subscriptions in other harnesses is so heinous. Their harness that they're forcing onto everyone has tons of big issues including wasting massive numbers of tokens. Very much in line with intentionally refusing to adhere to standards in the most IE6 way possible.
read_toc tool:
...
{
"name": "mcp",
"qualified_name": "mcp",
"type": "constant",
"docstring": null,
"content_point": "src\\mcps\\code_help\\server.py::17::18::python::mcp",
"is_nested": false
},
{
"name": "handler",
"qualified_name": "handler",
"type": "constant",
"docstring": null,
"content_point": "src\\mcps\\code_help\\server.py::18::19::python::handler",
"is_nested": false
},
....

update_content tool:
{
"content": "...",
"content_point": "src\\mcps\\code_help\\server.py::18::19::python::handler",
"project_root": ....
}

The VC economics are creating a reality distortion field where Anthropic is incentivized to burn more tokens so they can rent more GPUs so they can get more investment, and where I am incentivized to pipe the LLM inputs into `claude -p` and blast 50KB of useless proompt onto it so they don't ban me from their 95% discounted API endpoint.
If you run this out, you realize that the Worse is Better paradox has inverted, it's an arbitrage, and the race is on.
>re "only" the harness changed
In our experience, AIs are like amnesiacs who can barely remember what they did three minutes ago (their last autonomous actions might still be in their context if you're lucky), with no chance of remembering what they did three days ago. As such, the "harness" determines their entire memory and is the single most important determinant of their outcome.
The best harness is a single self-contained, well-commented, obvious, and tiny code file, followed by a plain explanation of what it does and what it's supposed to do, the change request, how you want it done (you have to say it with so much force and confidence that the AI is afraid of getting yelled at if it does anything else), and a large amount of text devoted to asking the AI not to break what is already working. Followed by a request to write a test that passes. Followed by asking for its judgment about whether or not it broke what was already working. All in one tiny crisp prompt.
With such a harness, it's able to not break the code one time in twenty. If you use reverse psychology and ask it to do the opposite of what you want, it rises to fifty-fifty odds you'll get what you're trying to do.
Don't believe me? You can watch the livestream (see my previous comments).
Baby steps toward Utopia.