I think one of the things this confirms, for me at least, is that it's better to think of "the AI" as not just the LLM itself, but the whole cybernetic system of feedback loops joining the LLM and its harness. If improving the harness can make as much of a difference as improving the model itself, if not more, then the two have to be considered equally important. Not to mention that models are specifically reinforcement-learned to use harnesses, and harnesses are adapted to the needs of models in general or of specific models, so they necessarily develop together in a feedback loop. And in practice, as they operate, it is a deeply intertwined feedback loop where the entity that actually performs the useful work, and which you interact with, is really the complete system of the two together.
I think thinking like this could not only unlock quantitative performance improvements like the ones discussed in this blog post, but also help us conceive of the generative AI project as actually a project of neurosymbolic AI, even if the most capital-intensive and novel aspect is a neural network. Once we begin to think like that, it unlocks a lot of new options and more holistic thinking, and might increase research in the harness area.
I can say unironically that we haven't even tapped the full potential of GPT-4. The original one, from 2023. With no reasoning, no RL, no tool calling, no structured outputs, etc. (No MCP, ye gods!) Yes, it's possible to build coding agents with it!
I say this because I did!
Forcing yourself to make things work with older models forces you to keep things simple. You don't need 50KB of prompts. You can make a coding agent with GPT-4 and half a page of prompt.
Now, why would we do this? Well, these constraints force you to think differently about the problem. Context management becomes non-optional. Semantic compression (for Python it's as simple as `grep -r def .`) becomes non-optional. Bloating the prompt with infinite detail and noise... you couldn't if you wanted to!
Well, surely none of this is relevant today? It turns out all of it still is! E.g., as a small fix, the "grep def" trick (or your language's equivalent) can be trivially added as a startup hook to Claude Code, and suddenly it doesn't have to spend half your token budget poking around the codebase, because -- get this -- it can just see where everything is... (What a concept, right?)
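A minimal sketch of that "code map" idea in Python (the function name and output format are my own, not anything Claude Code ships):

```python
import re
import pathlib

def code_map(root: str) -> str:
    """Return a compact file:line outline of defs/classes in Python sources.

    Dumping this once at startup lets the model "see where everything is"
    instead of grepping around the codebase.
    """
    entries = []
    for path in sorted(pathlib.Path(root).rglob("*.py")):
        for n, line in enumerate(path.read_text(errors="ignore").splitlines(), 1):
            if re.match(r"\s*(def|class)\s+\w+", line):
                entries.append(f"{path}:{n}: {line.strip()}")
    return "\n".join(entries)
```

For other languages you'd swap the regex for the equivalent of "grep def" (or use a real parser), but the half-page-of-prompt spirit is the same.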
-- We can also get into "If you let the LLM design the API then you don't need a prompt because it already knows how it should work", but... we can talk about that later ;)
> grep def
Once you get to a codebase beyond a certain size, that no longer works.
I for one have found Serena https://github.com/oraios/serena , which you can install from right within Claude, to be a fairly fantastic code-interaction tool for LLMs. Both semantic search and editing, and with way less token churn.
If I do things for the love of it, the rules are different, of course. But otherwise I will simply always accept that there are many things improving around me that I have no intimate knowledge of and probably never will; I let other people work them out and happily lean on their work to do the next thing I care about that is not already solved.
I would run it by calling the AWS Bedrock API through the AWS CLI. Self-iterating and simple. All execution history directly embedded within.
Soon after, I added a help switch/command to each script, so that they act like MCP. To this day, they outperform any prompts one can make.
Absolutely. I always advocate that our developers have to test on older / slower machines. That gives them direct (painful) feedback when things run slow. Optimizing whatever you build for an older "something" (LLM model, hardware) will make it excel on more modern somethings.
Hahaha yeah. This is very true. I find myself making ad hoc versions of this in static markdown files to get around it. Just another example of the kind of low hanging fruit harnesses are leaving on the table. A version of this that uses tree sitter grammars to map a codebase, and does it on every startup of an agent, would be awesome.
> My Weird Hill is that we should be building things with GPT-4.
I disagree, IMO using the best models we have is a good way to avoid wasting time, but that doesn't mean we shouldn't also be frugal and clever with our harnesses!
"Suggesting that a comment was generated by an LLM without evidence adds little to a discussion and in fact deflects from the point being made. Please refrain from this."
OpenAI used early versions of GPT-5.3-Codex to: debug its own training process, manage its deployment and scaling and diagnose test results and evaluation data.
The Claude Code team has shipped 22 PRs in a single day and 27 the day before, with 100% of the code in each PR generated entirely by Claude Code.
You can tell...
That’s when the future really starts hitting you.
> Here is why that is backwards. I just showed that a different edit format improves their own models by 5 to 14 points while cutting output tokens by ~20%. That’s not a threat. It’s free R&D.
He makes it sound like he got a 5-14% boost on a top-level benchmark, not a 5% improvement on a narrow find-and-replace metric. Anecdotally, I don't usually have a lot of issues with editing in Claude Code or Cursor, and if there is an issue the model corrects it.
Assuming that it costs double the tokens when it has to correct itself, and find and replace errors are as prominent in actual day to day use as his benchmark, we're talking a 5% efficiency gain in editing token use (not reasoning or tool use). Given that editing must be less than 1/3 of the token use (I assume much less?), we're talking an overall efficiency gain of less than 1%.
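The estimate above, run as a quick back-of-envelope in Python (every input is one of this comment's assumed numbers, not a measured figure):

```python
# Back-of-envelope token-efficiency estimate (all inputs are assumptions).
edit_failure_rate = 0.05       # ~5% of edits need a retry
retry_cost_multiplier = 2.0    # a failed edit costs double the tokens
edit_token_share = 1 / 3       # editing's share of total tokens (upper bound)

# Tokens saved within editing if retries disappear entirely:
editing_savings = edit_failure_rate * (retry_cost_multiplier - 1.0)
overall_savings = editing_savings * edit_token_share
print(round(overall_savings, 4))  # prints 0.0167
```

So even with a generous 1/3 editing share, the overall gain is under 2%; with a more realistic share it drops below 1%.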
This seems like a promising technique, but maybe not a high priority among efficiency gains for these tools. The messianic tone (like assuming that Google cut off his access to suppress his genius editing technique, rather than just because he was hammering their API) also leaves a bad taste, along with the rampant and blatant ChatGPTisms in the blog post.
Not sure what they're calculating, but this seems to me like it could be many times more efficient than 20%.
Most harnesses already have rather thorough solutions for this problem but new insights are still worth understanding.
That's not a human. It's AI slop.
> Often the model isn’t flaky at understanding the task. It’s flaky at expressing itself. You’re blaming the pilot for the landing gear.
> The model is the moat. The harness is the bridge. Burning bridges just means fewer people bother to cross. Treating harnesses as solved, or even inconsequential, is very short-sighted.
> The gap between “cool demo” and “reliable tool” isn’t model magic. It’s careful, rather boring, empirical engineering at the tool boundary.
Codex does in fact use a schema for constrained sampling, it's here: https://github.com/openai/codex/blob/main/codex-rs/core/src/...
The model still has to produce an exact match, or at least I didn't read the code closely enough to see whether any fuzzy matching is used.
Note that the two Codex models were the only ones doing worse with the author's proposed format. The author found them doing better with replace than with apply patch, but since the author appears to be unaware that they use a schema for constrained sampling, I think a more realistic benchmark would enable constrained sampling for the apply test.
But this article hints at deeper wins to be had. Consider that these models are operating on source code, which is a verbose, noisy, textual serialization of the intended syntax / semantic trees. TFA improves accuracy by retro-fitting some structure onto the text. But what if models could operate directly on these underlying structures themselves?
As a data point, there are projects like OpenRewrite, which encode a ton of information, from formatting to types with globally resolved dependencies for each symbol in what they call a "Lossless Semantic Tree", so that there is ~0 ambiguity about the code. When I worked with OpenRewrite (in the era before LLMs, how quaint!) compared to other tools, it produced the best results for code transformations with the highest fidelity to the surrounding code.
Now imagine if the agent has access to such detailed information. It would not have to waste tokens figuring incidental things out like formatting. Although I haven't tested it out myself, I believe Moderne (the maintainers of OpenRewrite) when they say that agents armed with LST-based tools make extremely accurate changes.
This is essentially the same reason why the answer to "Which is better, Vim or Emacs?" is "IntelliJ."
Now consider that these models are STILL operating on text as an input and output mode! What if they were multi-modally trained on source code and docs and their syntax / semantic trees? I don't even know what this would look like, but I'd bet this would produce the most accurate coding models ever -- probably neurosymbolic in the truest sense.
When I was reading the Opus 4.6 launch post, they mentioned the same thing and their TerminalBench score was based on using Terminus 2 and not CC.
0. https://mariozechner.at/posts/2025-11-30-pi-coding-agent/
Ultimately the market is going to force them to open up and let people flex their subs.
I’ll probably get downvoted for this, but am I the only one who thinks it’s kind of wild how much anger is generated by these companies offering discounted plans for use with their tools?
At this point, there would be less anger and outrage on HN if they all just charged us the same high per-token rate and offered no discounts or flat rate plans.
Like most things - assume the "20/100/200" dollar deals that are great now are going to go down the enshitification route very rapidly.
Even if the "limits" on them stay generous, the product will start shifting to prioritize things the user doesn't want.
Tool recommendations are my immediate and near term fear - paid placement for dev tools both at the model level and the harness level seem inevitable.
---
The right route is open models and open harnesses, ideally on local hardware.
https://github.com/jahala/tilth
It's on npm and cargo:
- cargo install tilth
- npx tilth
then tilth install claude-code/windsurf/cursor --edit
(--edit flag is needed)
I made "tilth" a few days ago, since I'm consistently trying to get the LLMs to use tools more efficiently and spend less tokens doing it -- original tilth post from Monday: https://news.ycombinator.com/item?id=46952321
(Already published on cargo, on npm in a few mins).
Instead of cat + grep + manual line counting, one tool call returns a structural outline of a large file, lets you drill into sections, and since this last update also returns hashline-anchored output that an edit tool can target.
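For readers wondering what hashline anchoring looks like, here is a rough sketch of the idea in Python; the names and format are illustrative, not tilth's actual implementation:

```python
import hashlib

def hashline_view(text: str, width: int = 6) -> str:
    """Render a file with a short per-line content hash the model can cite."""
    return "\n".join(
        f"{hashlib.sha1(line.encode()).hexdigest()[:width]}|{line}"
        for line in text.splitlines()
    )

def apply_hash_edit(text: str, anchor: str, replacement: str) -> str:
    """Replace the unique line whose content hash starts with `anchor`.

    Because the anchor is derived from content, stale line numbers or
    shifted lines can't silently target the wrong location.
    """
    lines = text.splitlines()
    hits = [i for i, line in enumerate(lines)
            if hashlib.sha1(line.encode()).hexdigest().startswith(anchor)]
    if len(hits) != 1:
        raise ValueError("Content mismatch. Reread the file.")
    lines[hits[0]] = replacement
    return "\n".join(lines)
```

The edit tool only accepts anchors that still match the file's current content, so a drifted context fails loudly instead of corrupting the file.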
As Emacs has a built-in tree-sitter package, I implemented this same idea. I created gptel tools like tree_sitter_list_nodes, tree_sitter_get_nodes, tree_sitter_update_nodes, tree_sitter_insert_before_node and tree_sitter_insert_after_node. The "list" tool returns a list of AST nodes with first line number, first line content and node hash. The LLM can then use "get" to collect interesting nodes in their entirety and "update" to update a list of nodes identified by hash with new content (var/function bodies).
Worked like a charm.
Implementation: https://github.com/cellux/dotfiles/blob/master/.emacs.d/rb-t...
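For those without Emacs handy, the same node-listing idea can be sketched in Python, using the stdlib ast module as a stand-in for tree-sitter (the naming mimics the tools above, but everything else is my own):

```python
import ast
import hashlib

def list_nodes(source: str):
    """tree_sitter_list_nodes-style tool: for each top-level def/class,
    return (node_hash, first_line_number, first_line_content).

    The model collects hashes from this listing, then fetches or updates
    whole nodes by hash instead of line-by-line editing.
    """
    lines = source.splitlines()
    nodes = []
    for node in ast.parse(source).body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            segment = ast.get_source_segment(source, node) or ""
            digest = hashlib.sha1(segment.encode()).hexdigest()[:8]
            nodes.append((digest, node.lineno, lines[node.lineno - 1]))
    return nodes
```

A real tree-sitter version would work for any language with a grammar, but the tool contract (list nodes, address them by hash, rewrite whole bodies) is the same.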
Agents waste a lot of tokens on editing, sandboxes, passing info back and forth from tool calls and subagents.
Love the pragmatic mix of content based addressing + line numbers. Beautiful.
With CC you can do a /cost to see how much your session cost in dollar terms, that's a good benchmark IMO for plugins, .md files for agents, and so on. Minimize the LLM cost in the way you'd minimize typical resource usage on a computer like cpu, ram, storage etc.
Are they portable bit by bit back to pi, or are there enough differences that they can't be? How about normal pi extensions, can they be used in omp?
Some of the stuff definitely looks interesting.
* Subscriptions are oversubscribed. They know how much an “average” Claude Code user actually consumes to perform common tasks and price accordingly. This is how almost all subscription products work.
* There is some speculation that there is cooperative optimization between the harness and backend (cache related etc).
* Subscriptions are subsidized to build market share; to some extent the harnesses are “loss leader” halo products which drive the sales of tokens, which are much more profitable.
I don’t believe it’s unique or new that companies will revoke access if you are using an unpublished API that their apps use. I don’t see anything wrong with it myself. If you want, pay for normal token use on the published APIs. There is no expectation that you can use an application’s APIs, even as a paid user, when they are not published explicitly for usage.
It's truly disgusting.
It’s because they want to study you.
They want the data!
Underscores the importance of sovereign models you can run on the edge, finetune yourself, and run offline. At State of Utopia, we're working on it!
Also, nice clever optimization here. Lots of low hanging fruit in harness land.
Back when I was maintaining a coding harness, around the time of Claude 3.5, we tried hash prefixes, we tried line-number prefixes, we tried a lot of different approaches to making the model better at selecting edit blocks, and ultimately, at least back then, fuzzy string matching won out.
We got lines-with-anchors working fine as a replacement strategy, the problem was that when you don't make the model echo what it's replacing, it's literally dumber at writing the replacement; we lost more in test failures + retries than we gained in faster outputs.
Makes sense when you think about how powerful the "think before answering" principle is for LLMs, but it's still frustrating.
I keep asking myself “could my friends and family be handed this and be expected to build what I’m building on them” and the answer is an immediate “absolutely not”. Could a non technical manager use these tools do build what I’m building? Absolutely not. And when I think about it, it’s for the exact same reason it’s always been… they just aren’t a developer. They just don’t “think” in the way required to effectively control a computer.
LLMs are just another way to talk to a machine. They aren’t magic. All the same fundamental principles that apply to properly telling a machine what to do still apply. It’s just a wildly different mechanism.
That all being said, I think these things will dramatically speed up the pace at which software eats the world. Put LLMs into a good harness and holy shit, it’s like a superpower… but to get those superpowers unlocked you still have to know the basics, same as before. I think this applies to all other trades too. If you are a designer you still have to know what good design is and how to articulate it. Data scientists still need to understand the basics of their trade… these tools just give them superpowers.
Whether or not this assertion remains true in two or three years remains to be seen, but look at the most popular tool: Claude Code is a command-line tool! Their GUI version is pretty terrible in comparison. Cursor is an IDE fork of VS Code.
These are highly technical tools requiring somebody that knows file systems, command lines, basic development like compilers, etc. they require you to know a lot of stuff most people simply don’t. The direction I think these tools will head is far closer to highly sophisticated dev tooling than general purpose “magic box” stuff that your parents can use to… I dunno… vibe code the next hit todo app.
It’s disheartening that programmers are using this advanced, cutting-edge technology with such a backwards, old-fashioned approach.[1]
Code generation isn’t a higher level abstraction. It’s the same level but with automation.
See [1]. I’m open to LLMs or humans+LLMs creating new abstractions. Real abstractions that hide implementation details and don’t “leak”. Why isn’t this happening?
Truly “vibe coding” might also get the same job done. In the sense of: you only have to look at the generated code for reasons like how a C++ programmer looks at the assembly. Not to check if it is even correct. But because there are concerns beyond just the correctness like code gen size. (Do you care about compiler output size? Sometimes. So sometimes you have to look.)
I will still opt for a scriptable shell. A few scripts, and I have a custom interface that can be easily composed. And could be run on a $100 used laptop from ebay.
Instead I now use Damerau-Levenshtein distance to match the text to be replaced, and if the similarity is over some threshold the edit goes through.
Really works well because it's explicit. Forcing the model to emit the source tokens to be replaced seems to improve things.
https://github.com/day50-dev/sidechat/blob/db9c8f9d834967442...
It will often chomp whitespace differently, but the main problems are:
1. Positional alignment, with the lines being tracked (hashes fix that)
2. Content alignment, keeping the model from losing focus (Hamming/Levenshtein or other similarity scores fix that)
If we demand exact matches we're simply not going to get them.
(Combining both methods might be good, I hadn't thought of that)
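A rough Python sketch of the similarity-threshold approach, using the stdlib difflib ratio as a stand-in for Damerau-Levenshtein (the names and threshold are illustrative, not the linked implementation):

```python
import difflib

def fuzzy_replace(source: str, find: str, replacement: str,
                  threshold: float = 0.9) -> str:
    """Apply a search-replace edit, tolerating small mismatches.

    Slides a window the size of `find` over the file, scores each window
    with difflib, and replaces the best match if it clears the threshold.
    """
    src_lines = source.splitlines()
    n = len(find.splitlines())
    best_score, best_at = 0.0, -1
    for i in range(len(src_lines) - n + 1):
        window = "\n".join(src_lines[i:i + n])
        score = difflib.SequenceMatcher(None, window, find).ratio()
        if score > best_score:
            best_score, best_at = score, i
    if best_score < threshold:
        # Descriptive remediation beats a bare failure.
        raise ValueError("Content mismatch. Reread the file.")
    return "\n".join(src_lines[:best_at] + replacement.splitlines()
                     + src_lines[best_at + n:])
```

Whitespace-chomping (extra spaces, trailing blanks) now scores ~0.9 instead of failing outright, while genuinely wrong targets still get rejected.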
Another crucial point: the error line "Content mismatch. Reread the file" matters. Errors should suggest descriptive remediation actions.
So even with crappy models it does this automatically and will tool loop accordingly.
Asking it to do smaller edits is no good. Many smaller models will go down to single-line edits, look around for blank lines, and just inject garbage. So don't suggest it. Larger models, which can pull this off, already know to do it; smaller models, which can't, will only attempt it if you suggest it.
Seriously this thing works with 4B models
I also combine it with a toolcall hack for models that don't support tool calling
https://github.com/day50-dev/sidechat/blob/db9c8f9d834967442...
It injects the tool description in the system prompt after probing the capabilities and then does a simple response router.
I haven't found a model within reason that this doesn't work with (I'm sure if you intentionally throw some fine tune botch up that's emitting garbage it'll break - that's not the claim)
YMMV, works for me™
Problem is, replace has been around for so long, most LLMs are tuned for it now
In this era, we should build these kinds of tools for problems we know are straightforward ones you can’t get smarter than, even as intelligence continues to advance. Using tools like "bash" or command-line interfaces originally designed for humans is a good initial approach, since we can essentially reuse much of what was built for human use. Later, we can optimize specifically for machines, either accounting for their different cognitive structures (e.g., the ability to memorize extremely long contexts compared to humans) or adapting to the stream-based input/output patterns of current autoregressive token generators.
Eventually, I believe machine intelligence will build their own tools based on these foundations, likely a similar kind of milestone to when humans first began using tools.
For them I think it would be optimal to provide a tag per function and trust the LLM to rewrite the whole function. As the article notes, full reproduction is generally more reliable than editing for short code.
The token and attention overhead from a per-line hash, I suspect, limits this approach for smaller models.
[1]: the README.md describes the Context Bonsai features in my fork here: https://github.com/Vibecodelicious/opencode
I just saw a [paper](https://arxiv.org/pdf/2602.05447) that investigated similar aspects of TOON (which aims to reduce JSON tokens), and they found that even though TOON itself reduced the number of tokens, LLMs were less familiar with it, and thus spent even more tokens trying to decipher it, or making mistakes (see section 4.5, figures 6 and 7).
From the paper:

> Unlike Markdown, where each grep hit simply returned more text, TOON's overhead was driven by a combination of output density and additional tool calls from pattern unfamiliarity
----

There's a strangeness tax with LLMs, and it can be substantial.
I would not be surprised at all if this technique turned out to be only a local minimum, with detrimental global effects.
The post’s framing is right but undersells what the harness actually does in production. It’s your trust layer: what can the model touch, what can’t it, how cheaply do you recover when it gets something wrong. We spend something like 70% of engineering time on the recovery path, not the inference. Whether that ratio is right I’m not sure, but it’s where we’ve ended up.
On MCP overhead downthread: real, yes. In regulated environments you need the audit trail and the kill switch, and a tool boundary is how you get those. The unsolved part is keeping the protocol thin enough that you’re not burning tokens on ceremony.
I remember some papers about earlier models having around 15% prompt variability, and with different tool use there are sometimes even more significant jumps. If I remember correctly, the reasoning models improve on some of this because a lot of the early prompting tricks are baked into them, like "thinking step-by-step", "think carefully" and other "magic" methods. Another trick is to ask the model to rephrase the prompt in its own words, because that may produce a prompt that better aligns with its training data. For sure the big model developers are aware of these and are constantly improving, I just don't see much discussion or numbers about it.
Even a sub-par LLM, put into a context where it has access to unix tools and network and files etc, is vastly more capable than the best LLM chatbot.
If smaller labs (Zai, Moonshot, deepseek, mistral..) get together and embrace a harness, like opencode for example, as a consortium just by the power of "evolution across different environments" they might hit jackpot earlier than bigger labs.
Someone has to do the baseline training, development, and innovation. It can't be clones all the way down.
I see a lot of evidence to the contrary though. Anyone know what the underlying issue here is?
Like a good programming language, a good harness offers a better affordance for getting stuff done.
Even if we put correctness aside, tooling that saves time and tokens is going to be very valuable.
It's completely understandable that prompting in better/more efficient means would produce different results.
> it can use code-centred tools like find_symbol, find_referencing_symbols and insert_after_symbol.
I feel I want to write my own, and maybe in the future a lot of developers will have custom, highly personalized harnesses, as each user of these models wants to use them in a way that's unique to their brain. Much like how Emacs is so great for customization, but one person's Emacs config is often not what another wants, or they only want a subset and then write their own features.
As an aside, what is the feeling on all the various AI coding tools? Does aider suck, are aider-ce/cecli better, or are the bespoke tools for each model, like Claude Code, better?
2) AFAIK the $20/month plan allows use of more tokens per month than if you bought $20 of tokens. my understanding is it assumes most users will only use a fraction of that each month, and they rake in profit (like a gym membership)
With search-replace you can work on separate parts of a file independently with the LLM. Not to mention that with line-based edits, each edit shifts all the lines below, so you then need to provide the LLM with the whole content again.
Have you tested followup edits on the same files?
You probably don't want to use the line number though unless you need to disambiguate
But your write tool implementation can take care of that
Edit: checking ohmypi, the model has access to str_replace too, so this is just an edit tool.
Would also be worth having special tokens for this kind of navigation.
Seeing how bad the results are when you're casually approaching something makes it very evident that it's a topic that can be optimized.
So, the challenge is actually to find a map from "problem" to "author", then from "author" to "related code", and from there to a solution.
Companies are still stuck in this mindset conflating software engineering with puzzle-solving. This is evident from their job interviews and also these LLM benchmarks.
"You're absolutely right!"
At this point I'd take a contract with Anthropic to have Claude code pick better tooling.
Is it possible that burning extra tokens is the point, since they get paid more?
I'd love to use a different harness-- ideally an OSS one-- and hook it up to whichever LLM provides the best bang for the buck rather than being tied to Claude.
Is anyone else worried at how easily Anthropic/Google/OpenAI can basically cut you off if you do something they don't like?
Yeah, had that thought here a few weeks ago on HN after reading about someone getting cut off from Claude:
https://news.ycombinator.com/item?id=46723384#46728649
Though tbh I'm far more worried about the societal impacts of large scale job displacement across so many professional industries at the same time.
I think it is likely to be very, very ugly for society in the near term. Not because the problems are unsolvable, but because everyone is choosing to ignore the threat of them.
And I realize a lot of people will handwave my concerns away with stories of Luddites and Jevons paradox, but we've never had a tidal wave this big hit all at once, and I think the scale (combined with the speed of change) fundamentally changes things this time.
If that 40% is automated away in one go, there's no economy as we know it anymore. Either it acts as a negative void coefficient and moderates it into something sustainable, or it blows up.
It's less token heavy than the proposed hash approach, and I don't think frontier LLMs hallucinate line numbers if each line in the context is prefixed with them.
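The prefixing itself is trivial; a sketch, with an assumed format:

```python
def with_line_numbers(text: str) -> str:
    """Prefix each context line with its 1-based number so the model can
    target edits by line rather than by content hash."""
    return "\n".join(f"{i:4}| {line}"
                     for i, line in enumerate(text.splitlines(), 1))
```

The trade-off is the one mentioned upthread: line numbers shift after every edit, so the view must be refreshed between edits.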
Over a year ago I had a lot of issues, and adding a description and example was the difference between a 30-50% failure rate and 1%!
So I'm surprised a bit about the point. May be I'm missing it.
It took me some time to realise what people mean by it, originally confusing it with harvest.
If you google an image of it, maybe it’ll make sense
To my thinking, to orchestrate or steer suggests a conductor or driver, an outside entity providing direction. A master agent creating and directing subagents could reasonably be called an orchestrator.
A harness is what the horse wears to pull a cart, or what connects a pilot to a parachute and provides the controls to tug on and steer. It might provide guidance or capability, but not active direction. It's also a fairly common use in hardware (a wire harness) and software (a testing harness) already.
How about Kimi tho how can I play with it?
It had some bugs around whitespace replacements but the model seems happy with it now.
Thanks and keep it up! Shoutout to @jahala for tilth as well.
Context: I created hypertokens, an even more robust hashing mechanism for context-addressable memory (CAM). One cheat code is to make them prefix-free; there are lots of others that get deep into why models work the way they do, etc.
Models have improved dramatically even with the same harness
The way edits happen is that the agent (local) first tells the model (typically remote) that it has an edit tool (e.g. taking parameters file name, find string and replace string). If the model decides it wants to edit a file then it'll invoke this edit tool, which just results in a blob of JSON being put in the model's response specifying the edit (filename, etc). The agent then receives the response, intercepts this JSON blob, sees that it is an edit request and does what is asked.
The problem the article is describing is that the edit request (tool invocation) generated by the model isn't always 100% accurate. Even if the agent told the model it had a tool to invoke an actual editor, say sed, assuming the model knew how to use sed, this is still going to fail if the edit request cannot be interpreted literally by the editor (due to being inaccurate).
The trouble is though, because it's all indeterminate slop, every model will break in small ways, so you're back to indeterminacy and building a harness on top of the harness.
Still, <nerd snipe>, there's probably a way to get the local model and an arbitrary remote model to agree on how to make a method call. But that will only be fruitful if you find a highly reproducible set of tuples within the models' shared space.
Part of the problem though is that tools like Claude Code don't want to assume too much of the environment - that a specific editor is available, or even that it is running on a particular OS. The way it remains platform agnostic and not reliant on specific tools is by only having a dependency on Node.js, which provides file read/write support, so to implement an edit request the agent uses Node.js to read the file, itself implements the edit, then again uses Node.js to create the new updated file.
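A toy version of the agent-side loop described above (the tool name, parameters, and error strings are all illustrative, not any specific agent's schema):

```python
import json

def handle_tool_call(call_json: str) -> str:
    """The agent intercepts the model's JSON blob and performs the edit
    itself, using only plain file read/write, no external editor."""
    call = json.loads(call_json)
    if call.get("name") != "edit_file":
        return "error: unknown tool"
    path, find, replace = call["path"], call["find"], call["replace"]
    with open(path) as f:
        text = f.read()
    if find not in text:
        # The inaccurate-request failure mode: the model's find string
        # doesn't literally match the file.
        return "error: find string not present; reread the file"
    with open(path, "w") as f:
        f.write(text.replace(find, replace, 1))
    return "ok"
```

Everything OS-specific is hidden behind the runtime's file I/O, which is exactly how the agent stays platform-agnostic.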
I'll point out that if you want permission prompts for certain behavior, you have to add that yourself. There's at least one example.
Edit: Just noticed the article's author is using a fork of Pi.
[1]: https://shittycodingagent.ai/
[2]: https://github.com/badlogic/pi-mono/tree/main/packages/codin...
> Why bother, you ask? Opus may be a great model, but Claude Code to this day leaks raw JSONL from sub-agent outputs, wasting hundreds of thousands of tokens. I get to say, “fuck it, subagents output structured data now”.
This is why I find the banning of using Claude subscriptions in other harnesses is so heinous. Their harness that they're forcing onto everyone has tons of big issues including wasting massive numbers of tokens. Very much in line with intentionally refusing to adhere to standards in the most IE6 way possible.
read_toc tool:
...
{
"name": "mcp",
"qualified_name": "mcp",
"type": "constant",
"docstring": null,
"content_point": "src\\mcps\\code_help\\server.py::17::18::python::mcp",
"is_nested": false
},
{
"name": "handler",
"qualified_name": "handler",
"type": "constant",
"docstring": null,
"content_point": "src\\mcps\\code_help\\server.py::18::19::python::handler",
"is_nested": false
},
....

update_content tool:
{
"content": "...",
"content_point": "src\\mcps\\code_help\\server.py::18::19::python::handler",
"project_root": ....
}

The VC economics are creating a reality distortion field where Anthropic is incentivized to burn more tokens so they can rent more GPUs so they can get more investment, and where I am incentivized to pipe the LLM inputs into `claude -p` and blast 50KB of useless proompt onto it so they don't ban me from their 95% discounted API endpoint.
If you run this out, you realize that the Worse is Better paradox has inverted, it's an arbitrage, and the race is on.
>re "only" the harness changed
In our experience, AIs are like amnesiacs who can barely remember what they did three minutes ago (their last autonomous actions might still be in their context if you're lucky), with no chance of remembering what they did three days ago. As such, the "harness" determines their entire memory and is the single most important determinant of their outcome.
The best harness is a single self-contained, well-commented, obvious, and tiny code file, followed by a plain explanation of what it does and what it's supposed to do, the change request, how you want it done (you have to say it with so much force and confidence that the AI is afraid of getting yelled at if it does anything else), and a large amount of text devoted to asking the AI not to break what is already working. Followed by a request to write a test that passes. Followed by asking for its judgment about whether or not it broke what was already working. All in one tiny crisp prompt.
With such a harness, it's able to not break the code one time in twenty. If you use reverse psychology and ask it to do the opposite of what you want, it rises to fifty-fifty odds you'll get what you're trying to do.
Don't believe me? You can watch the livestream (see my previous comments).
Baby steps toward Utopia.