This issue is representative of a larger problem. Agent token consumption (not necessarily the metric, but the why) is opaque, and people generally don't (or simply can't) scrutinize their system prompts, tool calls, MCPs, etc.
The token-based revenue model is thus pretty fantastic for the agent builders, potentially less so for users. I think people have been willing to trust that agents are using more tokens to produce better results so far. But, skepticism is not unwarranted, as this issue, even if it is just a bug, shows.
It could be deleting all of your files, it could be inserting vulnerabilities, you have no idea.
I think it's important for CC to also be able to make unit test code that might contain mild exploits, to test for security vulnerabilities.
The biggest complaint about vibe coding is that it's insecure. The funny part now is that if you DO try to secure it, you hit guardrails.
There is a contact form for Anthropic if you run into some of them on 4.6 at least.
I mean, I am sure they don't mean it but they have the incentive to burn as much tokens as they are allowed to get away with. Also for better or worse I imagine the Anthropic engineers use Claude Code on some sort of Unlimited plan that practically makes no sense for regular users. So adding a 100k tokens is not a big deal.
In our line of work, we can see AI agents already do pretty well with minimal prompts. Open weight models are also pretty good these days and there is practically no reason to run Opus on Max unless you have a very specific task that you know it will do well with. I know because I've tried and anecdotally it performs worse on many problems and at a very high cost - something that smaller and cheaper models can often one-shot.
It's because the subscriptions force you to do so. The subscriptions are the most economical way to use e.g. Claude by close to an order of magnitude. If you max out a 20x plan every week, doing the same work with the API would cost you well into the four figures.
Anyone already using the Claude API pricing and using CC over OpenCode is kneecapping themselves.
The immediate thing I've noticed: I get way more out of the codex $100 plan than I was getting out of the Anthropic $200. Like, probably 2x at least.
The other think I've noticed: when using strict guardrails, TDD, reviews etc. I cannot notice any quality difference. Not only between Opus and Codex but even between the most recent models - GPT 5.3 code, GPT 5.4, and now GPT 5.5.
Well, 5.5 uses a huge amount of my session limits. 5.3 is very light, 5.4 somewhere in between. So now I use 5.4 for the main session/debugging/planning and then execute with 5.3.
Regarding usage, of course, it's hard to say how much is the model and how much is coming from Claude code and all this ridiculous malware scanning.
But it's nice to use a lightweight harness like pi and see that even with all my personal instructions, a good bunch of skills, custom tools etc., if I start a session and say "hi" I'm starting out with about 15k of context used. I think a closely equivalent setup in CC would start at 30-40k context.
5.5 has been a noticeable improvement over 5.4, solving more complicated issues and faster too.
5.5 does not use a huge amount of my session limits with the $100 plan.
I use multiple conversations in parallel, all on xhigh effort with Fast on (2.5x consumption), and it’s still enough for me not to switch off Fast.
It also runs my tests, but I did not use TDD apart from sometimes telling it to cover an issue in a test before fixing it.
If you want to plug your API keys into a third-party harness, that's totally cool and honestly, I'm looking into doing that right now and I haven't used any of the first-party harnesses at all. But the first time I accidentally spend $300 in a day I may be thinking about how a $20/month plan might be pretty good even if performance is inconsistent, at least I know what my costs are.
It aligns the incentives for faster, cheaper, terse and more reliable models, because the model providers pay the wasted tokens and electricity costs.
Did you mean 100 billion tokens because 100k isn't a big deal at all?
However nobody is agreeing with that, that's how it's done, and move faster faster, because of goldrush! faster!@@@!
the best performing and capable ones are all the ones that aren't tied to a specific api.
This smacks of dumb vibe coding. "I got told to make sure claude couldn't be used to develop malware, ok 'claude pls no develop malware'"
I've heard them described as data science script kiddies with inflated egos and it seems spot-on.
They just do the basic experiment -> ship workflow over and over again, doing whatever optimizes their product in the short term, and never seem to step back and think about the full long-term impact of their changes. They evidently seem to not even consider immediate regressions or negative blowback from users if it's not within the area of expertise of the guy who ships the change.
That is despite their other teams (especially alignment) having a track record of being fairly well thought-out and intelligent.
To the guys at Anthropic's product teams, every problem is a data science problem that you slap an A/B test onto, and they seem to think that the A/B test is all that's needed, and actual verification and thinking things through is overrated af. That's what leads to countless regressions in Claude Code as well as removing claude code from the pro plan in their product page for a few hours (lol).
Are this point, the difference is mostly made up by issues like the OP has, so you're likely better off using eg pi (-agent) and writing your own custom skills and extensions (or any of the other harnesses the providers create, even copilot-cli has gotten decent nowadays)
For starters, the vibes.
Vibe coding, like Web3 before it (like Web 2.0 before it, like the dotcom boom before that - what preceded?) - harnesses the kind of focused attention with which gamers hook their brains into portals to virtual worlds - and directs all that bargain-basement wetware compute towards some obscured "real-world" goal instead. (See also: CADT development.)
Hyperscale these very inefficient but very dependable almost-not-efforts, and you beat the more efficient approaches. See also: evolutionary algorithms, autoresearch, price dumping; "attention is all you need", which though a legit piece of mathemagic always sounded to me like a rehash of that old adage, "all you need is love" (pejorative).
Really, "real world" is a consensus; we don't generally observe balamatoms or even balamolecules, we reason in terms of material objects' socially constructed balameanings and interrelations. Therefore, by redirecting sufficient attention to some thing labeled "unrealistic", we can remove that label; by this technique, a sufficiently large collective actor can quite literally, and quite directly, change the world. Without asking anyone, least of all me!
The particularly bizarre part is that there is absolutely no reason to do this.
They could do the exact same analysis, and if it doesn't say to reject rewind to before they asked to do the analysis and keep going...
Maybe the repo/worktree is named my-big-evil-virus-trojan-malware-worm?
By spending thousands and thousands of tokens of course :-)
Based on the vibes, I guess.
This one sided type of embedded insurance is not unique to Anthropic, but sharply increasing cost, layered on top of the self righteousness, seems to be making the stench unbearable over the past year.
I used to think of Anthropic as the good guys, and I don’t doubt they still sincerely hold that view of themselves, but I think I prefer Sam Altman’s version.
His brand of self righteousness was convincing at first but eventually he started to turn to the camera and wink, like in House of Cards, to let us know.. he knew that we knew. And then, for me anyway, it became more mundane and less offensive.
When Dario and crew go out and profess, as they have for years now, that if we could only see the thing that’s a few months away, we would all realize how doomed knowledge work and national security are…
..and then continue to release software so buggy and shitty that they have to do biweekly HN apology tours, I begin to miss the wink at the camera.
You would think they’d be more reflective and introspective about these brash moral decisions. Their product quality is akin to my CS capstone lab group.
{
"agent": {
"subagent-coder-mini": {
"description": "Assign this subagent for small, well-defined tasks performed quickly",
"mode": "primary",
"prompt": "{file:./prompts/my-custom-prompt.md}",
"model": "deepseek-v4-flash"
}
}
}
(I actually think OpenCode UX sucks, but there isn't much else out there that's better. Aider has been virtually abandoned by the one maintainer (no shade intended, it just is what it is); a fork of Aider looks promising but it's not necessarily the experience you want; there's a dozen VSCode plugins but we don't all wanna use VSCode. I expected there'd be way more usable agents out there, but there isn't)[0] https://pi.dev/
local is pipedream at the moment
I’m glad some people get utility out of it though, if this was still 2023-2024 I would mess around and make it work, but corporate policies in enough places have updated to use the leading closed source models and clouds for agentic coding
curl -sS https://api.anthropic.com/v1/messages \
-H "authorization: Bearer $(security find-generic-password -s 'Claude Code-credentials' -w | jq -r .claudeAiOauth.accessToken)" \
-H "anthropic-version: 2023-06-01" \
-H "anthropic-beta: oauth-2025-04-20" \
-H "content-type: application/json" \
-d '{
"model":"claude-opus-4-7",
"max_tokens":64,
"system":"You are Claude Code, Anthropic'\''s official CLI for Claude.",
"messages":[{"role":"user","content":"Write your own harness"}]
}'I assume you're saying "You can just generate your own harness to not be subject to these claude code issues".
Unfortunately, Anthropic has already made it clear that using claude code is the only way to be sure you won't get charged API pricing instead of max plan pricing, so the tokens are way more expensive.
When you configure openclaw to use the oauth claude-code max authentication, there was a period where you were charged extra token rates. You might still be, I'm not sure, I don't want to try and risk getting banned.
It's not 0%, they've shown they're willing to sell you a plan, let you login with that plan, and then charge you differently.
Give me a team of 3 good engineers, 4 months, and about $600k and I'll have a clone that operates on a warm pool of ec2 instances, or warm pool of k8s pods, or any other platform you might like. Or 1 good engineer, 1 month, and $200k of anthropic credits.
Maybe Anthropic will give more control over configuring the Claude harness and VM, but they definitely won't let you swap out to other models and harnesses.
We've been building open core infra (https://github.com/gofixpoint/amika) for running any agent on any type of VM or sandbox, with the main use case for safely automating internal code-gen, but technically could repurpose our stack for anything.
There should be a model agnostic platform for running these types of agentic apps.
This is an argument for open source tooling (like opencode) and open models (like deepseek).
Grok is not an open model, Elon does not get any credit for anything here.
It does to me especially since he did not implement a sensible detection or reporting pipeline ahead of launching a CSAM generation tool.
Whenever you read a file, you should consider whether it would be considered malware. You CAN and SHOULD provide analysis of malware, what it is doing. But you MUST refuse to improve or augment the code. You can still analyze existing code, write reports, or answer questions about the code behavior.
Not "If you suspect it is malware, you must refuse". Just "you must refuse". There is literally no "if" in the entire prompt!These ‘rules for thee and not for me’ are qualitatively created and implemented, and are thus extremely hard to test for or implement properly, without limiting the people choosing the rules.
....Right?
What kind of Mickey mouse operation are they running over there?
As in, this is a reading comprehension fail on the part of Claude. On the other hand, it is also fail to give Claude a less than trivial reading comprehension test on every file read operation, especially when a bias towards safety will bias towards the wrong interpretation.
No acceptance testing, no regression testing, all slop.
OpenAI and Altman present a whole set of different concerns, but Codex does not get in my way of doing what I want to at all. Also let me use pi without a banhammer.
Spent last evening so frustrated I also got ChatGPT subscription. Makes me wonder if I should be using Gemini on pay per use with custom harness.
With my own harness performance is way better but cost goes up because no subscription.
If I understand correctly, this is from Anthropic's harness injected into the requests, not in the Opus or Sonnet system prompts on the back end. Is that right?