To me, it is hard to reject this hypothesis today. The fact that Anthropic is rapidly trying to increase price may betray the fact that their recent lead is at the cost of dramatically higher operating costs. Their gross margins in this past quarter will be an important data point on this.
I think the tendency for graphs of model assessment to display the log of cost/tokens on the x axis (i.e. Artificial Analysis' site) has obscured this dynamic.
I think we're reaching the point where more developers need to start right-sizing the model and effort level to the task. It was easy to get comfortable with using the best model at the highest setting for everything for a while, but as the models continue to scale and reasoning token budgets grow, that's no longer a safe default unless you have unlimited budgets.
I welcome the idea of having multiple points on this curve that I can choose from, depending on the task. I'd welcome an option to have an even larger model that I could pull out for complex and important tasks, even if I had to let it run for 60 minutes in the background and it made my entire 5-hour token quota disappear in one question.
I know not everyone wants this mental overhead, though. I predict we'll see more attempts at smart routing to different models depending on the task, along with the predictable complaints from everyone when the results are less than predictable.
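A minimal sketch of what that kind of routing could look like, assuming a hand-written lookup table rather than a learned router. The tier names (`small`, `medium`, `large`), the effort labels and the task kinds are hypothetical placeholders, not real model IDs or API parameters.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    model: str   # placeholder tier name, not a real model ID
    effort: str  # placeholder effort label


# Hypothetical table: the cheapest tier assumed to be adequate for each task kind.
ROUTES = {
    "fanout": Route("small", "low"),
    "edit": Route("medium", "medium"),
    "architecture": Route("large", "high"),
}


def pick_route(task_kind: str) -> Route:
    """Return the configured route, falling back to the middle tier for unknown kinds."""
    return ROUTES.get(task_kind, ROUTES["edit"])


print(pick_route("fanout"))
print(pick_route("something-unexpected"))
```

The hard part is of course classifying the task in the first place, which is where the "less than predictable" complaints would come from.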
So there's a push for them to increase revenue per user, which brings us closer to the real cost of running these models.
Like, Apple computers are already quite pricey -- $1000 or $2000 or so for a decent one. But you can spec up one that’s a bit better (not really that much better) and they’ll charge you $10K, $20K, $30K. Some customers want that and many are willing to pay for it.
Is there an equivalent ultra-high-end LLM you can have if you’re willing to pay? Or does it not exist because it would cost too much to train?
The goal of every company that needs to make ever more money for investors is to earn more money while spending less. There are many ways of doing this without reducing the quality of the product, e.g. using less staff to do more, getting more compute out of the same energy, using cheaper or free energy, optimizing algorithms in ways that do not degrade quality, or growing by gaining more customers and breaking into new markets. Once you have made all these optimizations and the market is saturated, the only optimizations left are the ones where quality goes down or risk goes up. Quality, in that sense, is what you can get away with without customers jumping ship. So you will also work on locking customers in and making jumping ship look very hard and complicated.
But it completely ignores the fact that the price of the same intelligence is dropping by an order of magnitude (at least) every 12 months.
GPT o1 launched at $600/M output tokens and GPT4.5 launched at $150/M.
Opus 4.7 is $25/M for more intelligence
Or they are just not willing to burn obscene levels of capital like OpenAI.
The key question is how well a given model does the work, which is a lot harder to measure. But I think token costs are still an order of magnitude below the point where a US-based developer using AI for coding should be asking questions about price; at current price points, the cost/benefit question is dominated by what makes the best use of your limited time as an engineer.
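A rough way to frame that trade-off, assuming you can put a number on hours saved and on loaded hourly cost. All figures in the example are illustrative assumptions, not measurements.

```python
def break_even_hours(hourly_cost: float, token_spend: float) -> float:
    """Hours that must be saved for the token spend to pay for itself."""
    return token_spend / hourly_cost


def net_value(hours_saved: float, hourly_cost: float, token_spend: float) -> float:
    """Value of developer time saved minus token spend, in the same currency."""
    return hours_saved * hourly_cost - token_spend


# Illustrative assumption: 100/hour loaded cost, 200/month spent on tokens.
print(break_even_hours(hourly_cost=100.0, token_spend=200.0))
print(net_value(hours_saved=10.0, hourly_cost=100.0, token_spend=200.0))
```

The arithmetic is trivial; the open question is whether the hours-saved input is real, which is exactly the measurement problem.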
We already shipped 3 things this year built using Claude. The biggest one was porting two native apps into one React Native app, which was originally estimated to be a 6-7 month project for a 9 FTE team and ended up being a 2-month project with 2 people. To me, the economic value of a Claude subscription used right is in the range of 10-40k eur, depending on the type of work and the developer driving it. If Anthropic jacked the prices 100x today, I'd still buy the licenses for my guys.
Edit: ok, if they charged 20k per month per seat I'd also start benchmarking the alternatives and local models, but for my business case, running a 700M budget, Claude brings disproportionate benefits, not just in time saved in developer costs, but also faster shipping times, reduced friction between various product and business teams, and so on. For the first time we generally say 'yes' to whichever frivolities our product teams come up with, and that's a nice feeling.
They are talking about every IC becoming an EM, managing teams of agents.
Did you see the leak of Meta’s token consumption? That’s waaay more than you can get for a small $200 a month plan.
Also, I think the $200/mo plan is subsidized by VC money and is likely hemorrhaging money for Anthropic, so it's not really meaningful to reason around that.
IF they can be shown to be multiplying developer productivity (completing more projects on time, without reduction in quality and associated costs) by some significant amount then they are providing value at current cost, but it's not at all clear whether that is in fact the case, especially since most of the claims of productivity are anecdotal and/or based on things like LOC generated rather than delivered functionality.
Meta's "token usage leaderboard" shows how far some companies are from measuring anything meaningful! It'd be exactly like some company in the .com era measuring employee's "productivity" by how many bytes they'd downloaded from the internet each day (even if that was just a cat video). "Woo hoo, we're out-internetting you! Our internet bill is enormous!" (then proceeds to fire the guy coding, and gives a bonus to the one downloading cat videos).
There have been some studies/polls done indicating that some very high percentage (90%?) of corporate AI projects are failing. Why is this? Are they ill-conceived and/or ill-executed? Is it the quality of what's being produced that is causing these projects to be abandoned and/or considered failures?
There have also been some separate studies indicating programmer productivity to be reduced, not increased, by use of AI coding tools, which is easy to understand. The developer struggles with the tool and its fallibilities, eventually gets it to generate something that works, then closes his JIRA story with an "AI coded" tag (which shows up on the boss's dashboard, and is all that he sees). Was this an AI productivity success story? To the boss perhaps, but not if the developer admits that it would have just been faster to do it the old way by hand or cut-n-paste from Stack Overflow.
The $200/month is an individual subscription.
It is like comparing an 8K display to a 16K display: at normal viewing distance, the difference is imperceptible, but 16K comes at a significant premium.
The same applies to intelligence. Sure, some users might register a meaningful bump, but if 99% can't tell the difference in their day-to-day work, does it matter?
A 20-30% cost increase needs to deliver a proportional leap in perceivable value.
This is also why I don't see the models getting commoditized anytime soon - the dimensionality of LLM output that is economically relevant keeps growing linearly for coding (therefore the possibility space of LLM outputs grows exponentially) which keeps the frontier nontrivial and thus not commoditized.
In contrast, there is not much demand for 100 page articles written by LLMs in response to basic conversational questions, therefore the models are basically commoditized at answering conversational questions because they have already saturated the difficulty/usefulness curve.
For coding though, there is kind of no limit to the complexity of software. The more invariants and potential interactions the model can be aware of, the better presumably. It can handle larger codebases. Probably past the point where humans could work on said codebases unassisted (which brings other potential problems).
It's not necessarily a single discrete point, I think. In my experience, it's tied to the quality/power of your harness and tooling. More powerful tooling has revealed differences between models that were previously not easy to notice. This matches your display analogy, because I'm essentially saying that the point at which display resolution improvements become imperceptible depends on how far away you sit.
And it's not that they "don't notice" it's that they physically can't distinguish finer angular separation.
Until it's making 100k decisions a day and many are dependent on previous results.
I was always wondering where that breaking point for cost/performance is for displays. I use 4K 27” and it’s noticeably much better for text than 1440p@27, but no idea if the next and final stop is 6K or 8K?
You mean a couple of years ago?
https://docs.github.com/fr/copilot/reference/ai-models/suppo...
At 7.5x for 4.7, heck no. It isn't even clear it is an upgrade over Opus 4.6.
The final calculation assumes that Opus 4.7 uses the exact same trajectory + reasoning output as Opus 4.6. I have not verified, but I assume it not to be the case, given that Opus 4.7 on Low thinking is strictly better than Opus 4.6 on Medium, etc., etc.
"given that Opus 4.7 on Low thinking is strictly better than Opus 4.6 on Medium, etc., etc.”
Opus 4.7 in general is more expensive for similar usage. Now we can argue that it provides better performance, all else being equal, but I haven’t been able to see that.
https://www.anthropic.com/_next/image?url=https%3A%2F%2Fwww-...
I think a big issue with the industry right now is it's constantly chasing higher performing models and that comes at the cost of everything else. What I would love to see in the next few years is all these frontier AI labs go from just trying to create the most powerful model at any cost to actually making the whole thing sustainable and focusing on efficiency.
The GPT-3 era was a taste of what the future could hold, but those models were toys compared to what we have today. We saw real gains during the GPT-4 / Claude 3 era, where they could start being used as tools but required quite a bit of oversight. Now in the GPT-5 / Claude 4 era I don't really think we need to go much further; we should start focusing on efficiency and sustainability.
What I would love the industry to start focusing on in the next few years is not on the high end but the low end. Focus on making the 0.5B - 1B parameter models better for specific tasks. I'm currently experimenting with fine-tuning 0.5B models for very specific tasks and long term I think that's the future of AI.
If you can forgive the obviously-AI-generated writing, [CPUs Aren't Dead](https://seqpu.com/CPUsArentDead) makes an interesting point on AI progress: Google's latest, smallest Gemma model (Gemma 4 E2B), which can run on a cell phone, outperforms GPT-3.5-turbo. Granted, this factoid is based on `MT-Bench` performance, a benchmark from 2023 which I assume to be both fully saturated and leaked into the training data for modern LLMs. However, cross-referencing [Artificial Analysis' Intelligence Index](https://artificialanalysis.ai/models?models=gemma-4-e2b-non-...) suggests that indeed the latest 2B open-weights models are capable of matching or beating 175B models from 3-4 years ago. Perhaps more impressive, [Gemma 4 E4B matches or beats GPT-4o](https://artificialanalysis.ai/models?models=gemma-4-e4b%2Cge...) on many benchmarks.
If this trend continues, perhaps we'll have the capabilities of today's best models available to reasonably run on our laptops!
I personally think the whole "the newest model is crazy! You've gotta use X (insert most expensive model)" is just FOMO and marketing-prone people parroting whatever they've seen in the news or online.
Surely you can see the first lab that solves this gains a massive advantage?
Given that Opus 4.6 and even Sonnet 4.6 are still valid options, for me the question is not "Does 4.7 cost more than claimed?" but "What capabilities does 4.7 give me that 4.6 did not?"
Yesterday 4.6 was a great option and it is too soon for me to tell if 4.7 is a meaningful lift. If it is, then I can evaluate if the increased cost is justified.
I'll look at the new models, but increasing the token consumption by a factor of 7 on Copilot, and then running into all of these budget management topics people talk about? That seems to introduce even more flow-breakers into my workflow, and I don't think it'll be 7 times better. Maybe in some planning and architectural topics where I used Opus 4.6 before.
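Back-of-the-envelope version of that budget worry, assuming a plan with a fixed monthly allowance of request units and a per-model multiplier. The 300-unit allowance is an illustrative assumption; 7.5 is the multiplier quoted elsewhere in this thread.

```python
def requests_in_budget(monthly_allowance: float, multiplier: float) -> int:
    """How many requests fit in the allowance when each one costs `multiplier` units."""
    if multiplier <= 0:
        raise ValueError("multiplier must be positive")
    return int(monthly_allowance // multiplier)


# Illustrative allowance of 300 units per month (an assumption, not a quoted plan).
for multiplier in (1.0, 3.0, 7.5):
    print(multiplier, requests_in_budget(300, multiplier))
```

At a 7.5x multiplier the same allowance covers a small fraction of the requests, so every prompt becomes a budgeting decision.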
How do you have an opinion on 4.6/4.7 here? It's less clear but I could totally see that 4.7 or beyond leads to project completion 20% faster, by removing dead ends, foot guns, less backtracking, etc.
How to tell / measure effectively? No clue.
Human psychology is surprisingly similar, and the same pattern shows up across domains.
I haven't bought Pringles chips in years; even the box now is nothing like it was. Thinner. Shorter. I can imagine how far from the top the slices stack up.
> max: Max effort can deliver performance gains in some use cases, but may show diminishing returns from increased token usage. This setting can also sometimes be prone to overthinking. We recommend testing max effort for intelligence-demanding tasks.
> xhigh (new): Extra high effort is the best setting for most coding and agentic use cases
Ref: https://platform.claude.com/docs/en/build-with-claude/prompt...
I'm already at 27% of my weekly limit in ONE DAY.
it seems to hallucinate a bit more (anecdotal)
Ouch, that's very different than my experience. What effort level? Are you careful to avoid pushing session context use beyond 350k or so (assuming 1m context)?
Today we are almost on non-speaking terms. I'm asking it to do some simple stuff and he's making incredibly stupid mistakes:
This is the third time that I have to ask you to remove the issue that was there for more than 20 hours. What is going on here?
and at the same time the compacting is firing like crazy (which adds ~4 minute delays every 1-15 minutes):

| # | Time | Gap before | Session span | API calls |
|---|----------|-----------|--------------|-----------|
| 1 | 15:51:13 | 8s | <1m | 1 |
| 2 | 15:54:35 | 48s | 37m | 51 |
| 3 | 16:33:33 | 2s | 19m | 42 |
| 4 | 16:53:44 | 1s | 9m | 30 |
| 5 | 17:04:37 | 1s | 17m | 30 |
#: sequential compaction event number, ordered by time.
Time: timestamp of the first API call in the resumed session, i.e. when the new context (carrying the compaction summary) was first sent to the model.
Gap before: time between the last API call of the prior session and the first call of this one. Includes any compaction processing time plus user think time between the two sessions.
Session span: how long this compaction-resumed session ran, from its first API call to its last before the next compaction (or end of session).
API calls: total number of API requests made during this resumed session. Each tool use, each reply, each intermediate step = one request.
Bottom line, I will probably stay on Sonnet until they fix all these issues.
I don't know if that message is something you've actually sent to Claude, but I don't think it's a good way of using Claude.
It's not a collaborator who's having a bad day where a little empathy might make him feel better and realize his error. It's a token generator based on a prompt which includes all chat history. If you have three examples of the bad approach in the history, in a format that looks like Claude doing work, it will totally pollute it! And even worse with auto-compaction where you don't know exactly what of those false starts is getting summarized into its context.
You have to treat this like a tool and understand how it works.
If Claude is going down a wrong path it's better to cancel and rewind and improve the previous addition to the prompt. You don't want it to generate a bunch of misleading tokens for itself and leave it in the context window indefinitely!
Claude and other LLMs do not have a gender; they are not a “he”. Your LLM is a pile of weights, prompts, and a harness; anthropomorphising like this is getting in the way.
You’re experiencing what happens when you sample repeatedly from a distribution. Given enough samples the probability of an eventual bad session is 100%.
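To put numbers on that: if each session independently goes bad with probability p, the chance of hitting at least one bad session in n tries is 1 - (1 - p)^n, which approaches (but strictly never reaches) 100%. A minimal sketch; the 5% per-session rate is an illustrative assumption, and real sessions are not perfectly independent.

```python
def p_at_least_one_bad(p_bad: float, sessions: int) -> float:
    """Probability of at least one bad session, assuming independent sessions."""
    return 1.0 - (1.0 - p_bad) ** sessions


# Illustrative assumption: 5% of sessions go off the rails.
for n in (1, 10, 50, 100):
    print(f"{n:>3} sessions: {p_at_least_one_bad(0.05, n):.1%}")
```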
Just clear the context, roll back, and go again. This is part of the job.
You should just revert the context and provide more detail and rationale in the message.
> You're right, that was a shit explanation. Let me go look at what V1 MTBL actually is before I try again.
> Got it — I read the V1 code this time instead of guessing. Turns out my first take was wrong in an important way. Let me redo this in English.
:facepalm:
https://matrix.dev/blog-2026-04-16.html (We were talking to Opus 4.7 twelve days ago)
Recently it started prompting me for feedback even though I am on API access and have disabled this. When I did a deep dive of their feedback mechanism in the past (months ago, so it has probably changed a lot since then) the feedback prompt was pushing message ids even if you didn't respond. If you are on API usage and have told them no to training on your data then anything pushing a message id implies that it is leaking information about your session. It is hard to keep auditing them when they push so many changes, so I am now 'default they are stealing my info' instead of believing their privacy/data use policy claims. Basically, my level of trust in their commitment to not training on me is eroding fast, and I am paying a premium to not have that happen.
You're offended by their political beliefs, so you don't like the way the model works?
Looks like they lost the mandate of heaven; if OpenAI plays it right, it might be their end. Add to that the open source models from China.
When I read these comments on Hacker News, I see a lot of people miffed about their personal subscription limits. I think this is a viewpoint that is very consumer focused, and probably within Anthropic they're seeing buckets of money being dumped on them from enterprises. They probably don't really care as much about the individual subscription user, especially power users.
2. Anthropic and OpenAI's financials are totally different. The former has nearly the same RRR and a fraction of the cash burn. There is a reason Anthropic is hot on secondary and OAI isn't
And then it proceeded to rewrite the block with a dict lookup plus if-elses, instead of using match/case. I had to nag it to actually rewrite the code the way it said it would!
Again, the article's numbers are likely a rather crude approximation, but taking 85% accuracy (claude 4.6) vs 90% (4.7) as inputs:
4.6 1 iteration 85%
4.7 1 iteration 90%
4.6 5 iterations 44.37%
4.7 5 iterations 59.05%
4.6 10 iterations 19.69%
4.7 10 iterations 34.87%
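Those figures are just per-step accuracy raised to the number of chained steps. A minimal sketch for checking them, assuming every step succeeds independently with the same probability (a simplification; real agent steps are correlated):

```python
def chain_success(per_step_accuracy: float, steps: int) -> float:
    """Probability that every step in a chain succeeds, assuming independent steps."""
    return per_step_accuracy ** steps


for steps in (1, 5, 10):
    old = chain_success(0.85, steps)
    new = chain_success(0.90, steps)
    print(f"{steps:>2} iterations: {old:.2%} vs {new:.2%} (ratio {new / old:.2f}x)")
```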
Compounded, small improvements really move the needle downstream. 1.4x doesn't seem worth it for 5% better, but 10 calls in, that's more than a 40% improvement.
I would rather steer quickly, get ideas because I'm moving quickly, do course correction quickly - basically I'm not happy blocking my chain of thought/concentration and falling prey to distractions due to Claude's slowness and compaction cycles. Sometimes I don't even notice that Codex has compacted.
For architectural discussions, sure I'll pick Claude. I'm mentally prepared for that. But once we are in the thick of things, speed matters. I would rather they focus on improving Sonnet's speed.
This is already becoming apparent as users are seeing quality degrade, which implies that Anthropic is dropping performance across the board to minimize financial losses.
I am finding that for complex tasks, Claude's quality of output varies _tremendously_ with repeated runs of the same model and prompt. For example, last week I wrote up (with my own brain and keyboard) a somewhat detailed plain english spec of a work-related productivity app that I've always wanted but never had the time to write. It was roughly the length of an average college essay. The first thing I asked Claude to do was not write any code, but come up with a more formal design and implementation plan based on the requirements that I gave. The idea was to then hand _that_ to Claude and say, okay, now build it.
I used Opus 4.6 with High reasoning for all of this and did not change any model settings between runs.
The first run was overall _amazing_. It was detailed, well-written, contained everything that I asked for. The only drawback was that I was ambiguous on a couple of points which meant that the model went off and designed something in a way that I wasn't expecting and didn't intend. So I cleared that up in my prompt, and instead of keeping the context and building on what was already there, I started a new chat and had it start again from scratch.
What it wrote the second time was _far_ less impressive. The writing was terse, there was a lot less detail, the pretty dependency charts and various tables it made the first time were all gone. Lots of stuff was underspecified or outright missing.
New chat, start again. Similar results as the second run, maybe a bit worse. It also started _writing code_ which was something I told it NOT to do. At this point I'm starting to panic a little because I'm sure I didn't add, "oh, and make it crappy" to the prompt and I was a little angry about not saving the first iteration since it was fairly close to what I had wanted anyway.
I decided to try one last time and it finally gave me back something within about 95% of the first run in terms of quality, but with all the problems fixed. So, I was (finally) happy with that, and it used that to generate the application surprisingly well, with only a few issues that should not be too hard to fix after the fact.
So I guess 4th time was a charm, and the fare was about $7 in tokens to get there.
Except, it's not that trivial to solve. I tried experimenting with asking the model to first give a list of symbols it will modify, and then just write the modified symbols. The results were OK, but less refined than when it echoes back the entire file.
The way I see it is that when you echo back the entire file, the process of thinking "should I do an edit here" is distributed over a longer span, so it has more room to make a good decision. Like instead of asking "which 2 of the 10 functions should you change" you're asking it "should you change method1? what about method2? what about method3?", etc., and that puts less pressure on the LLM.
Except, currently we are effectively paying for the LLM to make that decision for *every token*, which is terribly inefficient. So, there has to be some middle ground between expensively echoing back thousands of unchanged tokens and giving an error-ridden high-level summary. We just haven't found that middle ground yet.
grit.io was working on this years ago, not sure if they are still alive/around, but I liked their approach (just had a very buggy transformer/language).
I thought coding harnesses provided tools to apply diffs so the LLM didn't have to echo back the entire file?
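For illustration, a minimal sketch of the kind of search-and-replace edit tool being described: the model emits only an old/new snippet pair, so output tokens scale with the size of the change rather than the size of the file. `apply_edit` is a hypothetical helper written for this comment, not any particular harness's actual tool.

```python
def apply_edit(source: str, old: str, new: str) -> str:
    """Replace exactly one occurrence of `old` with `new`.

    Refuses missing or ambiguous matches so a sloppy edit fails loudly
    instead of silently patching the wrong place.
    """
    count = source.count(old)
    if count != 1:
        raise ValueError(f"expected exactly one match for the edit, found {count}")
    return source.replace(old, new, 1)


original = "def area(r):\n    return 3.14 * r * r\n"
patched = apply_edit(original, "3.14", "math.pi")
print(patched)
```

The trade-off raised above still applies: the model has to decide up front where to edit, instead of spreading that decision over the whole echoed file.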
And if it's not good enough for coding, what kind of money, if any, would make it good enough?
Do yourself a favor: Set up OpenCode and OpenRouter, and try all the models you want to try there.
Other than the top performers (e.g. GLM 5.1, Kimi K2.5, where required hardware is basically unaffordable for a single person), the open models are more trouble than they're worth IMO, at least for now (in terms of actually Getting Shit Done).
Many providers out there host open weights models for cheap, try them out and see what you think before actually investing in hardware to run your own.
Fun fact: AWS offers apple silicon EC2 instances you can spin up to test.
I took the plan that I used from Codex and handed it to opencode with Qwen 3.5 running locally.
It created a library very similar to Codex but took 2x longer.
I haven't tried Qwen 3.6 but I hear it's another improvement. I'm confident with my AI skills that if/when cheap/subsidized models go away, I'll be fine running locally.
The best bang for the buck now is subscribing to token plans from Z.ai (GLM 5.1), MiniMax (MiniMax M2.7) or Alibaba Cloud (Qwen 3.6 Plus).
Running quantized models won't give you results comparable to Opus or GPT.
Re-ran the bake-off with 4.7 authoring and… gpt5.4 still clearly winning. Same skills, same prompts, same agents.md.
This is not so much about my instructions being followed more closely. It's the LLM being smarter about what's going on and, for example, saving me time on unnecessary expeditions. This is where models have most notably been getting better in my experience. Understanding the bigger picture. Applying taste.
It's harder to measure, of course, but, at least for my coding needs, there is still a lot of room here.
If one session costs an additional 20% that's completely fine, if that session gets me 20% closer to a finished product (or: not 20% further away). Even 10% closer would probably still be entirely fine, given how cheap it is.
However, if you are using API costs then I guess you're left holding the bag.
But it looks like it's just creeping up. Probably because we're paying for construction, not just inference right now.
I am clearly missing something, but wouldn't this be an ideal thing to do? Surely if it was optimised it would use fewer tokens while not losing anything from the instructions?
4.6 performs worse or the same in most of the tasks I have. If anything made me use 4.6 more frequently, it's that 4.5 got dumber, not that 4.6 seemed smarter.
So far it costs a lot less, because I'm not going to be using it.
This was what I thought was my best moat as a senior dev. No other model has been able to come close to the throughput I could achieve on my own before. Might be a fluke of course, and they've picked up a few patterns in training that applies to this particular problem and doesn't generalize. We'll see.
???
please i beg post the prompt and the refusal
I literally can not imagine a model refusing to do something
So the new tokenizer's extra cost for English/code is there to support SHOUTING in English?
> In Claude Code, we’ve raised the default effort level to xhigh for all plans.
Try changing your effort level and see what results you get
I find 5 thinking levels to be super confusing - I don't really get why they went from 3 -> 5
claude code on opus continuously = whole bill. different measurement.
haiku 4.5 is good enough for fanout. opus earns it on synthesis where you need long context + complex problem solving under constraints
Commercial inference providers serve Chinese models of comparable quality at 0.1x-0.25x. I think Anthropic realised that the game is up and they will not be able to hold the lead in quality forever so it's best to switch to value extraction whilst that lead is still somewhat there.
"Comparable" is doing some heavy lifting there. Comparable to Anthropic models in 1H'25, maybe.
https://platform.claude.com/docs/en/about-claude/pricing
So if you are generating more tokens, you are eating up your usage faster
Why release this?
Feels like LLMs are devolving into having a single, instantly recognizable and predictable writing style.
People love to throw around "this is the dumbest AI will ever be", but the corollary to that is "this is the most aligned the incentives between model providers and customers will ever be" because we're all just burning VC money for now.
Please say this louder for everyone to hear. We are still at the stage where it is best for Anthropic's product to be as consumer aligned (and cost-friendly) as possible. Anthropic is losing a lot of money. Both of those things will not be true in the near future.
That's one market segment - the high priced one, but not necessarily the most profitable one. Ferrari's 2025 income was $2B while Toyota's was $30B.
Maybe a more apt comparison is Sun Microsystems vs the PC Clone market. Sun could get away with high prices until the PC Clones became so fast (coupled with the rise of Linux) that they ate Sun's market and Sun went out of business.
There may be a market for niche expensive LLMs specialized for certain markets, but I'll be amazed if the mass coding market doesn't become a commodity one with the winners being the low cost providers, either in terms of API/subscriptions costs, or licensing models for companies to run on their own (on-prem or cloud) servers.
And now maintaining that pace means absorbing arbitrary price increases, shrugged off with “we were operating at a loss anyway.”
It stops being “pay to play” and starts looking more like pay just to stay in the ring, while enterprise players barely feel the hit and everyone else gets squeezed out.
Market maturing my butthole... it’s obviously a dependency being priced in real time. Tech is an utter shit show right now, compounded by the disaster of the unemployment market still reeling from the overhiring of 2020.
save up now and career pivot. pick up gardening.
"Utility" is close, but "energy source" may be closer. When it becomes the thing powering the pace of work itself, raising prices is less about charging for access and more about taxing dependency.