Measuring Claude 4.7's tokenizer costs (opens in new tab)

(claudecodecamp.com)

714 pointsaray071mo ago498 comments

498 comments

204 comments · 85 top-level

louiereederson1mo ago· 13 in thread

LLMs exist on a logaritmhic performance/cost frontier. It's not really clear whether Opus 4.5+ represent a level shift on this frontier or just inhabits place on that curve which delivers higher performance, but at rapidly diminishing returns to inference cost.

To me, it is hard to reject this hypothesis today. The fact that Anthropic is rapidly trying to increase price may betray the fact that their recent lead is at the cost of dramatically higher operating costs. Their gross margins in this past quarter will be an important data point on this.

I think the tendency for graphs of model assessment to display the log of cost/tokens on the x axis (i.e. Artificial Analysis' site) has obscured this dynamic.

louiereederson1mo ago

I meant reference Toby Ord's work here. I think his framing of the performance/cost frontier hasn't gotten enough attention https://www.tobyord.com/writing/hourly-costs-for-ai-agents

2 more replies

Aurornis1mo ago

> It's not really clear whether Opus 4.5+ represent a level shift on this frontier or just inhabits place on that curve which delivers higher performance, but at rapidly diminishing returns to inference cost.

I think we're reaching the point where more developers need to start right-sizing the model and effort level to the task. It was easy to get comfortable with using the best model at the highest setting for everything for a while, but as the models continue to scale and reasoning token budgets grow, that's no longer a safe default unless you have unlimited budgets.

I welcome the idea of having multiple points on this curve that I can choose from. depending on the task. I'd welcome an option to have an even larger model that I could pull out for complex and important tasks, even if I had to let it run for 60 minutes in the background and made my entire 5-hour token quota disappear in one question.

I know not everyone wants this mental overhead, though. I predict we'll see more attempts at smart routing to different models depending on the task, along with the predictable complaints from everyone when the results are less than predictable.

9 more replies

snek_case1mo ago

They're also getting closer to IPO and have a growing user base. They can't justify losing a very large number of billions of other people's money in their IPO prospectus.

So there's a push for them to increase revenue per user, which brings us closer to the real cost of running these models.

3 more replies

ethin1mo ago

I mean, the signs have been there that the costs to run and operate these models wasn't as simple as inference costs. And the signs were there (and, arguably, are still there) that it costs way, way more than many people like to claim on the part of Anthropic. So to me this price hike is not at all surprising. It was going to come eventually, and I suspect it's nowhere near over. It wouldn't surprise me if in 2-3 years the "max" plan is $800 or $2000 even.

1 more reply

iainmerrick1mo ago

That sounds very plausible. But it implies they could offer even higher performance models at much higher costs if they chose to; and presumably they would if there were customers willing to pay. Is that the case? Surely there are a decent number of customers who’d be willing to pay more, much more, to get the very best LLMs possible.

Like, Apple computers are already quite pricey -- $1000 or $2000 or so for a decent one. But you can spec up one that’s a bit better (not really that much better) and they’ll charge you $10K, $20K, $30K. Some customers want that and many are willing to pay for it.

Is there an equivalent ultra-high-end LLM you can have if you’re willing to pay? Or does it not exist because it would cost too much to train?

2 more replies

svantana1mo ago

FWIW, Artificial Analysis has a "Intelligence vs Cost" plot on their front page that shows models' score vs cost to run the benchmark, which should be more fair in this sense. According to that one, Opus 4.7 (max) is slightly cheaper than 4.6 (though still very expensive).

conductr1mo ago

Yeah. Combine this with much of Corpos right now using a “burn as many tokens as you need” policy on AI, the incentive is there for them to raise price and find an equilibrium point or at least reduce the bleed.

amelius1mo ago

Once they implement their models directly in silicon, the cost will come down and the speed will go up. See Taalas.

1 more reply

Lihh271mo ago

heh adaptive thinking is letting the meter run itself. they make more when it runs longer.

atoav1mo ago

For me it was pretty clear from the start that costs will have to increase. It is the classical drug dealer model: first you hook them with cheap supply, maybe even free, then you slowly jack the prize up to a level that can (just) be sustained. Then you decrease the quality of the product by diluting it so you get more bucks for each gram you bought. You could also call it enshittification if you like.

The goal of every company that needs to make ever more money for investors is to earn more money while spending less. There are many ways of doing this without reducing the quality of the product, e.g. using less staff to do more, getting more compute out of same the energy, using cheaper or free energy, optimizing algorithms in ways that do not degrade quality or you grow because you gain more customers and break into new markets etc. And once you made all these optimizations and the market is saturated, then the only optimizations left are the ones where the quality goes down or the risk is increased. Quality in that sense, is what you can get away with without customers jumping ship. So you will also work on locking customers in and make jumping ship look very hard and complicated.

nl1mo ago

This is a bad take. It's not really wrong in the sense that yes higher performance does cost more.

But it ignores completely the fact that the same intelligence is dropping by an order of magnitude (at least) every 12 months.

GPT o1 launched at $600/M output tokens and GPT4.5 launched at $150/M.

Opus 4.7 is $25/M for more intelligence

jimiljojo1mo ago

What a well thought and written comment. I totally agree.

1 more reply

paulddraper1mo ago

> The fact that Anthropic is rapidly trying to increase price may betray the fact that their recent lead is at the cost of dramatically higher operating costs.

Or they are just not willing to burn obscene levels of capital like OpenAI.

1 more reply

tabbott1mo ago· 9 in thread

I find it interesting that folks are so focused on cost for AI models. Human time spent redirecting AI coding agents towards better strategies and reviewing work, remains dramatically more expensive than the token cost for AI coding, for anything other than hobby work (where you're not paying for the human labor). $200/month is an expensive hobby, but it's negligible as a business expense; SalesForce licenses cost far more.

The key question is how well it a given model does the work, which is a lot harder to measure. But I think token costs are still an order of magnitude below the point where a US-based developer using AI for coding should be asking questions about price; at current price points, the cost/benefit question is dominated by what makes the best use of your limited time as an engineer.

aenis1mo ago

That.

We already shipped 3 things this year built using Claude. The biggest one was porting two native apps into one react native app - which was originally estimated to be a 6-7 month project for a 9 FTE team, and ended up being a 2 months project with 2 people. To me, the economic value of a claude subscription used right is in the range of 10-40k eur, depending on the type of work and the developer driving it. If Anthropic jacked the prices 100x today, I'd still buy the licenses for my guys.

Edit: ok, if they charged 20k per month per seat I'd also start benchmarking the alternatives and local models, but for my business case, running a 700M budget, Claude brings disproportionate benefis, not just in time saved in developer costs, but also faster shipping times, reduced friction between various product and business teams, and so on. For the first time we generally say 'yes' to whichever frivolities our product teams come up with, and thats a nice feeling.

3 more replies

Ifkaluva1mo ago

$200 a month is not what the BigTechs are talking about.

They are talking about every IC becomes an EM, managing teams of agents.

Did you see the leak of Meta’s token consumption? That’s waaay more than you can get for a small $200 a month plan.

1 more reply

guelo1mo ago

Since Anthropic has capacity problems I'm pretty sure they're limiting the $20/month guys to serve the $200/month business plans. I'm afraid coding will increasingly become pay-to-play. Luckily there is good competition.

hyraki1mo ago

Yes 200 as a business expense is really not that bad. But a hobby is hard to justify.

1 more reply

lnrd1mo ago

Only small businesses and startups pay $200/month, most medium+ sized companies will have an enterprise plan and pay by token usage to access the security, privacy, and compliance guarantees that their legal and security teams require.

Also, I think the $200/mo plan is subsidized by VC money and is likely hemorrhaging money for Anthropic, so it's not really meaningful to reason around that.

HarHarVeryFunny1mo ago

It seems far from clear at this point what the dollar value of agentic coding tools is if measured objectively in terms of value delivered.

IF they can be shown to be multiplying developer productivity (completing more projects on time, without reduction in quality and associated costs) by some significant amount then they are providing value at current cost, but it's not at all clear whether that is in fact the case, especially since most of the claims of productivity are anecdotal and/or based on things like LOC generated rather than delivered functionality.

Meta's "token usage leaderboard" shows how far some companies are from measuring anything meaningful! It'd be exactly like some company in the .com era measuring employee's "productivity" by how many bytes they'd downloaded from the internet each day (even if that was just a cat video). "Woo hoo, we're out-internetting you! Our internet bill is enormous!" (then proceeds to fire the guy coding, and gives a bonus to the one downloading cat videos).

There have been some studies/polls done indicating that some very high percentage (90%?) of corporate AI projects are failing. Why is this? Are they ill-conceived, and or ill-executed? Is it the quality of what's being produced that is causing these projects to be abandoned and/or considered as a failure?

There have also been some separate studies indicating programmer productivity to be reduced, not increased, by use of AI coding tools, which is easy to understand. The developer struggles with the tool and it's fallibilities, eventually gets it to generate something that works, then closes his JIRA story with an "AI coded" tag (which shows up on the boss's dashboard, and is all that he sees). Was this an AI productivity success story? To the boss perhaps, but not if the developer admits that it would have just been faster to do it the old way by hand or cut-n-paste from stack overflow.

chis1mo ago

Yeah completely agree. Even out of my own pocket I'd be willing to spend ~1k a month for the current AI, as compared to not having any AI at all. And I bet I could convince an employer to drop 5k a month on it for me. The consumer surplus atm is insane.

paulddraper1mo ago

Claude is far more than $200/month if you use their Enteprise plan.

The $200/month is an individual subscription.

vessenes1mo ago

I mean, my openclaw instance was billing $200 a day for Opus after they banned using the max subscription. I think a fair amount of that was not useful use of Opus; so routing is the bigger problem. but, that sort of adds up, you know! At $1/hr, I loved Openclaw. At $15/hour, it's less competitive.

_pdp_1mo ago· 17 in thread

IMHO there is a point where incremental model quality will hit diminishing returns.

It is like comparing an 8K display to a 16K display because at normal viewing distance, the difference is imperceptible, but 16K comes at significant premium.

The same applies to intelligence. Sure, some users might register a meaningful bump, but if 99% can't tell the difference in their day-to-day work, does it matter?

A 20-30% cost increase needs to deliver a proportional leap in perceivable value.

highfrequency1mo ago

I believe that's why 90% of the focus in these firms is on coding. There is a natural difficulty ramp-up that doesn't end anytime soon: you could imagine LLMs creating a line of code, a function, a file, a library, a codebase. The problem gets harder and harder and is still economically relevant very high into the difficulty ladder. Unlike basic natural language queries which saturate difficulty early.

This is also why I don't see the models getting commoditized anytime soon - the dimensionality of LLM output that is economically relevant keeps growing linearly for coding (therefore the possibility space of LLM outputs grows exponentially) which keeps the frontier nontrivial and thus not commoditized.

In contrast, there is not much demand for 100 page articles written by LLMs in response to basic conversational questions, therefore the models are basically commoditized at answering conversational questions because they have already saturated the difficulty/usefulness curve.

1 more reply

ZeroCool2u1mo ago

Whenever we get the locally runnable 4k models things are going to get really awkward for the big 3 labs. Well at least Google will still have their ad revenue I guess.

3 more replies

levocardia1mo ago

Depends a lot on the task demands. "Got 95% of the way to designing a successful drug" and "Got 100% of the way" is a huge difference in terms of value, and that small bump in intelligence would justify a few orders of magnitude more in cost.

1 more reply

snek_case1mo ago

It probably depends what you're using the models for. If you use them for web search, summarizing web pages, I can imagine there's a plateau and we're probably already hitting it.

For coding though, there is kind of no limit to the complexity of software. The more invariants and potential interactions the model can be aware of, the better presumably. It can handle larger codebases. Probably past the point where humans could work on said codebases unassisted (which brings other potential problems).

1 more reply

simplyluke1mo ago

I'm seeing a lot of sentiment, and agree with a lot of it, that opus 4.6 un-nerfed is there already and for many if not most software use cases there's more value to be had in tooling, speed, and cost than raw model intelligence.

nisegami1mo ago

>IMHO there is a point where incremental model quality will hit diminishing returns.

It's not necessary a single discrete point I think. In my experience, it's tied to the quality/power of your harness and tooling. More powerful tooling has made revealed differences between models that were previously not easy to notice. This matches your display analogy, because I'm essentially saying that the point at which display resolution improvements are imperceptible matters on how far you sit.

aray07OP1mo ago

yeah thats is my biggest issue - im okay with paying 20-30% more but what is the ROI? i dont see an equivalent improvement in performance. Anthropic hasnt published any data around what these improvements are - just some vague “better instruction following"

2 more replies

Rapzid1mo ago

At normal viewing distance(let's say cinema FOV) most people won't see a difference between 4k and 8k never mind 16k.

And it's not that they "don't notice" it's that they physically can't distinguish finer angular separation.

mlinsey1mo ago

I agree, but also the model intelligence is quite spikey. There are areas of intelligence that I don't care at all about, except as proxies for general improvement (this includes knowledge based benchmarks like Humanity's Last Exam, as well as proving math theorems etc). There are other areas of intelligence where I would gladly pay more, even 10X more, if it meant meaningful improvements: tool use, instruction following, judgement/"common sense", learning from experience, taste, etc. Some of these are seeing some progress, others seem inherent to the current LLM+chain of thought reasoning paradigm.

1 more reply

jasonjmcghee1mo ago

It's more like, if it gets it right 99% of the time, that sounds incredible.

Until it's making 100k decisions a day and many are dependent on previous results.

wellthisisgreat1mo ago

Does anyone here use 8k display for work? Does it make sense over 4k?

I was always wondering where that breaking point for cost/peformance is for displays. I use 4K 27” and it’s noticeably much better for text than 1440p@27 but no idea if the next/ and final stop is 6k or 8k?

1 more reply

_pdp_1mo ago

Longer version of the comment https://www.linkedin.com/pulse/imperceptible-upgrade-petko-d...

mgraczyk1mo ago

This will probably happen but I wouldn't plan on it happening soon

AlfeG1mo ago

At this point, I still don't see a reason to use Opus. I'm happy with Sonnet's performance for a third of the price. Tried several times with not a big gain.

naasking1mo ago

Diminishing returns are inevitable, agreed, but it's not clear we're near that point yet.

zadkey1mo ago

yeah there needs to be a corresponding increment improvement in model archetecture.

iLoveOncall1mo ago

> IMHO there is a point where incremental model quality will hit diminishing returns.

You mean a couple of years ago?

speedgoose1mo ago· 5 in thread

The "multiplier" on Github Copilot went from 3 to 7.5. Nice to see that it is actually only 20-30% and Microsoft wanting to lose money slightly slower.

https://docs.github.com/fr/copilot/reference/ai-models/suppo...

Someone12341mo ago

Yep, and I just made a recommendation that was essentially "never enable Opus 4.7" to my org as a direct result. We have Opus 4.6 (3x) and Opus 4.5 (3x) enabled currently. They are worth it for planning.

At 7.5x for 4.7, heck no. It isn't even clear it is an upgrade over Opus 4.6.

5 more replies

Aurornis1mo ago

This article is only about the tokenizer. It doesn't measure the number of tokens needed for each request, which could be higher or lower overall.

intuxikated1mo ago

And that is temporary pricing. Looking at 4.6 fast, I'm assuming this price will go up to 15 once the promo ends

anentropic1mo ago

oh wow, that is very telling!

aulin1mo ago

Opus 4.6 also just got dumber. It's dismissive, hand-wavy, jumps to conclusions way too quickly, skips reasoning... Bubble is going to burst, either some big breakthrough comes up or we are going to see a very fast enshittificafion.

namnnumbr1mo ago· 6 in thread

The title is a misdirection. The token counts may be higher, but the cost-per-task may not be for a given intelligence level. Need to wait to see Artificial Analysis' Intelligence Index run for this, or some other independent per-task cost analysis.

The final calculation assumes that Opus 4.7 uses the exact same trajectory + reasoning output as Opus 4.6. I have not verified, but I assume it not to be the case, given that Opus 4.7 on Low thinking is strictly better than Opus 4.6 on Medium, etc., etc.

alach111mo ago

I ran an internal (oil and gas focused) benchmark yesterday and found Opus 4.7 was 50% cheaper than Opus 4.6, driven by significantly fewer output tokens for reasoning. It also scored 80% (vs. 60%).

1 more reply

bisonbear1mo ago

yep, ran a controlled experiment on 28 tasks comparing old opus 4.6 vs new opus 4.6 vs 4.7, and found that 4.7 is comparable in cost to old 4.6, and ~20% more expensive then new 4.6 (because new 4.6 is thinking less)

https://www.stet.sh/blog/opus-4-7-zod

1 more reply

dang1mo ago

(Submitted title was "Claude Opus 4.7 costs 20–30% more per session". We've since changed it to a (more neutral) version of what the article's title says.)

1 more reply

aray07OP1mo ago

im running some experiments on this but based on what i have seen on my own personal data - I dont think this is true

"given that Opus 4.7 on Low thinking is strictly better than Opus 4.6 on Medium, etc., etc.”

Opus 4.7 in general is more expensive for similar usage. Now we can argue that is provides better performance all else being equal but I haven’t been able to see that

namnnumbr1mo ago

Following up on "strictly better" via plot in release announcement:

https://www.anthropic.com/_next/image?url=https%3A%2F%2Fwww-...

unpwn1mo ago

Very unlikely that the article is wrong. the 4.7 intelligence bump is not that big, plus most of the token spend is in inputs/tool calls etc, much of which won't change even with this bump.

1 more reply

_fat_santa1mo ago· 7 in thread

A question I've been asking alot lately (really since the release of GPT-5.3) is "do I really need the more powerful model"?

I think a big issue with the industry right now is it's constantly chasing higher performing models and that comes at the cost of everything else. What I would love to see in the next few years is all these frontier AI labs go from just trying to create the most powerful model at any cost to actually making the whole thing sustainable and focusing on efficiency.

The GPT-3 era was a taste of what the future could hold but those models were toys compare to what we have today. We saw real gains during the GPT-4 / Claude 3 era where they could start being used as tools but required quite a bit of oversight. Now in the GPT-5 / Claude 4 era I don't really think we need to go much further and start focusing on efficiency and sustainability.

What I would love the industry to start focusing on in the next few years is not on the high end but the low end. Focus on making the 0.5B - 1B parameter models better for specific tasks. I'm currently experimenting with fine-tuning 0.5B models for very specific tasks and long term I think that's the future of AI.

namnnumbr1mo ago

Yes! I'd be totally happy with today's sonnet 4.6 if I could run it locally.

If you can forgive the obviously-AI-generated writing, [CPUs Aren't Dead](https://seqpu.com/CPUsArentDead) makes an interesting point on AI progress: Google's latest, smallest Gemma model (Gemma 4 E2B), which can run on a cell phone, outperforms GPT-3.5-turbo. Granted, this factoid is based on `MT-Bench` performance, a benchmark from 2023 which I assume to be both fully saturated and leaked into the training data for modern LLMs. However, cross-referencing [Artificial Analysis' Intelligence Index](https://artificialanalysis.ai/models?models=gemma-4-e2b-non-...) suggests that indeed the latest 2B open-weights models are capable of matching or beating 175B models from 3-4 years ago. Perhaps more impressive, [Gemma 4 E4B matches or beats GPT-4o](https://artificialanalysis.ai/models?models=gemma-4-e4b%2Cge...) on many benchmarks.

If this trend continues, perhaps we'll have the capabilities of today's best models available to reasonably run on our laptops!

renticulous1mo ago

Does everyone need a graphing calculator? Does everyone need a scientific calculator? Does everyone need a normal calculator? Does everyone need GeoGebra or Desmos ?

minimaxir1mo ago

Many people were hoping that Sonnet 4.6 was "Opus 4.5 quality but with Sonnet speed/cost" but unfortunately that didn't pan out.

1 more reply

samuelknight1mo ago

The cost of intelligence is non-linear, with slightly dumber models costing much less. For a growing surface of problems you do not need frontier intelligence. You should use frontier intelligence for situations where you would otherwise require human intervention throughout the workflow, which is much more expensive than any model.

Bridged77561mo ago

Efficiency doesn't make as much money. It's in big LLM's best interest to keep inference computationally expensive.

I personally think the whole "the newest model is crazy! You've gotta use X (insert most expensive model)" Is just FOMO and marketing-prone people just parroting whatever they've seen in the news or online.

nprateem1mo ago

So you're happy with an untrustworthy lazy moron prone to stupid mistakes and guesswork?

Surely you can see the first lab that solves this gains a massive advantage?

fkealy1mo ago

I agree, and yet here i am using it... However, I think the industry IS going multiple directions all at once with smaller models, bigger models etc. I need to try out Google's latest models but alas what can one person do in the face of so many new models...

uberman1mo ago· 4 in thread

On actual code, I see what you see a 30% increase in tokens which is in-line with what they claim as well. I personally don't tend to feed technical documentation or random pros into llms.

Given that Opus 4.6 and even Sonnet 4.6 are still valid options, for me the question is not "Does 4.7 cost more than claimed?" but "What capabilities does 4.7 give me that 4.6 did not?"

Yesterday 4.6 was a great option and it is too soon for me to tell if 4.7 is a meaningful lift. If it is, then I can evaluate if the increased cost is justified.

tetha1mo ago

Yeah that was an interesting discovery in a development meeting. Many people were chasing after the next best model and everything, though for me, Sonnet 4.6 solves many topics in 1-2 rounds. I mainly need some focus on context, instructions and keeping tasks well-bounded. Keeping the task narrow also simplifies review and staying in control, since I usually get smaller diffs back I can understand quickly and manage or modify later.

I'll look at the new models, but increasing the token consumptions by a factor of 7 on copilot, and then running into all of these budget management topics people talk about? That seems to introduce even more flow-breakers into my workflow, and I don't think it'll be 7 times better. Maybe in some planning and architectural topics where I used Opus 4.6 before.

1 more reply

jstummbillig1mo ago

I don't understand how people measure how much more or less work they need to do. It's not that gpt-4o was incapable of exuding enormous amounts of code quickly, it's that the tokens were relativ garbage.

How do you have an opinion on 4.6/4.7 here? It's less clear but I could totally see that 4.7 or beyond leads to project completion 20% faster, by removing dead ends, foot guns, less backtracking, etc.

How to tell / measure effectively? No clue.

1 more reply

pier251mo ago

haven't people been complaining lately about 4.6 getting worse?

2 more replies

grim_io1mo ago

How long will they host 4.6? Maybe longer for enterprise, but if you have a consumer subscription, you won't have a choice for long, if at all anymore.

3 more replies

admiralrohan1mo ago· 2 in thread

In Kolkata, sweet sellers was struggling with cost management after covid due to increased prices of raw materials. But they couldn't increase the price any further without losing customers. So they reduced the size of sweets instead, and market slowly reduced expectations. And this is the new normal now.

Human psychology is surprisingly similar, and same pattern comes across domains.

hirako20001mo ago

It's not just in Kolkata, worldwide packs of biscuits etc remained the same size but less inside.

I didn't buy Springles chips in years, even the box now is nothing like it was. Thinner. Shorter. I imagine how far from the top the slices stack up.

steelbrain1mo ago

See also: Shrinkflation (https://en.wikipedia.org/wiki/Shrinkflation)

1 more reply

montjoy1mo ago· 1 in thread

It appears that they are testing using Max. For 4.7 Anthropic recognizes the high token usage of max and recommends the new xhigh mode for most cases. So I think the real question is whether 4.7 xhigh is “better” than 4.6 max.

> max: Max effort can deliver performance gains in some use cases, but may show diminishing returns from increased token usage. This setting can also sometimes be prone to overthinking. We recommend testing max effort for intelligence-demanding tasks.

> xhigh (new): Extra high effort is the best setting for most coding and agentic use cases

Ref: https://platform.claude.com/docs/en/build-with-claude/prompt...

dcrazy1mo ago

Inserting an xhigh tier and pushing max way out has very “these go to 11” vibes.

atonse1mo ago· 7 in thread

Just yesterday I was happy to have gotten my weekly limit reset [1]. And although I've been doing a lot of mockup work (so a lot of HTML getting written), I think the 1M token stuff is absolutely eating up tokens like CRAZY.

I'm already at 27% of my weekly limit in ONE DAY.

https://news.ycombinator.com/item?id=47799256

jabart1mo ago

I'm seeing the opposite. With Opus 4.7 and xhigh, I'm seeing less session usage , it's moving faster, and my weekly usage is not moving that much on a Team Pro account.

cbm-vic-201mo ago

Four day workweek!

richstokes1mo ago

My personal Claude sub (Pro), I can burn through my limit in a couple of hours when using Opus. It's borderline unusable unless you're willing to pay for extended usage or artificially slow yourself down.

1 more reply

aray07OP1mo ago

yeah similar for me - it uses a bunch more tokens and I haven’t been able to tell the ROI in terms of better instruction following

it seems to hallucinate a bit more (anecdotal)

1 more reply

CharlesW1mo ago

> I'm already at 27% of my weekly limit in ONE DAY.

Ouch, that's very different than experience. What effort level? Are you careful to avoid pushing session context use beyond 350k or so (assuming 1m context)?

2 more replies

sreekanth8501mo ago

Iam at 22%, just two task. A bug fixing and a Scalar integration.

AndyNemmity1mo ago

I'm at 35% :(

sipsi1mo ago

I tried to do my usual test (similar to pelican but a bit more complex) but it ran out of 5 hour limit in 5 minutes. Then after 5 hours I said "go on" and the results were the worst I've ever seen.

ericol1mo ago· 5 in thread

I did some work yesterday with Opus and found it amazing.

Today we are almost on non-speaking terms. I'm asking it to do some simple stuff and he's making incredible stupid mistakes:

    This is the third time that I have to ask you to remove the issue that was there for more than 20 hours. What is going on here?

and at the same time the compacting is firing like crazy. (What adds ~4 minute delays every 1 - 15 minutes)

  | # | Time     | Gap before | Session span | API calls |
  |---|----------|-----------|--------------|-----------|
  | 1 | 15:51:13 | 8s        | <1m          | 1         |
  | 2 | 15:54:35 | 48s       | 37m          | 51        |
  | 3 | 16:33:33 | 2s        | 19m          | 42        |
  | 4 | 16:53:44 | 1s        | 9m           | 30        |
  | 5 | 17:04:37 | 1s        | 17m          | 30        |
  # — sequential compaction event number, ordered by time.

  Time — timestamp of the first API call in the resumed session, i.e. when the new context (carrying the compaction summary) was first sent to the
  model.

  Gap before — time between the last API call of the prior session and the first call of this one. Includes any compaction processing time plus user
   think time between the two sessions.

  Session span — how long this compaction-resumed session ran, from its first API call to its last before the next compaction (or end of session).

  API calls — total number of API requests made during this resumed session. Each tool use, each reply, each intermediate step = one request.

Bottomline, I will probably stay on Sonnet until they fix all these issues.

losvedir1mo ago

> This is the third time that I have to ask you to remove the issue that was there for more than 20 hours. What is going on here?

I don't know if you're giving this as something you've actually given Claude, but I don't think it's a good way of using Claude.

It's not a collaborator who's having a bad day where a little empathy might make him feel better and realize his error. It's a token generator based on a prompt which includes all chat history. If you have three examples of the bad approach in the history, in a format that looks like Claude doing work, it will totally pollute it! And even worse with auto-compaction where you don't know exactly what of those false starts is getting summarized into its context.

You have to treat this like a tool and understand how it works.

If Claude is going down a wrong path it's better to cancel and rewind and improve the previous addition to the prompt. You don't want it to generate a bunch of misleading tokens for itself and leave it in the context window indefinitely!

2 more replies

aulin1mo ago

They won't. These are not "issues", it's them trying to push the models to burn less compute. It will only get worse.

2 more replies

cadamsdotcom1mo ago

> he’s making .. mistakes

Claude and other LLMs do not have a gender; they are not a “he”. Your LLM is a pile of weights, prompts, and a harness; anthropomorphising like this is getting in the way.

You’re experiencing what happens when you sample repeatedly from a distribution. Given enough samples the probability of an eventual bad session is 100%.

Just clear the context, roll back, and go again. This is part of the job.

3 more replies

solenoid09371mo ago

This is not how AI works man. Speaking condescendingly or sternly to it WILL result in worse output. Imagine if you spoke to an intern like that, would they make more or less mistakes after?

You should just revert the context and provide more detail and rationale in the message.

whalesalad1mo ago

I am having a shit experience lately. Opus 4.7, max effort.

> You're right, that was a shit explanation. Let me go look at what V1 MTBL actually is before I try again.

> Got it — I read the V1 code this time instead of guessing. Turns out my first take was wrong in an important way. Let me redo this in English.

:facepalm:

5 more replies

yuanzhi12031mo ago· 1 in thread

We noticed this two weeks ago where we found some of our requests are unexpected took more tokens than measured by count_tokens call. At the end they were Anthropic's A/B testing routing some Opus 4.6 calls to Opus 4.7.

https://matrix.dev/blog-2026-04-16.html (We were talking to Opus 4.7 twelve days ago)

ec1096851mo ago

Wonder what they do for their token cache if they swap mid-session like that.

1 more reply

jmward011mo ago

Claude code seems to be getting worse on several fronts and better on others. I suspect product is shifting from 'make it great' to 'make it make as much money for us as possible and that includes gathering data'.

Recently it started promoting me for feedback even though I am on API access and have disabled this. When I did a deep dive of their feedback mechanism in the past (months ago so probably changed a lot since then) the feedback prompt was pushing message ids even if you didn't respond. If you are on API usage and have told them no to training on your data then anything pushing a message id implies that it is leaking information about your session. It is hard to keep auditing them when they push so many changes so I am now 'default they are stealing my info' instead of believing their privacy/data use policy claims. Basically, my level of trust is eroding fast in their commitment to not training on me and I am paying a premium to not have that happen.

taosx1mo ago· 2 in thread

Claude seems so frustrating lately to the point where I avoid and completely ignore it. I can't identify a single cause but I believe it's mostly the self-righteousness and leadership that drive all the decisions that make me distrust and disengage with it.

QuercusMax1mo ago

What do you mean by this? What are you frustrated by?

You're offended by their political beliefs, so you don't like the way the model works?

estearum1mo ago

using dumber models to own the libs

1 more reply

margorczynski1mo ago· 3 in thread

It doesn't look good for Anthropic, especially considering they are burning billions in investor money.

Looks like they lost the mandate of heaven, if Open AI plays it right it might be their end. Add to that the open source models from China.

throwaway0412071mo ago

I work at a company that has gone all in on Anthropic, and we're just shoveling money at them. I suspect there are a more enterprises than we realize that are doing this.

When I read these comments on Hacker News, I see a lot of people miffed about their personal subscription limits. I think this is a viewpoint that is very consumer focused, and probably within Anthropic they're seeing buckets of money being dumped on them from enterprises. They probably don't really care as much about the individual subscription user, especially power users.

solenoid09371mo ago

1. HN is so unrepresentative of real life. You have people on their $20/$200 subscriptions complaining about usage limits. They are a tiny fraction of Anthropic's revenue. API billing and enterprise is where the money is.

2. Anthropic and OpenAI's financials are totally different. The former has nearly the same RRR and a fraction of the cash burn. There is a reason Anthropic is hot on secondary and OAI isn't

therobots9271mo ago

OpenAI is dealing with exactly the same energetic and financial constraints as Anthropic. That will become apparent soon.

qq661mo ago· 1 in thread

This is the backdoor way of raising prices... just inflate the token pricing. It's like ice cream companies shrinking the box instead of raising the price

Bridged77561mo ago

No, you're forgetting the never ending world shattering models being released every couple of months. Each one with 2X token costs of course, for a vague performance gain and that will deprecate the previous ones.

2 more replies

jmward011mo ago· 2 in thread

Yeah. I just did a day with 4.7 and I won't be going back for a while. It is just too expensive. On top of the tokenization the thinking seems like it is eating a lot more too.

JimmaDaRustla1mo ago

What was your level methodology and results? Can't just post "too expensive" and not explain how you went about it.

aray07OP1mo ago

yeah i am still not clear why there are 5 effort modes now on top of more expensive tokenization

2 more replies

iknowstuff1mo ago· 2 in thread

Interesting because I already felt like current models spit out too much garbage verbose code that a human would write in a far more terse, beautiful and grokable way

QuercusMax1mo ago

I had a case yesterday where Claude wrote me a series of if/elses in python. I asked it if it could use some newer constructs instead, and it told me that I was on a new enough python version that I could use match/case. Great!

And then it proceeded to rewrite the block with a dict lookup plus if-elses, instead of using match/case. I had to nag it to actually rewrite the code the way it said it would!

aray07OP1mo ago

yeah opus 4.7 feels a lot more verbose - i think they changed the system prompt and removed instructions to be terse in its responses

varispeed1mo ago

Don't forget that the model doesn't have an incentive to give right solution the first time. At least with Opus 4.6 after it got nerfed, it would go round in circles until you tell it to stop defrauding you and get to correct solution. That not always worked though. I found starting session again and again until less nerfed model was put on the request. Still all points to artificially make customer pay more.

sho1mo ago· 1 in thread

Taking the article's 5% accuracy improvement at face value: if true, then it's more than worth the token inflation IMO. That's because of tool call chains, where errors compound and accumulate, and small improvements in accuracy get greatly magnified.

Again, the article's numbers are likely a rather crude approximation, but taking 85% accuracy (claude 4.6) vs 90% (4.7) as inputs:

  4.6 1 iteration 85%
  4.7 1 iteration 90%
  4.6 5 iterations 44.37%
  4.7 5 iterations 59.85%
  4.6 10 iterations 19.69%
  4.7 10 iterations 34.87%

Compounded, small improvements really move the needle downstream. 1.4x doesn't seem worth it for 5% better, but 10 calls in, that's more than a 40% improvement.

rohansood151mo ago

You're assuming errors cannot be retried/recovered. They can.

1 more reply

noisy_boy1mo ago

At this point, as an experienced developer, unless they can promise consistent very high quality, which they can't, I would rather lean towards almost as good but faster. At this point, that compromise is Codex.

I would rather steer quickly, get ideas because I'm moving quickly, do course correction quickly - basically I'm not happy blocking my chain of thought/concentration and fall prey to distractions due to Claude's slowness and compaction cycles. Sometimes I don't even notice that Codex has compacted.

For architectural discussions, sure I'll pick Claude. I'm mentally prepared for that. But once we are in the thick of things, speed matters. I would they rather focus on improving Sonnet's speed.

Yukonv1mo ago

Some broad assumptions are being made that plans give you a precise equivalent to API cost. This is not the case with reverse engineering plan usage showing cached input is free [0]. If you re-run the math removing cached input the usage cost is ~5-34% more. Was the token plan budget increase [1] proportional to account for this? Can’t say with certainty. Those paying API costs though the price hike is real.

[0] https://she-llac.com/claude-limits

[1] https://xcancel.com/bcherny/status/2044839936235553167

technotony1mo ago· 1 in thread

Not only that but they seem to have cut my plan ability to use Sonnet too. I have a routine that used to use about 40% of my 5 hour max plan tokens, then since yesterday it gets stopped because it uses the whole 100%. Anyone else experience this?

mfro1mo ago

yeah it seems like sonnet 4.6 burns thru tokens crazy fast. I did one prompt, sonnet misunderstood it as 'generate an image of this' and used all of my free tokens.

therobots9271mo ago

As a regular listener of Ed Zitron this comes as absolutely no surprise. Once you understand the levels of obfuscation available to anthro / OAI you will realize that they have almost certainly hit a model plateau ~1 year ago. All benchmark improvements since have come at a high compute cost. And the model used when evaluating said benchmarks is not the same model you get with your subscription.

This is already becoming apparent as users are seeing quality degrade which implies that anthropic is dropping performance across the board to minimize financial losses.

adaptive_loop1mo ago· 2 in thread

Every time a new model comes out, I'm left guessing what it means for my token budget in order to sustain the quality of output I'm getting. And it varies unpredictably each time. Beyond token efficiency, we need benchmarks to measure model output quality per token consumed for a diverse set of multi-turn conversation scenarios. Measuring single exchanges is not just synthetic, it's unrealistic. Without good cost/quality trade-off measures, every model upgrade feels like a gamble.

therobots9271mo ago

That’s the joy of purchasing an intangible and non-deterministic product. The profit margin is completely within the vendor’s control and quality is hard for users to measure.

bityard1mo ago

The company I work for provides all engineering employees with a Claude subscription. My job isn't writing (much) code, and we have Copilot with MS Office, plus multiple internal AI tools on top of that. So I'm free to do low-stakes experiments on Claude without having to worry about hitting my monthly usage limit.

I am finding that for complex tasks, Claude's quality of output varies _tremendously_ with repeated runs of the same model and prompt. For example, last week I wrote up (with my own brain and keyboard) a somewhat detailed plain english spec of a work-related productivity app that I've always wanted but never had the time to write. It was roughly the length of an average college essay. The first thing I asked Claude to do was not write any code, but come up with a more formal design and implementation plan based on the requirements that I gave. The idea was to then hand _that_ to Claude and say, okay, now build it.

I used Opus 4.6 with High reasoning for all of this and did not change any model settings between runs.

The first run was overall _amazing_. It was detailed, well-written, contained everything that I asked for. The only drawback was that I was ambiguous on a couple of points which meant that the model went off and designed something in a way that I wasn't expecting and didn't intend. So I cleared that up in my prompt, and instead of keeping the context and building on what was already there, I started a new chat and had it start again from scratch.

What it wrote the second time was _far_ less impressive. The writing was terse, there was a lot less detail, the pretty dependency charts and various tables it made the first time were all gone. Lots of stuff was underspecified or outright missing.

New chat, start again. Similar results as the second run, maybe a bit worse. It also started _writing code_ which was something I told it NOT to do. At this point I'm starting to panic a little because I'm sure I didn't add, "oh, and make it crappy" to the prompt and I was a little angry about not saving the first iteration since it was fairly close to what I had wanted anyway.

I decided to try one last time and it finally gave me back something within about 95% of the first run in terms of quality, but with all the problems fixed. So, I was (finally) happy with that, and it used that to generate the application surprisingly well, with only a few issues that should not be too hard to fix after the fact.

So I guess 4th time was a charm, and the fare was about $7 in tokens to get there.

sysmax1mo ago· 2 in thread

Well, LLMs are priced per token, and most of the tokens are just echoing back the old code with minimal changes. So, a lot of the cost is actually paying for the LLM to echo back the same code.

Except, it's not that trivial to solve. I tried experimenting with asking the model to first give a list of symbols it will modify, and then just write the modified symbols. The results were OK, but less refined than when it echoes back the entire file.

The way I see it is that when you echo back the entire file, the process of thinking "should I do an edit here" is distributed over a longer span, so it has more room to make a good decision. Like instead of asking "which 2 of the 10 functions should you change" you're asking it "should you change method1? what about method2? what about method3?", etc., and that puts less pressure on the LLM.

Except, currently we are effectively paying for the LLM to make that decision for *every token*, which is terribly inefficient. So, there has to be some middle ground between expensively echoing back thousands of unchanged tokens and giving an error-ridden high-level summary. We just haven't found that middle ground yet.

mmastrac1mo ago

I think the ideal way for these LLMs to work will be using AST-level changes instead of "let me edit this file".

grit.io was working on this years ago, not sure if they are still alive/around, but I liked their approach (just had a very buggy transformer/language).

gruez1mo ago

>and most of the tokens are just echoing back the old code with minimal changes

I thought coding harnesses provided tools to apply diffs so the LLM didn't have to echo back the entire file?

1 more reply

khalic1mo ago

Just hit my quota with 20x for the first time today…

beej711mo ago· 8 in thread

News like this always makes me wonder about running my own model, something I've never done. A couple thousand bucks can get you some decent hardware, it looks like, but is it good for coding? What is your all's experience?

And if it's not good enough for coding, what kind of money, if any, would make it good enough?

arcanemachiner1mo ago

I want to give give you realistic expectations: Unless you spend well over $10K on hardware, you will be disappointed, and will spend a lot of time getting there. For sophisticated coding tasks, at least. (For simple agentic work, you can get workable results with a 3090 or two, or even a couple 3060 12GBs for half the price. But they're pretty dumb, and it's a tease. Hobby territory, lots of dicking around.)

Do yourself a favor: Set up OpenCode and OpenRouter, and try all the models you want to try there.

Other than the top performers (e.g. GLM 5.1, Kimi K2.5, where required hardware is basically unaffordable for a single person), the open models are more trouble than they're worth IMO, at least for now (in terms of actually Getting Shit Done).

1 more reply

bakugo1mo ago

You should be aware that any model you can run on less than $10k worth of hardware isn't going to be anywhere close to the best cloud models on any remotely complex task.

Many providers out there host open weights models for cheap, try them out and see what you think before actually investing in hardware to run your own.

mfro1mo ago

Not sure why all the other commentors are failing to mention you can spend considerably less money on an apple silicon machine to run decent local models.

Fun fact: AWS offers apple silicon EC2 instances you can spin up to test.

__mharrison__1mo ago

My anecdotal experience with a recent project (Python library implemented and released to pypi).

I took the plan that I used from Codex and handed it to opencode with Qwen 3.5 running locally.

It created a library very similar to Codex but took 2x longer.

I haven't tried Qwen 3.6 but I hear it's another improvement. I'm confident with my AI skills that if/when cheap/subsidized models go away, I'll be fine running locally.

efficax1mo ago

gemma4 and qwen3.6 are pretty capable but will be slower and wrong more often than the larger models. But you can connect gemma4 to opencode via ollama and it.. works! it really can write and analyze code. It's just slow. You need serious hardware to run these fast, and even then, they're too small to beat the "frontier" models right now. But it's early days

DeathArrow1mo ago

Unless you use H100 or 4x 5090 you won't get a decent output.

The best bang for the buck now is subcribing to token plans from Z.ai (GLM 5.1), MiniMax (MiniMax M2.7) or ALibaba Cloud (Qwen 3.6 Plus)

Running quantized models won't give you results comparable to Opus or GPT.

hleszek1mo ago

The latest Qwen3.6 model is very impressive for its size. Get an RTX 3090 and go to https://www.reddit.com/r/LocalLLaMA/ to see the latest news on how to run models locally. Totally fine for coding.

aray07OP1mo ago

i think the new qwen models are supposed to be good based on some the articles that i read

TomGarden1mo ago

Asked Opus 4.7 to extend an existing system today. After thorough exploration and a long back and forth on details it came up with a plan. Then proceeded to build a fully parallel, incompatible system from scratch with the changes I wanted but everything else incompatible and full of placeholders

encoderer1mo ago

In my “repo os” we have an adversarial agent harness running gpt5.4 for plan and implementation and opus4.6 for review. This was the clear winner in the bake-off when 5.4 came out a couple months ago.

Re-ran the bake-off with 4.7 authoring and… gpt5.4 still clearly winning. Same skills, same prompts, same agents.md.

jstummbillig1mo ago

"One session" is not a very interesting unit of work. What I am interested in is how much less work I am required to do, to get the results I want.

This is not so much about my instructions being followed more closely. It's the LLM being smarter about what's going on and for example saving me time on unnecessary expeditions. This is where models have been most notably been getting better to my experience. Understanding the bigger picture. Applying taste.

It's harder to measure, of course, but, at least for my coding needs, there is still a lot of room here.

If one session costs an additional 20% that's completely fine, if that session gets me 20% closer to a finished product (or: not 20% further away). Even 10% closer would probably still be entirely fine, given how cheap it is.

2001zhaozhao1mo ago

FYI: Anthropic increased people's subscription quotas to counteract the token cost change. In classic Anthropic fashion this is only announced via X post and not any official announcement.

However, if you are using API costs then I guess you're left holding the bag.

ndom911mo ago

`/model claude-opus-4-6`

curioussquirrel1mo ago· 1 in thread

Claude's tokenizers have actually been getting less efficient over the years (I think we're at the third iteration at the least since Sonnet 3.5). And if you prompt the LLM in a language other than English, or if your users prompt it or generate content in other languages, the costs go higher even more. And I mean hundreds of percent more for languages with complex scripts like Tamil or Japanese. If you're interested in the research we did comparing tokenizers of several SOTA models in multiple languages, just hit me up.

arcanemachiner1mo ago

I would encourage you to post a link here, and also to submit to HN if you haven't already. :)

2 more replies

rafram1mo ago

Pretty funny that this article was clearly written by Claude.

avereveard1mo ago

Well yeah it was disclosed here https://www.anthropic.com/_next/image?url=https%3A%2F%2Fwww-... high is the new xhigh

saltyoldman1mo ago

I was sort of hoping that the peak is something like $15 per hour of vibe help (yes I know some of you burn $15 in 12milliseconds), and that you can have last year's best or the current "nano/small" model at $1 per hour.

But it looks like it's just creeping up. Probably because we're paying for construction, not just inference right now.

redml1mo ago

It does cost more but I found the quality of output much higher. I prefer it over the dumbing of effort/models they were doing for the last two months. They have to get users used to picking the appropriate model for their task (or have an automatic mode - but still let me force it to a model).

bilekas1mo ago

> The model will not silently generalize an instruction from one item to another.

I am clearly missing something but wouldn't this be an ideal thing to do? Surely if it was optimised it would use less tokens while not losing anything from the instructions ?

JimmaDaRustla1mo ago

Am I dumb, or are they not explaining what level thinking they're using? We all read the Anthropic blog post yesterday - 4.7 max consumes/produces an incredible number of tokens and it's not equivalent to 4.6 max; xhigh is the new "max".

motbus31mo ago· 1 in thread

I've been using 4.6 models since each of them launched. Same for 4.5.

4.6 performers worse or the same in most of the tasks I have. If there is a parameter that made me use 4.6 more frequently is because 4.5 get dumber and not because 4.6 seemed smarter.

nwienert1mo ago

Agree on all counts, 4.5 was a monster, 4.6 a clear regression, and then 4.5 was dumbed down so I moved on.

epistasis1mo ago· 3 in thread

Anybody else having problem getting Opus 4.7 to write code? I had it pick up a month-old project, some small one off scripts that I want to modify, and it refused to even touch the code.

So far it costs a lot less, because I'm not going to be using it.

apelapan1mo ago

On the contrary, I threw a multi-threading optimization task on it, that 4.5 and 4.6 have been pretty useless at handling. 4.7 bested my hand-tuned solution by almost 2x on first attempt.

This was what I thought was my best moat as a senior dev. No other model has been able to come close to the throughput I could achieve on my own before. Might be a fluke of course, and they've picked up a few patterns in training that applies to this particular problem and doesn't generalize. We'll see.

1 more reply

GeoAtreides1mo ago

> it refused to even touch the code.

???

please i beg post the prompt and the refusal

I literally can not imagine a model refusing to do something

1 more reply

mrtesthah1mo ago

No, see, we have to leave writing code to fully identity-verified individuals working on behalf of only the largest institutions now because what if they decided to write malware?

2001zhaozhao1mo ago

To me, all of this seems to be pointing to the future solution being some sort of diffusion-based LLM that can process multiple tokens per pass, while keeping the benefits of more "verbose" token encoding.

DiscourseFan1mo ago

Yeah I noticed today, I had it work up a spreadsheet for me and I only got 3 or 4 turns in the conversation before it used up all my (pro) credits. It wasn't even super-complicated or anything, only moderately so.

kinnth1mo ago

It feels like a dedicated orchestration/planning agent needs to be much clearer on costs now as part of the tast plan. Multiple models used at different stages depending on the task.

blurbleblurble1mo ago

4.7 has been incredibly frustrating vs 4.6. Not sure what's going on but it keeps dropping stuff and getting stuck in weird side quests. Hope it gets fixed cause 4.6 was awesome.

zeronone1mo ago

> Only one instruction type moved materially: change_case:english_capital (0/1 → 1/1). Everything else tied.

So the new tokenizer costs for English/code is to support SHOUTING in English?

lacoolj1mo ago· 1 in thread

This is probably an adjacent result of this (from anthropic launch post):

> In Claude Code, we’ve raised the default effort level to xhigh for all plans.

Try changing your effort level and see what results you get

aray07OP1mo ago

effort level is separate from tokenization. Tokenization impacts you the same regardless.

I find 5 thinking levels to be super confusing - I dont really get why they went from 3 -> 5

e1ghtSpace1mo ago· 1 in thread

Do they ever make AIs that are super rediculously expensive to run but get really good scores on tests, and aren't for consumers? Like drag racing for AI?

m00x1mo ago

Mythos is basically this

sarpdag1mo ago

Since the Opus 4.7 release. I hit my 5 hour window limit second time on claude code max plan, which never happened before. I am not happy for sure.

dionian1mo ago

I noticed it was compacting more aggressively which i actually like, because i was letting sessions get really long and using them uncached (parallel sessions)

aliljet1mo ago· 2 in thread

This is the reality I'm seeing too. Does this mean that the subscriptions (5x, 10x, 20x) are essentially reduced in token-count by 20-30%?

aray07OP1mo ago

yeah thats the part that is unclear to me as well - if our usage capacity is now going to run out faster.

1 more reply

cbg01mo ago

Boris said on Twitter that they've increased rate limits for everyone.

bugsense1mo ago

I would use a service like Straion.com to avoid the forths and back. It increases token consumption but I can get things right the first time.

memcoder1mo ago

depends if you're running Opus for everything vs tiering. my pipeline: Haiku 4.5 for ~70% of implementation, Sonnet 4 for one review step, Opus 4.5 only for planning and final synthesis

claude code on opus continuously = whole bill. different measurement.

haiku 4.5 is good enough for fanout. opus earns it on synthesis where you need long context + complex problem solving under constraints

kburman1mo ago

Anthropic must be loving it. It's free money.

markrogersjr1mo ago· 1 in thread

4.7 one-shot rate is at least 20-30% higher for me

ChicagoBoy111mo ago

How are you able to track this as you use it? A bit stumped atm

1 more reply

dallen331mo ago· 1 in thread

I'm still using Sonnet 4.6 with no issues.

risyachka1mo ago

How does this solve the issue? 4.6 will be disabled after one or more release like any other legacy model.

1 more reply

SpyCoder771mo ago

This begs the question: should we translate our prompts into CJK and translate the output back into English?

omega31mo ago· 1 in thread

Contrary to people here who feel the price increases, reduction of subscription limits etc are the result of the Anthropic models being more expensive to run than the API & subscription revenue they generate I have a theory that Anthropic has been in the enshittification & rent seeking phase for a while in which they will attempt to extract as much money out of existing users as possible.

Commercial inference providers serve Chinese models of comparable quality at 0.1x-0.25x. I think Anthropic realised that the game is up and they will not be able to hold the lead in quality forever so it's best to switch to value extraction whilst that lead is still somewhat there.

CharlesW1mo ago

> Commercial inference providers serve Chinese models of comparable quality…

"Comparable" is doing some heavy lifting there. Comparable to Anthropic models in 1H'25, maybe.

1 more reply

ardline1mo ago

This is the kind of thing that looks simple until you're three layers deep in edge cases.

outlore1mo ago

I can manage session cost effectively myself if forking and rewinds were first class features

stefan_1mo ago· 1 in thread

I don't know anything about tokens. Anthropic says Pro has "more usage*", Max has 5x or 20x "more usage*" than Pro. The link to "usage limits" says "determines how many messages you can send". Clearly no one is getting billed for tokens.

aray07OP1mo ago

anthropic’s pricing is all based on token usage

https://platform.claude.com/docs/en/about-claude/pricing

So if you are generating more tokens, you are eating up your usage faster

JohnMakin1mo ago

30% more token use, but even by their benchmarks, don't appear to have any real big successes there, and some regressions. What's the point? It doesn't do any better on the suite of obedience/compliance tests I've written for 4.6, and in some tests, got worse, despite their claim there it is better. Anecdotally, it was gobbling so many tokens on even the simplest queries I immediately shut it off and went back to 4.5.

Why release this?

ricardobeat1mo ago

I can’t stand reading this. One article. Many words. Not written by a human.

Feels like LLMs are devolving into having a single, instantly recognizable and predictable writing style.

clbrmbr1mo ago

How can they change the tokenizer without a wholesale pre-train?

Bingolotto1mo ago

Talked to Claude earlier today and Opus 4.7 cost up to 35% more.

CodingJeebus1mo ago· 3 in thread

The fundamental problem with these frontier model companies is that they're incentivized to create models that burn through more tokens, full stop. It's a tale as old as capitalism: you wake up every day and choose to deliver more value to your customers or your shareholders, you cannot do both simultaneously forever.

People love to throw around "this is the dumbest AI will ever be", but the corollary to that is "this is the most aligned the incentives between model providers and customers will ever be" because we're all just burning VC money for now.

NickC251mo ago

> but the corollary to that is "this is the most aligned the incentives between model providers and customers will ever be" because we're all just burning VC money for now.

Please say this louder for everyone to hear. We are still at the stage where it is best for Anthropic's product to be as consumer aligned (and cost-friendly) as possible. Anthropic is loosing a lot of money. Both of those things will not be true in the near future.

HarHarVeryFunny1mo ago

> The fundamental problem with these frontier model companies is that they're incentivized to create models that burn through more tokens

That's one market segment - the high priced one, but not necessarily the most profitable one. Ferrari's 2025 income was $2B while Toyota's was $30B.

Maybe a more apt comparison is Sun Microsystems vs the PC Clone market. Sun could get away with high prices until the PC Clones became so fast (coupled with the rise of Linux) that they ate Sun's market and Sun went out of business.

There may be a market for niche expensive LLMs specialized for certain markets, but I'll be amazed if the mass coding market doesn't become a commodity one with the winners being the low cost providers, either in terms of API/subscriptions costs, or licensing models for companies to run on their own (on-prem or cloud) servers.

BosunoB1mo ago

Their bigger incentive is to deliver the best product in the cheapest way, because there is tight competition with at least 2 other companies. I know we all love to hate on capitalism but it's actually functioning fine in this situation, and the token inflation is their attempt to provide a better product, not a worse one.

thibran1mo ago

For me there is no point in using Claude Opus 4.7, it's too expensive since it does not do 100% of the job. Since AI can anyway only do 90% of most tasks, I can use another model and do the remaining 15-30% myself.

rambojohnson1mo ago· 2 in thread

So intelligence has turned into a utility per Sam Altman et al., and now the same companies get to hike the price of accessing it by 20–30%, right as it’s becoming the backbone of how teams actually ship work. People are pushing out so much, so fast that last week’s output is already a blur. I’ve got colleagues who refuse to go back to writing any of this stuff by hand.

And now maintaining that pace means absorbing arbitrary price increases, shrugged off with “we were operating at a loss anyway.”

It stops being “pay to play” and starts looking more like pay just to stay in the ring, while enterprise players barely feel the hit and everyone else gets squeezed out.

Market maturing my butthole... it’s obviously a dependency being priced in real time. Tech is an utter shit show right now, compounded by the disaster of the unemployment market still reeling from the overhiring of 2020.

save up now and career pivot. pick up gardening.

wslh1mo ago

> So intelligence has turned into a utility.

"Utility" is close, but "energy source" may be closer. When it becomes the thing powering the pace of work itself, raising prices is less about charging for access and more about taxing dependency.

colechristensen1mo ago

Like every startup ever, they were selling it to you at a loss to compete for market share and are slowly increasing pricing. Duh.

1 more reply

chakintosh1mo ago

Yeah one PRD request of a small scope app cost me 70%

rbren1mo ago

Good reminder to choose model-agnostic tooling!

wartywhoa231mo ago

Seeing this big crowd of people trying to persuade themselves or others that the ever growing hole in their pockets is totally justified and beneficial is pretty hilarious!

Frannky1mo ago

Give it a try to opencode + mimo V2 pro...

Abderahmane1mo ago

Bonjour

AIrtemis1mo ago

here comes the rug-pull to justify the enterprise pricing...

olq_plo1mo ago

That blog post is full of AI slop. Repeats the same argument a gazillion times. It's not X, it's Y. Awful to read.

greatgib1mo ago

What annoys me the most with the proprietary side of Gemini and Claude is that you used to have the tokenizer (standard) and open sourced. So you could understand what was going on, how the model would understand/split the tokens. Now it is trade secret only usable through the api!

synergy201mo ago

that's what i feel, going to use codex more

tornikeo1mo ago

Good lord. Reading all these comments makes me feel so much better for dumping anthropic the first time their opus started becoming dumber (circa Month ago). It feels like most people in this thread are somehow bound to Claude, even though it is alread fully enshittfied.

1 more reply

bcjdjsndon1mo ago

Because those braniacs added 20-30% more system prompt

AIrtemis1mo ago

here comes the rug-pull

socratic_weeb1mo ago

This is good news. It means the bubble is popping. Bye bye VC subsidies...

mikert891mo ago

The compute is expensive, what is with this outrage? People just want free tools forever?

3 more replies

xd19361mo ago

And what about with Caveman[1]?

1. https://github.com/juliusbrussee/caveman

3 more replies

j / k navigate · click thread line to collapse

498 comments

204 comments · 85 top-level

louiereederson1mo ago· 13 in thread

I think the tendency for graphs of model assessment to display the log of cost/tokens on the x axis (i.e. Artificial Analysis' site) has obscured this dynamic.

louiereederson1mo ago

I meant reference Toby Ord's work here. I think his framing of the performance/cost frontier hasn't gotten enough attention https://www.tobyord.com/writing/hourly-costs-for-ai-agents

2 more replies

Aurornis1mo ago

9 more replies

snek_case1mo ago

They're also getting closer to IPO and have a growing user base. They can't justify losing a very large number of billions of other people's money in their IPO prospectus.

So there's a push for them to increase revenue per user, which brings us closer to the real cost of running these models.

3 more replies

ethin1mo ago

1 more reply

iainmerrick1mo ago

Is there an equivalent ultra-high-end LLM you can have if you’re willing to pay? Or does it not exist because it would cost too much to train?

2 more replies

svantana1mo ago

conductr1mo ago

amelius1mo ago

Once they implement their models directly in silicon, the cost will come down and the speed will go up. See Taalas.

1 more reply

Lihh271mo ago

heh adaptive thinking is letting the meter run itself. they make more when it runs longer.

atoav1mo ago

nl1mo ago

This is a bad take. It's not really wrong in the sense that yes higher performance does cost more.

But it ignores completely the fact that the same intelligence is dropping by an order of magnitude (at least) every 12 months.

GPT o1 launched at $600/M output tokens and GPT4.5 launched at $150/M.

Opus 4.7 is $25/M for more intelligence

jimiljojo1mo ago

What a well thought and written comment. I totally agree.

1 more reply

paulddraper1mo ago

> The fact that Anthropic is rapidly trying to increase price may betray the fact that their recent lead is at the cost of dramatically higher operating costs.

Or they are just not willing to burn obscene levels of capital like OpenAI.

1 more reply

tabbott1mo ago· 9 in thread

aenis1mo ago

That.

3 more replies

Ifkaluva1mo ago

$200 a month is not what the BigTechs are talking about.

They are talking about every IC becomes an EM, managing teams of agents.

Did you see the leak of Meta’s token consumption? That’s waaay more than you can get for a small $200 a month plan.

1 more reply

guelo1mo ago

hyraki1mo ago

Yes 200 as a business expense is really not that bad. But a hobby is hard to justify.

1 more reply

lnrd1mo ago

Also, I think the $200/mo plan is subsidized by VC money and is likely hemorrhaging money for Anthropic, so it's not really meaningful to reason around that.

HarHarVeryFunny1mo ago

It seems far from clear at this point what the dollar value of agentic coding tools is if measured objectively in terms of value delivered.

chis1mo ago

paulddraper1mo ago

Claude is far more than $200/month if you use their Enteprise plan.

The $200/month is an individual subscription.

vessenes1mo ago

_pdp_1mo ago· 17 in thread

IMHO there is a point where incremental model quality will hit diminishing returns.

It is like comparing an 8K display to a 16K display because at normal viewing distance, the difference is imperceptible, but 16K comes at significant premium.

The same applies to intelligence. Sure, some users might register a meaningful bump, but if 99% can't tell the difference in their day-to-day work, does it matter?

A 20-30% cost increase needs to deliver a proportional leap in perceivable value.

highfrequency1mo ago

1 more reply

ZeroCool2u1mo ago

Whenever we get the locally runnable 4k models things are going to get really awkward for the big 3 labs. Well at least Google will still have their ad revenue I guess.

3 more replies

levocardia1mo ago

1 more reply

snek_case1mo ago

It probably depends what you're using the models for. If you use them for web search, summarizing web pages, I can imagine there's a plateau and we're probably already hitting it.

1 more reply

simplyluke1mo ago

nisegami1mo ago

>IMHO there is a point where incremental model quality will hit diminishing returns.

aray07OP1mo ago

2 more replies

Rapzid1mo ago

At normal viewing distance(let's say cinema FOV) most people won't see a difference between 4k and 8k never mind 16k.

And it's not that they "don't notice" it's that they physically can't distinguish finer angular separation.

mlinsey1mo ago

1 more reply

jasonjmcghee1mo ago

It's more like, if it gets it right 99% of the time, that sounds incredible.

Until it's making 100k decisions a day and many are dependent on previous results.

wellthisisgreat1mo ago

Does anyone here use 8k display for work? Does it make sense over 4k?

1 more reply

_pdp_1mo ago

Longer version of the comment https://www.linkedin.com/pulse/imperceptible-upgrade-petko-d...

mgraczyk1mo ago

This will probably happen but I wouldn't plan on it happening soon

AlfeG1mo ago

At this point, I still don't see a reason to use Opus. I'm happy with Sonnet's performance for a third of the price. Tried several times with not a big gain.

naasking1mo ago

Diminishing returns are inevitable, agreed, but it's not clear we're near that point yet.

zadkey1mo ago

yeah there needs to be a corresponding increment improvement in model archetecture.

iLoveOncall1mo ago

> IMHO there is a point where incremental model quality will hit diminishing returns.

You mean a couple of years ago?

speedgoose1mo ago· 5 in thread

The "multiplier" on Github Copilot went from 3 to 7.5. Nice to see that it is actually only 20-30% and Microsoft wanting to lose money slightly slower.

https://docs.github.com/fr/copilot/reference/ai-models/suppo...

Someone12341mo ago

At 7.5x for 4.7, heck no. It isn't even clear it is an upgrade over Opus 4.6.

5 more replies

Aurornis1mo ago

This article is only about the tokenizer. It doesn't measure the number of tokens needed for each request, which could be higher or lower overall.

intuxikated1mo ago

And that is temporary pricing. Looking at 4.6 fast, I'm assuming this price will go up to 15 once the promo ends

anentropic1mo ago

oh wow, that is very telling!

aulin1mo ago

namnnumbr1mo ago· 6 in thread

alach111mo ago

I ran an internal (oil and gas focused) benchmark yesterday and found Opus 4.7 was 50% cheaper than Opus 4.6, driven by significantly fewer output tokens for reasoning. It also scored 80% (vs. 60%).

1 more reply

bisonbear1mo ago

https://www.stet.sh/blog/opus-4-7-zod

1 more reply

dang1mo ago

(Submitted title was "Claude Opus 4.7 costs 20–30% more per session". We've since changed it to a (more neutral) version of what the article's title says.)

1 more reply

aray07OP1mo ago

im running some experiments on this but based on what i have seen on my own personal data - I dont think this is true

"given that Opus 4.7 on Low thinking is strictly better than Opus 4.6 on Medium, etc., etc.”

Opus 4.7 in general is more expensive for similar usage. Now we can argue that is provides better performance all else being equal but I haven’t been able to see that

namnnumbr1mo ago

Following up on "strictly better" via plot in release announcement:

https://www.anthropic.com/_next/image?url=https%3A%2F%2Fwww-...

unpwn1mo ago

Very unlikely that the article is wrong. the 4.7 intelligence bump is not that big, plus most of the token spend is in inputs/tool calls etc, much of which won't change even with this bump.

1 more reply

_fat_santa1mo ago· 7 in thread

A question I've been asking alot lately (really since the release of GPT-5.3) is "do I really need the more powerful model"?

namnnumbr1mo ago

Yes! I'd be totally happy with today's sonnet 4.6 if I could run it locally.

If this trend continues, perhaps we'll have the capabilities of today's best models available to reasonably run on our laptops!

renticulous1mo ago

Does everyone need a graphing calculator? Does everyone need a scientific calculator? Does everyone need a normal calculator? Does everyone need GeoGebra or Desmos ?

minimaxir1mo ago

Many people were hoping that Sonnet 4.6 was "Opus 4.5 quality but with Sonnet speed/cost" but unfortunately that didn't pan out.

1 more reply

samuelknight1mo ago

Bridged77561mo ago

Efficiency doesn't make as much money. It's in big LLM's best interest to keep inference computationally expensive.

nprateem1mo ago

So you're happy with an untrustworthy lazy moron prone to stupid mistakes and guesswork?

Surely you can see the first lab that solves this gains a massive advantage?

fkealy1mo ago

uberman1mo ago· 4 in thread

On actual code, I see what you see a 30% increase in tokens which is in-line with what they claim as well. I personally don't tend to feed technical documentation or random pros into llms.

Given that Opus 4.6 and even Sonnet 4.6 are still valid options, for me the question is not "Does 4.7 cost more than claimed?" but "What capabilities does 4.7 give me that 4.6 did not?"

Yesterday 4.6 was a great option and it is too soon for me to tell if 4.7 is a meaningful lift. If it is, then I can evaluate if the increased cost is justified.

tetha1mo ago

1 more reply

jstummbillig1mo ago

How to tell / measure effectively? No clue.

1 more reply

pier251mo ago

haven't people been complaining lately about 4.6 getting worse?

2 more replies

grim_io1mo ago

How long will they host 4.6? Maybe longer for enterprise, but if you have a consumer subscription, you won't have a choice for long, if at all anymore.

3 more replies

admiralrohan1mo ago· 2 in thread

Human psychology is surprisingly similar, and same pattern comes across domains.

hirako20001mo ago

It's not just in Kolkata, worldwide packs of biscuits etc remained the same size but less inside.

I didn't buy Springles chips in years, even the box now is nothing like it was. Thinner. Shorter. I imagine how far from the top the slices stack up.

steelbrain1mo ago

See also: Shrinkflation (https://en.wikipedia.org/wiki/Shrinkflation)

1 more reply

montjoy1mo ago· 1 in thread

> xhigh (new): Extra high effort is the best setting for most coding and agentic use cases

Ref: https://platform.claude.com/docs/en/build-with-claude/prompt...

dcrazy1mo ago

Inserting an xhigh tier and pushing max way out has very “these go to 11” vibes.

atonse1mo ago· 7 in thread

I'm already at 27% of my weekly limit in ONE DAY.

https://news.ycombinator.com/item?id=47799256

jabart1mo ago

I'm seeing the opposite. With Opus 4.7 and xhigh, I'm seeing less session usage , it's moving faster, and my weekly usage is not moving that much on a Team Pro account.

cbm-vic-201mo ago

Four day workweek!

richstokes1mo ago

1 more reply

aray07OP1mo ago

yeah similar for me - it uses a bunch more tokens and I haven’t been able to tell the ROI in terms of better instruction following

it seems to hallucinate a bit more (anecdotal)

1 more reply

CharlesW1mo ago

> I'm already at 27% of my weekly limit in ONE DAY.

Ouch, that's very different than experience. What effort level? Are you careful to avoid pushing session context use beyond 350k or so (assuming 1m context)?

2 more replies

sreekanth8501mo ago

Iam at 22%, just two task. A bug fixing and a Scalar integration.

AndyNemmity1mo ago

I'm at 35% :(

sipsi1mo ago

I tried to do my usual test (similar to pelican but a bit more complex) but it ran out of 5 hour limit in 5 minutes. Then after 5 hours I said "go on" and the results were the worst I've ever seen.

ericol1mo ago· 5 in thread

I did some work yesterday with Opus and found it amazing.

Today we are almost on non-speaking terms. I'm asking it to do some simple stuff and he's making incredible stupid mistakes:

    This is the third time that I have to ask you to remove the issue that was there for more than 20 hours. What is going on here?

and at the same time the compacting is firing like crazy. (What adds ~4 minute delays every 1 - 15 minutes)

  | # | Time     | Gap before | Session span | API calls |
  |---|----------|-----------|--------------|-----------|
  | 1 | 15:51:13 | 8s        | <1m          | 1         |
  | 2 | 15:54:35 | 48s       | 37m          | 51        |
  | 3 | 16:33:33 | 2s        | 19m          | 42        |
  | 4 | 16:53:44 | 1s        | 9m           | 30        |
  | 5 | 17:04:37 | 1s        | 17m          | 30        |
  # — sequential compaction event number, ordered by time.

  Time — timestamp of the first API call in the resumed session, i.e. when the new context (carrying the compaction summary) was first sent to the
  model.

  Gap before — time between the last API call of the prior session and the first call of this one. Includes any compaction processing time plus user
   think time between the two sessions.

  Session span — how long this compaction-resumed session ran, from its first API call to its last before the next compaction (or end of session).

  API calls — total number of API requests made during this resumed session. Each tool use, each reply, each intermediate step = one request.

Bottomline, I will probably stay on Sonnet until they fix all these issues.

losvedir1mo ago

> This is the third time that I have to ask you to remove the issue that was there for more than 20 hours. What is going on here?

I don't know if you're giving this as something you've actually given Claude, but I don't think it's a good way of using Claude.

You have to treat this like a tool and understand how it works.

2 more replies

aulin1mo ago

They won't. These are not "issues", it's them trying to push the models to burn less compute. It will only get worse.

2 more replies

cadamsdotcom1mo ago

> he’s making .. mistakes

Claude and other LLMs do not have a gender; they are not a “he”. Your LLM is a pile of weights, prompts, and a harness; anthropomorphising like this is getting in the way.

You’re experiencing what happens when you sample repeatedly from a distribution. Given enough samples the probability of an eventual bad session is 100%.

Just clear the context, roll back, and go again. This is part of the job.

3 more replies

solenoid09371mo ago

This is not how AI works man. Speaking condescendingly or sternly to it WILL result in worse output. Imagine if you spoke to an intern like that, would they make more or less mistakes after?

You should just revert the context and provide more detail and rationale in the message.

whalesalad1mo ago

I am having a shit experience lately. Opus 4.7, max effort.

> You're right, that was a shit explanation. Let me go look at what V1 MTBL actually is before I try again.

> Got it — I read the V1 code this time instead of guessing. Turns out my first take was wrong in an important way. Let me redo this in English.

:facepalm:

5 more replies

yuanzhi12031mo ago· 1 in thread

https://matrix.dev/blog-2026-04-16.html (We were talking to Opus 4.7 twelve days ago)

ec1096851mo ago

Wonder what they do for their token cache if they swap mid-session like that.

1 more reply

jmward011mo ago

taosx1mo ago· 2 in thread

QuercusMax1mo ago

What do you mean by this? What are you frustrated by?

You're offended by their political beliefs, so you don't like the way the model works?

estearum1mo ago

using dumber models to own the libs

1 more reply

margorczynski1mo ago· 3 in thread

It doesn't look good for Anthropic, especially considering they are burning billions in investor money.

Looks like they lost the mandate of heaven, if Open AI plays it right it might be their end. Add to that the open source models from China.

throwaway0412071mo ago

I work at a company that has gone all in on Anthropic, and we're just shoveling money at them. I suspect there are a more enterprises than we realize that are doing this.

solenoid09371mo ago

2. Anthropic and OpenAI's financials are totally different. The former has nearly the same RRR and a fraction of the cash burn. There is a reason Anthropic is hot on secondary and OAI isn't

therobots9271mo ago

OpenAI is dealing with exactly the same energetic and financial constraints as Anthropic. That will become apparent soon.

qq661mo ago· 1 in thread

This is the backdoor way of raising prices... just inflate the token pricing. It's like ice cream companies shrinking the box instead of raising the price

Bridged77561mo ago

2 more replies

jmward011mo ago· 2 in thread

Yeah. I just did a day with 4.7 and I won't be going back for a while. It is just too expensive. On top of the tokenization the thinking seems like it is eating a lot more too.

JimmaDaRustla1mo ago

What was your level methodology and results? Can't just post "too expensive" and not explain how you went about it.

aray07OP1mo ago

yeah i am still not clear why there are 5 effort modes now on top of more expensive tokenization

2 more replies

iknowstuff1mo ago· 2 in thread

Interesting because I already felt like current models spit out too much garbage verbose code that a human would write in a far more terse, beautiful and grokable way

QuercusMax1mo ago

And then it proceeded to rewrite the block with a dict lookup plus if-elses, instead of using match/case. I had to nag it to actually rewrite the code the way it said it would!

aray07OP1mo ago

yeah opus 4.7 feels a lot more verbose - i think they changed the system prompt and removed instructions to be terse in its responses

varispeed1mo ago

sho1mo ago· 1 in thread

Again, the article's numbers are likely a rather crude approximation, but taking 85% accuracy (claude 4.6) vs 90% (4.7) as inputs:

  4.6 1 iteration 85%
  4.7 1 iteration 90%
  4.6 5 iterations 44.37%
  4.7 5 iterations 59.85%
  4.6 10 iterations 19.69%
  4.7 10 iterations 34.87%

Compounded, small improvements really move the needle downstream. 1.4x doesn't seem worth it for 5% better, but 10 calls in, that's more than a 40% improvement.

rohansood151mo ago

You're assuming errors cannot be retried/recovered. They can.

1 more reply

noisy_boy1mo ago

For architectural discussions, sure I'll pick Claude. I'm mentally prepared for that. But once we are in the thick of things, speed matters. I would they rather focus on improving Sonnet's speed.

Yukonv1mo ago

[0] https://she-llac.com/claude-limits

[1] https://xcancel.com/bcherny/status/2044839936235553167

technotony1mo ago· 1 in thread

mfro1mo ago

yeah it seems like sonnet 4.6 burns thru tokens crazy fast. I did one prompt, sonnet misunderstood it as 'generate an image of this' and used all of my free tokens.

therobots9271mo ago

This is already becoming apparent as users are seeing quality degrade which implies that anthropic is dropping performance across the board to minimize financial losses.

adaptive_loop1mo ago· 2 in thread

therobots9271mo ago

That’s the joy of purchasing an intangible and non-deterministic product. The profit margin is completely within the vendor’s control and quality is hard for users to measure.

bityard1mo ago

I used Opus 4.6 with High reasoning for all of this and did not change any model settings between runs.

So I guess 4th time was a charm, and the fare was about $7 in tokens to get there.

sysmax1mo ago· 2 in thread

Well, LLMs are priced per token, and most of the tokens are just echoing back the old code with minimal changes. So, a lot of the cost is actually paying for the LLM to echo back the same code.

mmastrac1mo ago

I think the ideal way for these LLMs to work will be using AST-level changes instead of "let me edit this file".

grit.io was working on this years ago, not sure if they are still alive/around, but I liked their approach (just had a very buggy transformer/language).

gruez1mo ago

>and most of the tokens are just echoing back the old code with minimal changes

I thought coding harnesses provided tools to apply diffs so the LLM didn't have to echo back the entire file?

1 more reply

khalic1mo ago

Just hit my quota with 20x for the first time today…

beej711mo ago· 8 in thread

And if it's not good enough for coding, what kind of money, if any, would make it good enough?

arcanemachiner1mo ago

Do yourself a favor: Set up OpenCode and OpenRouter, and try all the models you want to try there.

1 more reply

bakugo1mo ago

You should be aware that any model you can run on less than $10k worth of hardware isn't going to be anywhere close to the best cloud models on any remotely complex task.

Many providers out there host open weights models for cheap, try them out and see what you think before actually investing in hardware to run your own.

mfro1mo ago

Not sure why all the other commentors are failing to mention you can spend considerably less money on an apple silicon machine to run decent local models.

Fun fact: AWS offers apple silicon EC2 instances you can spin up to test.

__mharrison__1mo ago

My anecdotal experience with a recent project (Python library implemented and released to pypi).

I took the plan that I used from Codex and handed it to opencode with Qwen 3.5 running locally.

It created a library very similar to Codex but took 2x longer.

I haven't tried Qwen 3.6 but I hear it's another improvement. I'm confident with my AI skills that if/when cheap/subsidized models go away, I'll be fine running locally.

efficax1mo ago

DeathArrow1mo ago

Unless you use H100 or 4x 5090 you won't get a decent output.

The best bang for the buck now is subcribing to token plans from Z.ai (GLM 5.1), MiniMax (MiniMax M2.7) or ALibaba Cloud (Qwen 3.6 Plus)

Running quantized models won't give you results comparable to Opus or GPT.

hleszek1mo ago

The latest Qwen3.6 model is very impressive for its size. Get an RTX 3090 and go to https://www.reddit.com/r/LocalLLaMA/ to see the latest news on how to run models locally. Totally fine for coding.

aray07OP1mo ago

i think the new qwen models are supposed to be good based on some the articles that i read

TomGarden1mo ago

encoderer1mo ago

Re-ran the bake-off with 4.7 authoring and… gpt5.4 still clearly winning. Same skills, same prompts, same agents.md.

jstummbillig1mo ago

"One session" is not a very interesting unit of work. What I am interested in is how much less work I am required to do, to get the results I want.

It's harder to measure, of course, but, at least for my coding needs, there is still a lot of room here.

2001zhaozhao1mo ago

FYI: Anthropic increased people's subscription quotas to counteract the token cost change. In classic Anthropic fashion this is only announced via X post and not any official announcement.

However, if you are using API costs then I guess you're left holding the bag.

ndom911mo ago

`/model claude-opus-4-6`

curioussquirrel1mo ago· 1 in thread

arcanemachiner1mo ago

I would encourage you to post a link here, and also to submit to HN if you haven't already. :)

2 more replies

rafram1mo ago

Pretty funny that this article was clearly written by Claude.

avereveard1mo ago

Well yeah it was disclosed here https://www.anthropic.com/_next/image?url=https%3A%2F%2Fwww-... high is the new xhigh

saltyoldman1mo ago

But it looks like it's just creeping up. Probably because we're paying for construction, not just inference right now.

redml1mo ago

bilekas1mo ago

> The model will not silently generalize an instruction from one item to another.

I am clearly missing something but wouldn't this be an ideal thing to do? Surely if it was optimised it would use less tokens while not losing anything from the instructions ?

JimmaDaRustla1mo ago

motbus31mo ago· 1 in thread

I've been using 4.6 models since each of them launched. Same for 4.5.

4.6 performers worse or the same in most of the tasks I have. If there is a parameter that made me use 4.6 more frequently is because 4.5 get dumber and not because 4.6 seemed smarter.

nwienert1mo ago

Agree on all counts, 4.5 was a monster, 4.6 a clear regression, and then 4.5 was dumbed down so I moved on.

epistasis1mo ago· 3 in thread

Anybody else having problem getting Opus 4.7 to write code? I had it pick up a month-old project, some small one off scripts that I want to modify, and it refused to even touch the code.

So far it costs a lot less, because I'm not going to be using it.

apelapan1mo ago

On the contrary, I threw a multi-threading optimization task on it, that 4.5 and 4.6 have been pretty useless at handling. 4.7 bested my hand-tuned solution by almost 2x on first attempt.

1 more reply

GeoAtreides1mo ago

> it refused to even touch the code.

???

please i beg post the prompt and the refusal

I literally can not imagine a model refusing to do something

1 more reply

mrtesthah1mo ago

No, see, we have to leave writing code to fully identity-verified individuals working on behalf of only the largest institutions now because what if they decided to write malware?

2001zhaozhao1mo ago

DiscourseFan1mo ago

kinnth1mo ago

It feels like a dedicated orchestration/planning agent needs to be much clearer on costs now as part of the tast plan. Multiple models used at different stages depending on the task.

blurbleblurble1mo ago

4.7 has been incredibly frustrating vs 4.6. Not sure what's going on but it keeps dropping stuff and getting stuck in weird side quests. Hope it gets fixed cause 4.6 was awesome.

zeronone1mo ago

> Only one instruction type moved materially: change_case:english_capital (0/1 → 1/1). Everything else tied.

So the new tokenizer costs for English/code is to support SHOUTING in English?

lacoolj1mo ago· 1 in thread

This is probably an adjacent result of this (from anthropic launch post):

> In Claude Code, we’ve raised the default effort level to xhigh for all plans.

Try changing your effort level and see what results you get

aray07OP1mo ago

effort level is separate from tokenization. Tokenization impacts you the same regardless.

I find 5 thinking levels to be super confusing - I dont really get why they went from 3 -> 5

e1ghtSpace1mo ago· 1 in thread

Do they ever make AIs that are super rediculously expensive to run but get really good scores on tests, and aren't for consumers? Like drag racing for AI?

m00x1mo ago

Mythos is basically this

sarpdag1mo ago

Since the Opus 4.7 release. I hit my 5 hour window limit second time on claude code max plan, which never happened before. I am not happy for sure.

dionian1mo ago

I noticed it was compacting more aggressively which i actually like, because i was letting sessions get really long and using them uncached (parallel sessions)

aliljet1mo ago· 2 in thread

This is the reality I'm seeing too. Does this mean that the subscriptions (5x, 10x, 20x) are essentially reduced in token-count by 20-30%?

aray07OP1mo ago

yeah thats the part that is unclear to me as well - if our usage capacity is now going to run out faster.

1 more reply

cbg01mo ago

Boris said on Twitter that they've increased rate limits for everyone.

bugsense1mo ago

I would use a service like Straion.com to avoid the forths and back. It increases token consumption but I can get things right the first time.

memcoder1mo ago

depends if you're running Opus for everything vs tiering. my pipeline: Haiku 4.5 for ~70% of implementation, Sonnet 4 for one review step, Opus 4.5 only for planning and final synthesis

claude code on opus continuously = whole bill. different measurement.

haiku 4.5 is good enough for fanout. opus earns it on synthesis where you need long context + complex problem solving under constraints

kburman1mo ago

Anthropic must be loving it. It's free money.

markrogersjr1mo ago· 1 in thread

4.7 one-shot rate is at least 20-30% higher for me

ChicagoBoy111mo ago

How are you able to track this as you use it? A bit stumped atm

1 more reply

dallen331mo ago· 1 in thread

I'm still using Sonnet 4.6 with no issues.

risyachka1mo ago

How does this solve the issue? 4.6 will be disabled after one or more release like any other legacy model.

1 more reply

SpyCoder771mo ago

This begs the question: should we translate our prompts into CJK and translate the output back into English?

omega31mo ago· 1 in thread

CharlesW1mo ago

> Commercial inference providers serve Chinese models of comparable quality…

"Comparable" is doing some heavy lifting there. Comparable to Anthropic models in 1H'25, maybe.

1 more reply

ardline1mo ago

This is the kind of thing that looks simple until you're three layers deep in edge cases.

outlore1mo ago

I can manage session cost effectively myself if forking and rewinds were first class features

stefan_1mo ago· 1 in thread

aray07OP1mo ago

anthropic’s pricing is all based on token usage

https://platform.claude.com/docs/en/about-claude/pricing

So if you are generating more tokens, you are eating up your usage faster

JohnMakin1mo ago

Why release this?

ricardobeat1mo ago

I can’t stand reading this. One article. Many words. Not written by a human.

Feels like LLMs are devolving into having a single, instantly recognizable and predictable writing style.

clbrmbr1mo ago

How can they change the tokenizer without a wholesale pre-train?

Bingolotto1mo ago

Talked to Claude earlier today and Opus 4.7 cost up to 35% more.

CodingJeebus1mo ago· 3 in thread

NickC251mo ago

> but the corollary to that is "this is the most aligned the incentives between model providers and customers will ever be" because we're all just burning VC money for now.

HarHarVeryFunny1mo ago

> The fundamental problem with these frontier model companies is that they're incentivized to create models that burn through more tokens

That's one market segment - the high priced one, but not necessarily the most profitable one. Ferrari's 2025 income was $2B while Toyota's was $30B.

BosunoB1mo ago

thibran1mo ago

rambojohnson1mo ago· 2 in thread

And now maintaining that pace means absorbing arbitrary price increases, shrugged off with “we were operating at a loss anyway.”

It stops being “pay to play” and starts looking more like pay just to stay in the ring, while enterprise players barely feel the hit and everyone else gets squeezed out.

save up now and career pivot. pick up gardening.

wslh1mo ago

> So intelligence has turned into a utility.

"Utility" is close, but "energy source" may be closer. When it becomes the thing powering the pace of work itself, raising prices is less about charging for access and more about taxing dependency.

colechristensen1mo ago

Like every startup ever, they were selling it to you at a loss to compete for market share and are slowly increasing pricing. Duh.

1 more reply

chakintosh1mo ago

Yeah one PRD request of a small scope app cost me 70%

rbren1mo ago

Good reminder to choose model-agnostic tooling!

wartywhoa231mo ago

Seeing this big crowd of people trying to persuade themselves or others that the ever growing hole in their pockets is totally justified and beneficial is pretty hilarious!

Frannky1mo ago

Give it a try to opencode + mimo V2 pro...

Abderahmane1mo ago

Bonjour

AIrtemis1mo ago

here comes the rug-pull to justify the enterprise pricing...

olq_plo1mo ago

That blog post is full of AI slop. Repeats the same argument a gazillion times. It's not X, it's Y. Awful to read.

greatgib1mo ago

synergy201mo ago

that's what i feel, going to use codex more

tornikeo1mo ago