Claude Opus 4.5 (opens in new tab)

(anthropic.com)

1113 pointsadocomplete4mo ago506 comments

https://platform.claude.com/docs/en/about-claude/models/what...

506 comments

The burying of the lede here is insane. $5/$25 per MTok is a 3x price drop from Opus 4. At that price point, Opus stops being "the model you use for important things" and becomes actually viable for production workloads.

Also notable: they're claiming SOTA prompt injection resistance. The industry has largely given up on solving this problem through training alone, so if the numbers in the system card hold up under adversarial testing, that's legitimately significant for anyone deploying agents with tool access.

The "most aligned model" framing is doing a lot of heavy lifting though. Would love to see third-party red team results.

tekacs4mo ago

This is also super relevant for everyone who had ditched Claude Code due to limits:

> For Claude and Claude Code users with access to Opus 4.5, we’ve removed Opus-specific caps. For Max and Team Premium users, we’ve increased overall usage limits, meaning you’ll have roughly the same number of Opus tokens as you previously had with Sonnet. We’re updating usage limits to make sure you’re able to use Opus 4.5 for daily work.

tifik4mo ago

I like that for this brief moment we actually have a competitive market working in favor of consumers. I ditched my Claude subscription in favor of Gemini just last week. It won't be great when we enter the cartel equilibrium.

1 more reply

Aeolun4mo ago

It’s important to note that with the introduction of Sonnet 4.5 they absolutely cratered the limits, and the opus limits in specific, so this just sort of comes closer to the situation we were actually in before.

1 more reply

js4ever4mo ago

Interesting. I totally stopped using opus on my max subscription because it was eating 40% of my week quota in less than 2h

TrueDuality4mo ago

Now THAT is great news

1 more reply

throwaway-aws93mo ago

Thanks. I unsubscribed when I busted my weekly limit in a few hours on the Max 20x plan when I had to use Opus over Sonnet. It really feels like they were off by an order of magnitude at some point when limits were introduced.

brianjking4mo ago

They also reset limits today, which was also quite kind as I was already 11% into my weekly allocation.

astrange4mo ago

Just avoid using Claude Research, which I assume still instantly eats most of your token limits.

sqs4mo ago

What's super interesting is that Opus is cheaper all-in than Sonnet for many usage patterns.

Here are some early rough numbers from our own internal usage on the Amp team (avg cost $ per thread):

- Sonnet 4.5: $1.83

- Opus 4.5: $1.30 (earlier checkpoint last week was $1.55)

- Gemini 3 Pro: $1.21

Cost per token is not the right way to look at this. A bit more intelligence means mistakes (and wasted tokens) avoided.

localhost4mo ago

Totally agree with this. I have seen many cases where a dumber model gets trapped in a local minima and burns a ton of tokens to escape from it (sometimes unsuccessfully). In a toy example (30 minute agentic coding session - create a markdown -> html compiler using a subset of commonmark test suite to hill climb on), dumber models would cost $18 (at retail token prices) to complete the task. Smarter models would see the trap and take only $3 to complete the task. YMMV.

Much better to look at cost per task - and good to see some benchmarks reporting this now.

1 more reply

leo_e4mo ago

Hard agree. The hidden cost of 'cheap' models is the complexity of the retry logic you have to write around them.

If a cheaper model hallucinates halfway through a multi-step agent workflow, I burn more tokens on verification and error correction loops than if I just used the smart model upfront. 'Cost per successful task' is the only metric that matters in production.

andai4mo ago

Yeah, that's a great point.

ArtificialAnalysis has a "intelligence per token" metric on which all of Anthropic's models are outliers.

For some reason, they need way less output tokens than everyone else's models to pass the benchmarks.

(There are of course many issues with benchmarks, but I thought that was really interesting.)

tmaly4mo ago

what is the typical usage pattern that would result in these cost figures?

1 more reply

sharkjacobs4mo ago

3x price drop almost certainly means Opus 4.5 is a different and smaller base model than Opus 4.1, with more fine tuning to target the benchmarks.

I'll be curious to see how performance compares to Opus 4.1 on the kind of tasks and metrics they're not explicitly targeting, e.g. eqbench.com

nostrademons4mo ago

Why? They just closed a $13B funding round. Entirely possible that they're selling below-cost to gain marketshare; on their current usage the cloud computing costs shouldn't be too bad, while the benefits of showing continued growth on their frontier models is great. Hell, for all we know they may have priced Opus 4.1 above cost to show positive unit economics to investors, and then drop the price of Opus 4.5 to spur growth so their market position looks better at the next round of funding.

4 more replies

ACCount374mo ago

Probably more sparse (MoE) than Opus 4.1. Which isn't a performance killer by itself, but is a major concern. Easy to get it wrong.

cootsnuck4mo ago

We already know distillation works pretty well. So definitely would make sense Opus 4.5 is effectively smaller (like someone else said, could be via MoE or some other technique too).

We know the big labs are chasing efficiency cans where they can.

adgjlsfhk14mo ago

It seems plausible that it's a similar size model and that the 3x drop is just additional hardware efficiency/lowered margin.

2 more replies

losvedir4mo ago

I almost scrolled past the "Safety" section, because in the past it always seemed sort of silly sci-fi scaremongering (IMO) or things that I would classify as "sharp tool dangerous in the wrong hands". But I'm glad I stopped, because it actually talked about real, practical issues like the prompt injections that you mention. I wonder if the industry term "safety" is pivoting to refer to other things now.

shepherdjerred4mo ago

I thought AI safety was dumb/unimportant until I saw this dataset of dangerous prompts: https://github.com/mlcommons/ailuminate/blob/main/airr_offic...

I don't love the idea of knowledge being restricted... but I also think these tools could result in harm to others in the wrong hands

7 more replies

wkat42424mo ago

Jailbreaking is trivial though. If anything really bad could happen it would have happened already.

And the prudeness of American models in particular is awful. They're really hard to use in Europe because they keep closing up on what we consider normal.

NiloCK4mo ago

Waymos, LLMs, brain computer interfaces, dictation and tts, humanoid robots that are worth a damn.

Ye best start believing in silly sci-fi stories. Yer in one.

narrator4mo ago

Pliney the Liberator jailbroke it in no time. Not sure if this applies to prompt injection:

https://x.com/elder_plinius/status/1993089311995314564

cmrdporcupine4mo ago

Note the comment when you start claude code:

"To give you room to try out our new model, we've updated usage limits for Claude Code users."

That really implies non-permanence.

Xlr8head4mo ago

Still better than perma-nonce.

AtNightWeCode4mo ago

The cost of tokens in the docs is pretty much a worthless metric for these models. Only way to go is to plug it in and test it. My experience is that Claude is an expert at wasting tokens on nonsense. Easily 5x up on output tokens comparing to ChatGPT and then consider that Claude waste about 2-3x of tokens more by default.

windexh8er4mo ago

This is spot on. The amount of wasteful output tokens from Claude is crazy. The actual output you're looking for might be better, but you're definitely going to pay for it in the long run.

The other angle here is that it's very easy to waste a ton of time and tokens with cheap models. Or you can more slowly dig yourself a hole with the SOTA models. But either way, and even with 1M tokens of context - things spiral at some point. It's just a question of whether you can get off the tracks with a working widget. It's always frustrating to know that "resetting" the environment is just handing over some free tokens to [model-provider-here] to recontextualize itself. I feel like it's the ultimate Office Space hack, likely unintentional, but really helps drive home the point of how unreliable all these offerings are.

1 more reply

Scene_Cast24mo ago

Still way pricier (>2x) than Gemini 3 and Grok 4. I've noticed that the latter two also perform better than Opus 4, so I've stopped using Opus.

pants24mo ago

Don't be so sure - while I haven't tested Opus 4.5 yet, Gemini 3 tends to use way more tokens than Sonnet 4.5. Like 5-10X more. So Gemini might end up being more expensive in practice.

2 more replies

wolttam4mo ago

It's 1/3 the old price ($15/$75)

brookst4mo ago

Not sure if that’s a joke about LLM math performance, but pedantry requires me to point out 15 / 75 = 1/5

4 more replies

llamasushi4mo ago

Just updated, thanks

tom_m4mo ago

It was already viable pricing before. You have to remember this is for business use. Many companies will pay 20% on top of an engineer's salary to have them be 200% as effective. Right?

I am truthfully surprised they dropped pricing. They don't really need to. The demand is quite high. This is all pretty much gatekeeping too (with the high pricing, across all providers). AI for coding can be expensive and companies want it to be because money is their edge. Funny because this is the same for the AI providers too. He who had the most GPUs, right?

resonious4mo ago

Just on Claude Code, I didn't notice any performance difference from Sonnet 4.5 but if it's cheaper then that's pretty big! And it kinda confuses the original idea that Sonnet is the well rounded middle option and Opus is the sophisticated high end option.

jstummbillig4mo ago

It does, but it also maps to the human world: Tokens/Time cost money. If either is well spent, then you save money. Thus, paying an expert ends up costing less than hiring a novice, who might cost less per hour, but takes more hours to complete the task, if they can do it at all.

It's both kinda neat and irritating, how many parallels there are between this AI paradigm and what we do.

burgerone4mo ago

Using AI in production is no doubt an enormous security risk...

laterium4mo ago

Where's the argument? Or we're just asserting things?

delaminator4mo ago

Not all production processes untrusted input.

irthomasthomas4mo ago

It's about double the speed of 4.1, too. ~60t/s vs ~30t/s. I wish it where openweights so we could discuss the architectural changes.

RestartKernel4mo ago

> [...] that's legitimately significant for anyone deploying agents with tool access.

I disagree, even if only because your model shouldn't have more access than any other front-end.

antihero4mo ago

Also it's really really good. Scarily good tbh. It's making PRs that work and aren't slop-filled and it figures out problems and traces through things in a way a competent engineer would rather than just fucking about.

consumer4514mo ago

> Claude Opus 4.5 in Windsurf for 2x credits (instead of 20x for Opus 4.1)

https://old.reddit.com/r/windsurf/comments/1p5qcus/claude_op...

At the risk of sounding like a shill, in my personal experience, Windsurf is somehow still the best deal for an agentic VSCode fork.

zwnow4mo ago

Why do all these comments sound like a sales pitch? Everytime some new bullshit model is released there are hundreds of comments like this one, pointing out 2 features talking about how huge all of this is. It isn't.

unsupp0rted4mo ago

This is gonna be game-changing for the next 2-4 weeks before they nerf the model.

Then for the next 2-3 months people complaining about the degradation will be labeled “skill issue”.

Then a sacrificial Anthropic engineer will “discover” a couple obscure bugs that “in some cases” might have lead to less than optimal performance. Still largely a user skill issue though.

Then a couple months later they’ll release Opus 4.7 and go through the cycle again.

My allegiance to these companies is now measured in nerf cycles.

I’m a nerf cycle customer.

lukev4mo ago

There are two possible explanations for this behavior: the model nerf is real, or there's a perceptual/psychological shift.

However, benchmarks exist. And I haven't seen any empirical evidence that the performance of a given model version grows worse over time on benchmarks (in general.)

Therefore, some combination of two things are true:

1. The nerf is psychologial, not actual. 2. The nerf is real but in a way that is perceptual to humans, but not benchmarks.

#1 seems more plausible to me a priori, but if you aren't inclined to believe that, you should be positively intrigued by #2, since it points towards a powerful paradigm shift of how we think about the capabilities of LLMs in general... it would mean there is an "x-factor" that we're entirely unable to capture in any benchmark to date.

davidsainez4mo ago

There are well documented cases of performance degradation: https://www.anthropic.com/engineering/a-postmortem-of-three-....

The real issue is that there is no reliable system currently in place for the end user (other than being willing to burn the cash and run your own benchmarks regularly) to detect changes in performance.

It feels to me like a perfect storm. A combination of high cost of inference, extreme competition, and the statistical nature of LLMs make it very tempting for a provider to tune their infrastructure in order to squeeze more volume from their hardware. I don't mean to imply bad faith actors: things are moving at breakneck speed and people are trying anything that sticks. But the problem persists, people are building on systems that are in constant flux (for better or for worse).

1 more reply

jaggs4mo ago

https://www.youtube.com/watch?v=DtePicx_kFY

"There's something still not quite right with the current technology. I think the phrase that's becoming popular is 'jagged intelligence'. The fact that you can ask an LLM something and they can solve literally a PhD level problem, and then in the next sentence they can say something so clearly, obviously wrong that it's jarring. And I think this is probably a reflection of something fundamentally wrong with the current architectures as amazing as they are."

Llion Jones, co-inventor of transformers architecture

1 more reply

data-ottawa4mo ago

The previous “nerf” was actually several bugs that dramatically decreased performance for weeks.

I do suspect continued fine tuning lowers quality — stuff they roll out for safety/jailbreak prevention. Those should in theory buildup over time with their fine tune dataset, but each model will have its own flaws that need tuning out.

I do also suspect there’s a bit of mental adjustment that goes in too.

blurbleblurble4mo ago

I'm pretty sure this isn't happening with the API versions as much as with the "pro plan" (loss leader priced) routers. I imagine that there are others like me working on hard problems for long periods with the model setting pegged to high. Why wouldn't the companies throttle us?

It could even just be that they just apply simple rate limits and that this degrades the effectiveness of the feedback loop between the person and the model. If I have to wait 20 minutes for GPT-5.1-codex-max medium to look at `git diff` and give a paltry and inaccurate summary (yes this is where things are at for me right now, all this week) it's not going to be productive.

1 more reply

imiric4mo ago

Or, 2b: the nerf is real, but benchmarks are gamed and models are trained to excel at them, yet fall flat in real world situations.

1 more reply

camdenreslink4mo ago

As a personal anecdote, I had a fairly involved application that built up a context with a lot of custom prompting and created a ~1000 word output. I could run my application over and over again to inspect the results. It was fairly reproducible.

I was having really nice results with the o4-mini model with high thinking. A little while after GPT-5 came out I revisited my application and tried to continue. The o4-mini results were unusable, while the GPT-5 results were similar to what I had before. I'm not sure what happened to the model in those ~4-5 months I set it down, but there was real degradation.

parineum4mo ago

Is there a reason not to think that, when "refining" the models they're using the benchmarks as the measure and it shows no fidelity loss but in some unbenchmarked ways, the performance is worse. "Once a measure becomes a target, it's no longer a useful measure."

That's case #2 for you but I think the explanation I've proposed is pretty likely.

conception4mo ago

The only time Ive seen benchmark nerfing is I saw one see a drop in performance between 2.5 march preview and release.

csomar4mo ago

They are nerfed and there is actually a very simple test to prove otherwise: 0 temperature. This is only allowed with the API where you are billed full token prices.

Conclusion: It is nerfed unless Claude can prove otherwise.

1 more reply

teruakohatu4mo ago

> 1. The nerf is psychologial, not actual. 2. The nerf is real but in a way that is perceptual to humans, but not benchmarks.

They could publish weekly benchmarks. To disprove. They almost certainly have internal benchmarking.

The shift is certainly real. It might not be model performance but contextual changes or token performance (tasks take longer even if the model stays the same).

1 more reply

yawnxyz4mo ago

moving onto new hardware + caching + optimizations might actually change the output slightly; it'll still pass evals all the same but on the edges it just "feels weird" - and that's what makes it feel like it's nerfed

zsoltkacsandi4mo ago

> The nerf is psychologial, not actual

Once I tested this, I gave the same task for a model after the release and a couple weeks later. In the first attempt it produced a well-written code that worked beautifully, I started to worry about the jobs of the software engineers. Second attempt was a nightmare, like a butcher acting as a junior developer performing a surgery on a horse.

Is this empirical evidence?

And this is not only my experience.

Calling this phychological is gaslighting.

5 more replies

film424mo ago

This is why I migrated my apps that need an LLM to Gemini. No model degradation so far all through the v2.5 model generation. What is Anthropic doing? Swapping for a quantized version of the model?

TIPSIO4mo ago

Hilarious sarcastic comment but actually true sentiment.

For all we know this is just the Opus 4.0 re-released

nullbio4mo ago

You're forgetting the step where they write a nefarious paper for their marketing team about the "world-ending dangers" of the capabilities they've discovered within their new model, and push it out to their web of media companies who make bank from the ad-revenue from clicks on their doomsday articles while furthering the regulatory capture goals of the hypocritically Palantir-partnered Anthropic.

arresin4mo ago

And then Dario gives an interview on why open source models should be banned due to _____.

1 more reply

yesco4mo ago

With Claude specifically I've grown confident they have been sneakily experimenting with context compression to save money and doing a very bad job at it. However for this same reason one shot batch usage or one off questions & answers that don't depend on larger context windows don't seem to see this degradation.

unshavedyak4mo ago

They added a "How is claude doing?" rating a while back which backs this statement up imo. Tons of A/B tests going on i bet.

1 more reply

all24mo ago

Interestingly, I canceled my Claude subscription. I've paid through the first week of December, so it dries up on the 7th of December. As soon as I had canceled, Claude Code started performing substantially better. I gave it a design spec (a very loose design spec) and it one-shotted it. I'll grant that it was a collection of docker containers and a web API, but still. I've not seen that level of performance from Claude before, and I'm thinking I'll have to move to 'pay as you go' (pay --> cancel immediately) just to take advantage of this increased performance.

blinding-streak4mo ago

That's really interesting. After cancelling, it goes into retention mode, akin to when one cancels other online services? For example, I cancelled Peacock the other day and it offered a deal of $1.99/mo for 6 months if I stayed.

Very intriguing, curious if others have seen this.

1 more reply

F7F7F74mo ago

Couldn’t have said it better myself. I’ve cancelled my x20 two times now and they keep pulling me back.

hacb4mo ago

it will be just enough time to finish my quarter roadmap and chill until january

vagab0nd4mo ago

I did not know this but it's consistent with the behaviors of the CEO.

idonotknowwhy4mo ago

100%. They've been nerfing the model periodically since at least Sonnet 3.5, but this time it's so bad I ended up swapping out to GLM4.6 just to finish off a simple feature.

Capricorn24814mo ago

Thank god people are noticing this. I'm pretty sick of companies putting a higher number next to models and programmers taking that at face value.

This reminds me of audio production debates about niche hardware emulations, like which company emulated the 1176 compressor the best. The differences between them all are so minute and insignificant, eventually people just insist they can "feel" the difference. Basically, whoever is placeboing the hardest.

Such is the case with LLMs. A tool that is already hard to measure because it gives different output with the same repeated input, and now people try to do A/B tests with models that are basically the same. The field has definitely made strides in how small models can be, but I've noticed very little improvement since gpt-4.

fHr4mo ago

haha couldn't have put this better, exactly this

WNWceAJ9R9Ezc44mo ago

Accurate.

stingraycharles4mo ago

I’m disappointed that this type of discourse has now entered HN. I expected a more evidence-based less “nerf cycle” discussion over here.

Etheryte4mo ago

This is nothing new and it's been discussed numerous times. Would you also say we need more evidence that Meta is tracking people?

1 more reply

blurbleblurble4mo ago

I fully agree that this is what's happening. I'm quite convinced after about a year of using all these tools via the "pro" plans that all these companies are throttling their models in sophisticated ways that have a poorly understood but significant impact on quality and consistency.

Gpt-5.1-* are fully nerfed for me at the moment. Maybe they're giving others the real juice but they're not giving it to me. Gpt-5-* gave me quite good results 2 weeks ago, now I'm just getting incoherent crap at 20 minute intervals.

Maybe I should just start paying via tokens for a hopefully more consistent experience.

throwuxiytayq4mo ago

y’all hallucinating harder than GPT2 on DMT

2 more replies

827a4mo ago

I've played around with Gemini 3 Pro in Cursor, and honestly: I find it to be significantly worse than Sonnet 4.5. I've also had some problems that only Claude Code has been able to really solve; Sonnet 4.5 in there consistently performs better than Sonnet 4.5 anywhere else.

I think Anthropic is making the right decisions with their models. Given that software engineering is probably one of the very few domains of AI usage that is driving real, serious revenue: I have far better feelings about Anthropic going into 2026 than any other foundation model. Excited to put Opus 4.5 through its paces.

mritchie7124mo ago

> only Claude Code has been able to really solve; Sonnet 4.5 in there consistently performs better than Sonnet 4.5 anywhere else.

I think part of it is this[0] and I expect it will become more of a problem.

Claude models have built-in tools (e.g. `str_replace_editor`) which they've been trained to use. These tools don't exist in Cursor, but claude really wants to use them.

0 - https://x.com/thisritchie/status/1944038132665454841?s=20

bgrainger4mo ago

This feels like a dumb question, but why doesn't Cursor implement that tool?

I built my own simple coding agent six months ago, and I implemented str_replace_based_edit_tool (https://platform.claude.com/docs/en/agents-and-tools/tool-us...) for Claude to use; it wasn't hard to do.

3 more replies

HugoDias4mo ago

TIL! I'll finally give Claude Code a try. I've been using Cursor since it launched and never tried anything else. The terminal UI didn't appeal to me, but knowing it has better performance, I'll check it out.

Cursor has been a terrible experience lately, regardless of the model. Sometimes for the same task, I need to try with Sonnet 4.5, ChatGPT 5.1 Codex, Gemini Pro 3... and most times, none managed to do the work, and I end up doing it myself.

At least I’m coding more again, lol

4 more replies

vunderba4mo ago

My workflow was usually to use Gemini 2.5 Pro (now 3.0) for high-level architecture and design. Then I would take the finished "spec" and have Sonnet 4.5 perform the actual implementation.

nevir4mo ago

Same here. Gemini really excels at all the "softer" parts of the development process (which, TBH, feels like most of the work). And Claude kicks ass at the actual code authoring.

It's a really nice workflow.

config_yml4mo ago

I use plan mode in claude code, then use gpt-5 in codex to review the plan and identify gaps and feed it back to claude. Results are amazing.

3 more replies

SkyPuncher4mo ago

This is how I do it. Though, I've been using Composer as my main driver more an more.

* Composer - Line-by-Line changes * Sonnet 4.5 - Task planning and small-to-medium feature architecture. Pass it off to Composer for code * Gemini Pro - Large and XL architecture work. Pass it off to Sonnet to breakdown into tasks.

vessenes4mo ago

I like this plan, too - gemini's recent series have long seemed to have the best large context awareness vs competing frontier models - anecdotally, although much slower, I think gpt-5's architecture plans are slightly better.

jeswin4mo ago

Same here. But with GPT 5.1 instead of Gemini.

UltraSane4mo ago

I've done this and it seems to work well. I ask Gemini to generate a prompt for Claude Code to accomplish X

joewhale4mo ago

What specific output would you ask Gemini to create for Sonnet? Thanks in advance!

lvl1554mo ago

I really don’t understand the hype around Gemini. Opus/Sonnet/GPT are much better for agentic workflows. Seems people get hyped for the first few days. It also has a lot to do with Claude code and Codex.

int_19h4mo ago

Gemini is a lot more bang for the buck. It's not just cheaper per token, but with the subscription, you also get e.g. a lot more Deep Research calls (IIRC it's something like 20 per day) compared to Anthropic offerings.

Also, Gemini has that huge context window, which depending on the task can be a big boon.

1 more reply

egeozcan4mo ago

I'm completely the opposite. I find Gemini (even 2.5 Pro) much, much better than anything else. But I hate agentic flows, I upload the full context to it in aistudio and then it shines - anything agentic cannot even come close.

2 more replies

jdgoesmarching4mo ago

Personally my hype is for the price, especially for Flash. Before Sonnet 4.5 was competitive with Gemini 2.5 Pro, the latter was a much better value than Opus 4.1.

thousand_nights4mo ago

with gemini you have to spend 30 minutes deleting hundreds of useless comments littered in the code that just describe what the code itself does

2 more replies

chinathrow4mo ago

I gave Sonnet 4.5 a base64 encoded PHP serialize() json of an object dump and told him to extraxt the URL within.

It gave me the Youtube-URL to Rick Astley.

arghwhat4mo ago

If you're asking an LLM to compute something "off the top of its head", you're using it wrong. Ask it to write the code to perform the computation and it'll do better.

Same with asking a person to solve something in their head vs. giving them an editor and a random python interpreter, or whatever it is normal people use to solve problems.

2 more replies

hu34mo ago

> I gave Sonnet 4.5 a base64 encoded PHP serialize() json of an object dump and told him to extraxt the URL within.

This is what I imagine the LLM usage of people who tell me AI isn't helpful.

It's like telling me airplanes aren't useful because you can't use them in McDonald's drive-through.

astrojams4mo ago

I find it hilarious that it rick rolled you. I wonder if that is an easter egg of some sort?

mikestorrent4mo ago

You should probably tell AI to write you programs to do tasks that programs are better at than minds.

stavros4mo ago

Don't use LLMs for a task a human can't do, they won't do it well.

1 more reply

idonotknowwhy4mo ago

Almost any modern LLM can do this, even GPT-OSS

gregable4mo ago

it. Not him.

2 more replies

emodendroket4mo ago

Yeah I think Sonnet is still the best in my experience but the limits are so stingy I find it hard to recommend for personal use.

visioninmyblood4mo ago

The model is great it is able to code up some interesting visual tasks(I guess they have pretty strong tool calling capapbilities). Like orchestrate prompt -> image generate -> Segmentation -> 3D reconstruction. Checkout the results here https://chat.vlm.run/c/3fcd6b33-266f-4796-9d10-cfc152e945b7. Note the model was only used to orchestrate the pipeline, the tasks are done by other models in an agentic framework. They much have improved tool calling framework with all the MCP usage. Gemini 3 was able to orchestrate the same but Claude 4.5 is much faster

consumer4514mo ago

I have a side-project prototype app that I tried to build on the Gemini 2.5 Pro API. I have not tried 3 yet, however the only improvements I would like to see is in Gemini's ability to:

1. Follow instructions consistently

2. API calls to not randomly result in "resource exhausted"

Can anyone share their experience with either of these issues?

I have built other projects accessing Azure GPT-4.1, Bedrock Sonnet 4, and even Perplexity, and those three were relatively rock solid compared to Gemini.

herrvogel-4mo ago

What you describe could also be the difference in the hallucination rate [0]. Opus 4.5 has the lead here and Gemini 3 Pro performs here quite bad compared to the other benchmarks.

[0] https://artificialanalysis.ai/?omniscience=omniscience-hallu...

rustystump4mo ago

Gemini 3 was awful when i gave it a spin. It was worse than cursor’s composer model.

Claude is still a go to but i have found that composer was “good enough” in practice.

arendtio3mo ago

I think the 'Agentic coding SWE-Bench Verified' [1] was actually the one benchmark where Google didn't even claim to beat Sonnet 4.5 ;-)

[1] https://deepmind.google/models/gemini/pro/

lossolo4mo ago

I've had problems solved incorrectly and edge cases missed by Sonnet and by other LLMs (ChatGPT, Gemini) and the other way around too. Once they saw the other model's answer, they admitted their "critical mistake". It's all about how much of your prompt/problem/context falls outside the model's training distribution.

lxgr4mo ago

> I've played around with Gemini 3 Pro in Cursor, and honestly: I find it to be significantly worse than Sonnet 4.5.

That's my experience too. It's weirdly bad at keeping track of its various output channels (internal scratchpad, user-visible "chain of thought", and code output), not only in Cursor but also on gemini.google.com.

1 more reply

verdverm4mo ago

> played around with

You'll never get an accurate comparison if you only play

We know by now that it takes time to "get to know a model and it's quirks"

So if you don't use a model and cannot get equivalent outputs to your daily driver, that's expected and uninteresting

827a4mo ago

I rotate models frequently enough that I doubt my personal access patterns are so model specific that they would unfairly advantage one model over another; so ultimately I think all you're saying is that Claude might be easier to use without model-specific skilling than other models. Which might be true.

I certainly don't have as much time on Gemini 3 as I do on Claude 4.5, but I'd say my time with the Gemini family as a whole is comparable. Maybe further use of Gemini 3 will cause me to change my mind.

1 more reply

rishabhaiover4mo ago

I suspect Cursor is not the right platform to write code on. IMO, humans are lazy and would never code on Cursor. They default to code generation via prompt which is sub-optimal.

viraptor4mo ago

> They default to writing code via prompt generation which is sub-optimal.

What do you mean?

1 more reply

Squarex4mo ago

I have heard that gemini 3 is not that great in cursor, but excellent in Antigravity. I don't have a time to personally verify all that though.

config_yml4mo ago

I‘ve had no success using Antigravity, which is a shame because the ideas are promising, but the execution so far is underwhelming. Haven‘t gotten past an initial plannin doc which is usually aborted due to model provider overload or rate limiting.

2 more replies

itsdrewmiller4mo ago

My first couple of attempts at antigravity / Gemini were pretty bad - the model kept aborting and it was relatively helpless at tools compared to Claude (although I have a lot more experience tuning Claude to be fair). Seems like there are some good ideas in antigravity but it’s more like an alpha than a product.

koakuma-chan4mo ago

Nothing is great in Cursor.

vanviegen4mo ago

It's just not great at coding, period. In Antigravity it takes insane amounts of time and tokens for tasks that copilot/sonnet would solve in 30 seconds.

It generates tokens pretty rapidly, but most of them are useless social niceties it is uttering to itself in it's thinking process.

incoming12114mo ago

I think gemini 3 is hot garbage in everything. Its great on a greenfield trying to 1 shot something, if you're working on a long term project it just sucks.

UltraSane4mo ago

I've had Gemini 3 Pro solve issues that Claude Code failed to solve after 10 tries. It even insulted some code that Sonnet 4.5 generated

victor90004mo ago

I'm also finding Gemini 3 (via Gemini CLI) to be far superior to Claude in both quality and availability. I was hitting Claude limits every single day, at that point it's literally useless.

1 more reply

typpilol4mo ago

Same here. Gemini just rips shit out and doesn't understand the flow well between event based components either

rw24mo ago

Gemini 3 in antigravity is amazing

screye4mo ago

Gemini being terrible in Cursor is a well known problem.

Unfortunately, for all its engineers, Google seems the most incompetent at product work.

arresin4mo ago

Gemini pro 3 was a let down for me too

poszlem4mo ago

I’ve trashed Gemini non-stop (seriously, check my history on this site), but 3 Pro is the one that finally made me switch from OpenAI. It’s still hot garbage at coding next to Claude, but for general stuff, it’s legit fantastic.

jjcm4mo ago

Tangental observation - I've noticed Gemini 3 Pro's train of thought feels very unique. It has kind of an emotive personality to it, where it's surprised or excited by what it finds. It feels like a senior developer looking through legacy code and being like, "wtf is this??".

I'm curious if this was a deliberate effort on their part, and if they found in testing it provided better output. It's still behind other models clearly, but nonetheless it's fascinating.

1 more reply

enraged_camel4mo ago

My testing of Gemini 3 Pro in Cursor yielded mixed results. Sometimes it's phenomenal. At other times I either get the "provider overloaded" message (after like 5 mins or whatever the timeout is), or the model's internal monologue starts spilling out to the chat window, which becomes really messy and unreadable. It'll do things like:

>> I'll execute.

>> Wait, what if...?

>> I'll execute.

Suffice it to say I've switched back to Sonnet as my daily driver. Excited to give Opus a try.

1 more reply

behnamoh4mo ago

i’ve tried Gemini in Google AI studio as well and was very disappointed by the superficial responses it provided. It seems like at the level of GPT-5-low or even lower.

On the other hand, it’s a truly multi modal model whereas Claude remains to be specifically targeted at coding tasks, and therefore is only a text model.

dave1010uk4mo ago

The Claude Opus 4.5 system card [0] is much more revealing than the marketing blog post. It's a 150 page PDF, with all sorts of info, not just the usual benchmarks.

There's a big section on deception. One example is Opus is fed news about Anthropic's safety team being disbanded but then hides that info from the user.

The risks are a bit scary, especially around CBRNs. Opus is still only ASL-3 (systems that substantially increase the risk of catastrophic misuse) and not quite at ASL-4 (uplifting a second-tier state-level bioweapons programme to the sophistication and success of a first-tier one), so I think we're fine...

I've never written a blog post about a model release before but decided to this time [1]. The system card has quite a few surprises, so I've highlighted some bits that stood out to me (and Claude, ChatGPT and Gemini).

[0] https://www.anthropic.com/claude-opus-4-5-system-card

[1] https://dave.engineer/blog/2025/11/claude-opus-4.5-system-ca...

aurareturn4mo ago

  Pages 22–24 of Opus’s system card provide some evidence for this. Anthropic run a multi-agent search benchmark where Opus acts as an orchestrator and Haiku/Sonnet/Opus act as sub-agents with search access. Using cheap Haiku sub-agents gives a ~12-point boost over Opus alone.

Will this lead to another exponential in capabilities and token increase in the same order as thinking models?

1 more reply

bnchrch4mo ago

Seeing these benchmarks makes me so happy.

Not because I love Anthropic (I do like them) but because it's staving off me having to change my Coding Agent.

This world is changing fast, and both keeping up with State of the Art and/or the feeling of FOMO is exhausting.

Ive been holding onto Claude Code for the last little while since Ive built up a robust set of habits, slash commands, and sub agents that help me squeeze as much out of the platform as possible.

But with the last few releases of Gemini and Codex I've been getting closer and closer to throwing it all out to start fresh in a new ecosystem.

Thankfully Anthropic has come out swinging today and my own SOP's can remain in tact a little while longer.

hakanderyal4mo ago

I think we are at the point where you can reliably ignore the hype and not get left behind. Until the next breakthrough at least.

I've been using Claude Code with Sonnet since August, and there haven't been any case where I thought about checking other models to see if they are any better. Things just worked. Yes, requires effort to steer correctly, but all of them do with their own quirks. Then 4.5 came, things got better automatically. Now with Opus, another step forward.

I've just ignored all the people pushing codex for the last weeks.

Don't fall into that trap and you'll be much more productive.

2 more replies

tordrt4mo ago

I tried codex due to the same reasoning you list. The grass is not greener on the other side.. I usually only opt for codex when my claude code rate limit hits.

bavell4mo ago

Same boat and same thoughts here! Hope it holds its own against the competition, I've become a bit of a fan of Anthropic and their focus on devs.

diego_sandoval4mo ago

I personally jumped ship from Claude to OpenAI due to the rate-limiting in Claude, and have no intention of coming back unless I get convinced that the new limits are at least double of what they were when I left.

Even if the code generated by Claude is slightly better, with GPT, I can send as many requests as I want and have no fear or running into any limit, so I feel free to experiment and screw up if necessary.

1 more reply

adriand4mo ago

Don't throw away what's working for you just because some other company (temporarily) leapfrogs Anthropic a few percent on a benchmark. There's a lot to be said for what you're good at.

I also really want Anthropic to succeed because they are without question the most ethical of the frontier AI labs.

2 more replies

wahnfrieden4mo ago

You need much less of a robust set of habits, commands, sub agent type complexity with Codex. Not only because it lacks some of these features, it also doesn't need them as much.

sothatsit4mo ago

The benefit you get from juggling different tools is at best marginal. In terms of actually getting work done, both Sonnet and GPT-5.1-Codex are both pretty effective. It looks like Opus will be another meaningful, but incremental, change, which I am excited about but probably won’t dramatically change how much these tools impact our work.

1 more reply

Stevvo4mo ago

With Cursor or Copilot+VSCode, you get all the models, can switch any time. When a new model is announced its available same day.

1 more reply

edf134mo ago

I’m threw a few hours at Codex the other day and was incredibly disappointed with the outcome…

I’m a heavy Claude code user and similar workloads just didn’t work out well for me on Codex.

One of the areas I think is going to make a big difference to any model soon is speed. We can build error correcting systems into the tools - but the base models need more speed (and obviously with that lower costs)

2 more replies

futureshock4mo ago

A really great way to get an idea of the relative cost and performance of these models at their various thinking budgets is to look at the ARC-AGI-2 leaderboard. Opus 4.5 stacks up very well here when you compare to Gemini 3’s score and cost. Gemini 3 Deep Think is still the current leaders but at more than 30x the cost.

The cost curve of achieving these scores is coming down rapidly. In Dec 2024 when OpenAI announced beating human performance on ARC-AGI-1, they spent more than $3k per task. You can get the same performance for pennies to dollars, approximately an 80x reduction in 11 months.

https://arcprize.org/leaderboard

https://arcprize.org/blog/oai-o3-pub-breakthrough

energy1234mo ago

A point of context. On this leaderboard, Gemini 3 Pro is "without tools" and Gemini 3 Deep Think is "with tools". In the other benchmarks released by Google which compare these two models, where they have access to the same amount of tools, the gap between them is small.

stavros4mo ago

Did anyone else notice Sonnet 4.5 being much dumber recently? I tried it today and it was really struggling with some very simple CSS on a 100-line self-contained HTML page. This never used to happen before, and now I'm wondering if this release has something to do with it.

On-topic, I love the fact that Opus is now three times cheaper. I hope it's available in Claude Code with the Pro subscription.

EDIT: Apparently it's not available in Claude Code with the Pro subscription, but you can add funds to your Claude wallet and use Opus with pay-as-you-go. This is going to be really nice to use Opus for planning and Sonnet for implementation with the Pro subscription.

However, I noticed that the previously-there option of "use Opus for planning and Sonnet for implementation" isn't there in Claude Code with this setup any more. Hopefully they'll implement it soon, as that would be the best of both worlds.

EDIT 2: Apparently you can use `/model opusplan` to get Opus in planning mode. However, it says "Uses your extra balance", and it's not clear whether it means it uses the balance just in planning mode, or also in execution mode. I don't want it to use my balance when I've got a subscription, I'll have to try it and see.

EDIT 3: It looks like Sonnet also consumes credits in this mode. I had it make some simple CSS changes to a single HTML file with Opusplan, and it cost me $0.95 (way too much, in my opinion). I'll try manually switching between Opus for the plan and regular Sonnet for the next test.

vunderba4mo ago

Anecdotally, I kind of compare the quality of Sonnet 4.5 to that of a chess engine: it performs better when given more time to search deeper into the tree of possible moves (more plies). So when Anthropic is under peak load I think some degradation is to be expected. I just wish Claude Code had a "Signal Peak" so that I could schedule more challenging tasks for a time when its not under high demand.

mscbuck4mo ago

Yes, I've absolutely noticed this. I feel like I can always tell when something is up when it starts trying to do WAY more things than normal. Like I can give it a few functions and ask for some updates, and it just goes through like 6 rounds of thinking, creating 6 new files, assuming that I want to write changes to a database, etc.

bryanlarsen4mo ago

On Friday my Claude was particularly stupid. It's sometimes stupid, but I've never seen it been that consistently stupid. Just assumed it was a fluke, but maybe something was changing.

beydogan4mo ago

100% dumber, especially since last 3-4 days. I have two guesses:

- They make it dumber close to a new release to hype the new model

- They gave $1000 Claude Code Web credits to a lot of people, which increased the load a lot so they had to serve quantized version to handle the it.

I love Claude models but I hate this non transparency and instability.

kjgkjhfkjf4mo ago

My guess is that Claude's "bad days" are due to the service becoming overloaded and failing over to use cheaper models.

bamboozled4mo ago

Noticed it hard today, it's just "stupid" now.

simonw4mo ago

Notes and two pelicans: https://simonwillison.net/2025/Nov/24/claude-opus/

tkgally4mo ago

I added Opus 4.5 to my benchmark of 30 alternatives to your now-classic pelican-bicycle prompt (e.g., “Generate an SVG of a dragonfly balancing a chandelier”). Nine models are now represented:

https://gally.net/temp/20251107pelican-alternatives/index.ht...

2 more replies

dreis_sw4mo ago

I agree with your sentiment, this incremental evolution is getting difficult to feel when working with code, especially with large enterprise codebases. I would say that for the vast majority of tasks there is a much bigger gap on tooling than on foundational model capability.

1 more reply

jasonjmcghee4mo ago

Did you write the terminal -> html converter (how you display the claude code transcripts), or is that a library?

1 more reply

nojs4mo ago

> Thinking blocks from previous assistant turns are preserved in model context by default

This seems like a huge change no? I often use max thinking on the assumption that the only downside is time, but now there’s also a downside of context pollution

1 more reply

pjm3314mo ago

i think you have an error there about haiku pricing

> For comparison, Sonnet 4.5 is $3/$15 and Haiku 4.5 is $4/$20.

i think haiku should be $1/$5

1 more reply

throwaway20274mo ago

I wonder if at this point they read what people use to benchmark with and specifically train it to do well at this task.

diego_sandoval4mo ago

:%s/There model/Their model/g

jasonthorsness4mo ago

I used Gemini instead of my usual Claude for a non-trivial front-end project [1] and it really just hit it out of the park especially after the update last week, no trouble just directly emitting around 95% of the application. Now Claude is back! The pace of releases and competition seems to be heating up more lately, and there is absolutely no switching cost. It's going to be interesting to see if and how the frontier model vendors create a moat or if the coding CLIs/models will forever remain a commodity.

[1] https://github.com/jasonthorsness/tree-dangler

diego_sandoval4mo ago

What IDE/CLI tool do you use?

1 more reply

hu34mo ago

Gemini is indeed great for frontend HTML + CSS and even some light DOM manipulation in JS.

I have been using Gemini 2.5 and now 3 for frontend mockups.

When I'm happy with the result, after some prompt massage, I feed it to Sonnet 4.5 to build full stack code using the framework of the application.

redfloatplane4mo ago

On my Max plan, Opus 4.5 is now the default model! Until now I used Sonnet 4.5 exclusively and never used Opus, even for planning - I'm shocked that this is so cheap (for them) that it can be the default now. I'm curious what this will mean for the daily/weekly limits.

A short run at a small toy app makes me feel like Opus 4.5 is a bit slower than Sonnet 4.5 was, but that could also just be the day-one load it's presumably under. I don't think Sonnet was holding me back much, but it's far too early to tell.

Robdel124mo ago

Right! I thought this at the very bottom was super interesting

1 more reply

agentifysh4mo ago

wish they really bolded that part because i almost passed off on it until i read the blog carefully

instant upgrade to claude max 20x if they give opus 4.5 out like this

i still like codex-5.1 and will keep it.

gemini cli missed its opportunity again now money is hedged between codex and claude.

MaxLeiter4mo ago

We've added support for opus 4.5 to v0 and users are making some pretty impressive 1-shots:

https://x.com/mikegonz/status/1993045002306699704

https://x.com/MirAI_Newz/status/1993047036766396852

https://x.com/rauchg/status/1993054732781490412

It seems especially good at threejs / 3D websites. Gemini was similarly good at them (https://x.com/aymericrabot/status/1991613284106269192); maybe the model labs are focusing on this style of generation more now.

pwython4mo ago

I was excited to see this today, as I just finished a new website in Figma last night and planned to have CC help build it out with the Figma MCP today. CC started up with Opus 4.5 and... Not off to a good start. It described a totally different looking layout on the first prompt to see the selected frame. I asked why:

You're right to call that out. Looking back at what happened:

  1. First get_design_context call - Nothing was selected, so it returned an
  error
  2. Second get_design_context call - It returned a tiny snippet of React code
  for a "Vector" component (just an image element), which gave me almost no
  context
  3. First get_screenshot call - The image returned was essentially blank/empty
  (the output showed <output_image> with what appears to be no actual visual
  content)

  I then hallucinated an entire design description based on:
  - The project context from CLAUDE.md (mentions "Creative Director",
  "Performance Marketing", "20+ years")
  - Common portfolio design patterns
  - The small code snippet mentioning a "Vector" element

  That was a significant error on my part. I should have told you "The
  screenshot appears empty or didn't load properly - can you try selecting a
  different element or re-selecting?" instead of fabricating a description.

jumploops4mo ago

> Pricing is now $5/$25 per million [input/output] tokens

So it’s 1/3 the price of Opus 4.1…

> [..] matches Sonnet 4.5’s best score on SWE-bench Verified, but uses 76% fewer output tokens

…and potentially uses a lot less tokens?

Excited to stress test this in Claude Code, looks like a great model on paper!

alach114mo ago

This is the biggest news of the announcement. Prior Opus models were strong, but the cost was a big limiter of usage. This price point still makes it a "premium" option, but isn't prohibitive.

Also increasingly it's becoming important to look at token usage rather than just token cost. They say Opus 4.5 (with high reasoning) used 50% fewer tokens than Sonnet 4.5. So you get a higher score on SWE-bench verified, you pay more per token, but you use fewer tokens and overall pay less!

jmkni4mo ago

> Pricing is now $5/$25 per million tokens

For anyone else confused, it's input/output tokens

$5 for 1million tokens in $25 for 1million tokens out

2 more replies

dent94mo ago

All the users in the comments here complaining about API limits and usage limits have missed the boat. You're not the target audience. This AI is not for you. It's not for consumers and end users. This AI is for the multi-billion and trillion-dollar businesses who are signing massive contracts to get these models enabled for their entire company. I've been using Sonnet 4.5 for months and never had a usage limit ever. And I used every model before that, all day and all night, and never once saw any mention of usage limits. Never saw a bill either. If "price per token" is a concern to you then you already lost.

bigmadshoe4mo ago

How could price per token not be a concern for any “multi-billion” or “multi-trillion dollar” business? Do they just burn money to remain profitable?

1 more reply

andai4mo ago

Why do they always cut off 70% of the y-axis? Sure it exaggerates the differences, but... it exaggerates the differences.

And they left Haiku out of most of the comparisons! That's the most interesting model for me. Because for some tasks it's fine. And it's still not clear to me which ones those are.

Because in my experience, Haiku sits at this weird middle point where, if you have a well defined task, you can use a smaller/faster/cheaper model than Haiku, and if you don't, then you need to reach for a bigger/slower/costlier model than Haiku.

ximeng4mo ago

It’s a pretty arbitrary y axis - arguably the only thing that matters is the differences.

waynenilsen4mo ago

marketing.

obblekk4mo ago

80% on swebench verified is incredible. a year ago the best model was at ~30%. i wonder if we'll soon have a convincingly superhuman coding capability (even in a narrow field like kernel optimization).

this is the most interesting time for software tools since compilers and static typechecking was invented.

quantumHazer4mo ago

Last year’s model were at 50-60% on SWE bench-verified actually

1 more reply

keeeba4mo ago

Oh boy, if the benchmarks are this good and Opus feels like it usually does then this is insane.

I’ve always found Opus significantly better than the benchmarks suggested.

LFG

ddxv4mo ago

The LLMs rate of improvement has really slowed down. This looks like a minor improvement in terms of accuracy and big gains from efficiency.

energy1234mo ago

14 months ago we had GPT-4 and now we have models that can get a gold medal at the IMO.

But sure, if you curve fit to the last 3 months you could say things are slowing down, but that's hyper fixating on a very small amount of information.

2 more replies

irthomasthomas4mo ago

I wish it was open-weights so we could discuss the architectural changes. This model is about twice as fast as 4.1, ~60t/s Vs ~30t/s. Is it half the parameters, or a new INT4 linear sparse-moe architecture?

elvin_d4mo ago

Great seeing the price reduction. Opus historically was prices at 15/75, this one delivers at 5/25 which is close to Gemini 3 Pro. I hope Anthropic can afford increasing limits for the new Opus.

gigatexal4mo ago

Love the competition. Gemini 3 pro blew me away after being spoiled by Claude for coding things. Considered canceling my Anthropic sub but now I’m gonna hold on to it.

The bigger thing is Google has been investing in TPUs even before the craze. They’re on what gen 5 now ? Gen 7? Anyway I hope they keep investing tens of billions into it because Nvidia needs to have some competition and maybe if they do they’ll stop this AI silliness and go back to making GPUs for gamers. (Hahaha of course they won’t. No gamer is paying 40k for a GPU.)

sbinnee4mo ago

As much as I am excited by the price, the tools they called "the advanced tool"[1] look so useful to me; Tool search, programmatic tool calling (smolagents.CodeAgent by HF), and tool use examples (in-context learning).

They said that they have seen 134K tokens for tool definition alone. That is insane. I also really liked the puzzle game video.

[1] https://www.anthropic.com/engineering/advanced-tool-use

nickandbro4mo ago

"Create me a SVG of a PS4 controller"

Gemini 3.0 Pro: https://www.svgviewer.dev/s/CxLSTx2X

Opus 4.5: https://www.svgviewer.dev/s/dOSPSHC5

I think Opus 4.5 did a bit better overall, but I do think eventually frontier models will eventually converge to a point where the quality will be so good it will be hard to tell the winner.

esperent4mo ago

I can only see the svg code there on mobile. I don't see any way to view the output.

1 more reply

chaosprint4mo ago

SWE's results were actually very close, but they used a poor marketing visualization. I know this isn't a research paper, but for Anthropic, I expect more.

flakiness4mo ago

They should've used an error rate instead of the pass rate. Then it'll get the same visual appeal without cheating.

jaakkonen4mo ago

Tested this today for implementing a new low-frequency RFID protocol to Flipper Zero codebase based on a Proxmark3 implementation. Was able to do it in 2 hours with giving a raw psk recording alongside of it and some troubleshooting. This is the kind of task the last generation of frontier models was incapable of doing. Super stoked to use this :)

achierius4mo ago

Was this just 2 hours of the agent running on its own, or was there back-and-forth/any sort of interaction? How much did you have to set up scaffolding, e.g. tests?

ofermend4mo ago

Can't wait to try Opus 4.5

We just evaluated it for Vectara's grounded hallucination leaderboard: it scores at 10.9% hallucination rate, better than Gemini-3, GPT-5.1-high or Grok-4.

https://github.com/vectara/hallucination-leaderboard

andreybaskov4mo ago

Does anyone know or have a guess on the size of this latest thinking models and what hardware they use to run inference? As in how much memory and what quantization it uses and if it's "theoretically" possible to run it on something like Mac Studio M3 Ultra with 512GB RAM. Just curious from theoretical perspective.

threeducks4mo ago

Rough ballpark estimate:

- Amazon Bedrock serves Claude Opus 4.5 at 57.37 tokens per second: https://openrouter.ai/anthropic/claude-opus-4.5

- Amazon Bedrock serves gpt-oss-120b at 1748 tokens per second: https://openrouter.ai/openai/gpt-oss-120b

- gpt-oss-120b has 5.1B active parameters at approximately 4 bits per parameter: https://huggingface.co/openai/gpt-oss-120b

To generate one token, all active parameters must pass from memory to the processor (disregarding tricks like speculative decoding)

Multiplying 1748 tokens per second with the 5.1B parameters and 4 bits per parameter gives us a memory bandwidth of 4457 GB/sec (probably more, since small models are more difficult to optimize).

If we divide the memory bandwidth by the 57.37 tokens per second for Claude Opus 4.5, we get about 80 GB of active parameters.

With speculative decoding, the numbers might change by maybe a factor of two or so. One could test this by measuring whether it is faster to generate predictable text.

Of course, this does not tell us anything about the number of total parameters. The ratio of total parameters to active parameters can vary wildly from around 10 to over 30:

    120 : 5.1 for gpt-oss-120b
    30 : 3 for Qwen3-30B-A3B
    1000 : 32 for Kimi K2
    671 : 37 for DeepSeek V3

Even with the lower bound of 10, you'd have about 800 GB of total parameters, which does not fit into the 512 GB RAM of the M3 Ultra (you could chain multiple, at the cost of buying multiple).

But you can fit a 3 bit quantization of Kimi K2 Thinking, which is also a great model. HuggingFace has a nice table of quantization vs required memory https://huggingface.co/unsloth/Kimi-K2-Thinking-GGUF

2 more replies

docjay4mo ago

That all depends on what you consider to be reasonably running it. Huge RAM isn’t required to run them, that just makes them faster. I imagine technically all you'd need is a few hundred megabytes for the framework and housekeeping, but you’d have to wait for the some/most/all of the model to be read off the disk for each token it processes.

None of the closed providers talk about size, but for a reference point of the scale: Kimi K2 Thinking can spar in the big leagues with GPT-5 and such…if you compare benchmarks that use words and phrasing with very little in common with how people actually interact with them…and at FP16 you’ll need 2.9TB of memory @ 256,000 context. It seems it was recently retrained it at INT4 (not just quantized apparently) and now:

“ The smallest deployment unit for Kimi-K2-Thinking INT4 weights with 256k seqlen on mainstream H200 platform is a cluster with 8 GPUs with Tensor Parallel (TP). (https://huggingface.co/moonshotai/Kimi-K2-Thinking) “

-or-

“ 62× RTX 4090 (24GB) or 16× H100 (80GB) or 13× M3 Max (128GB) “

So ~1.1TB. Of course it can be quantized down to as dumb as you can stand, even within ~250GB (https://docs.unsloth.ai/models/kimi-k2-thinking-how-to-run-l...).

But again, that’s for speed. You can run them more-or-less straight off the disk, but (~1TB / SSD_read_speed + computation_time_per_chunk_in_RAM) = a few minutes per ~word or punctuation.

1 more reply

nickandbro4mo ago

I use the following models like so nowadays:

Gemini is great, when you have gitingested the code of pypi package and want to use it as context. This comes in handy for tasks and repos outside the model's training data.

5.1 Codex I use for a narrowly defined task where I can just fire and forget it. For example, codex will troubleshoot why a websocket is not working, by running its own curl requests within cursor or exec'ing into the docker container to debug at a level that would take me much longer.

Claude 4.5 Opus is a model that I feels trustworthy for heavy refactors of code bases or modularizing sections of code to become more manageable. Often it seems like the model doesn't leave any details out and the functionality is not lost or degraded.

swapnilt4mo ago

Opus 4.5's scaling is impressive on benchmarks, but the usual caveats apply: benchmark saturation is real, and we're seeing diminishing returns on evals that test pattern-matching vs. genuine reasoning. The more relevant question: has anyone stress-tested this on novel problems or complex multi-step reasoning outside training data distributions? Marketing often showcases 'advanced math' and 'code generation' where the solutions exist in training data. The claim of 'reasoning improvement' needs validation on genuinely unfamiliar problem classes.

sync4mo ago

Does anyone here understand "interleaved scratchpads" mentioned at the very bottom of the footnotes:

> All evals were run with a 64K thinking budget, interleaved scratchpads, 200K context window, default effort (high), and default sampling settings (temperature, top_p).

I understand scratchpads (e.g. [0] Show Your Work: Scratchpads for Intermediate Computation with Language Models) but not sure about the "interleaved" part, a quick Kagi search did not lead to anything relevant other than Claude itself :)

[0] https://arxiv.org/abs/2112.00114

dheerkt4mo ago

based on their past usage of "interleaved tool calling" it means that the tool can be used while the model is thinking.

https://aws.amazon.com/blogs/opensource/using-strands-agents...

1 more reply

alvis4mo ago

“For Max and Team Premium users, we’ve increased overall usage limits, meaning you’ll have roughly the same number of Opus tokens as you previously had with Sonnet.” — seems like anthropic has finally listened!

Havoc4mo ago

Interesting that the number of hn comments on big model announcements seems to be dropping. I recall previous ones easily surpassing 1k

Maybe models are starting to get good enough/ levelling off?

sd94mo ago

It's fatigue. This is the third major model announcement in the last week.

On the other hand, this is the one I'm most excited by. I wouldn't have commented at all if it wasn't for your comment. But I'm excited to start using this.

1 more reply

lm284694mo ago

"we gained 2.7% in these artificial benchmarks and here is a picture of a pelican on a bicycle, get excited and give us $7 trillion please"

morgengold4mo ago

I'm on a Claude Code Max subscription. Last days have been a struggle with Sonnet 4.5 - Now it switched to Claude Opus 4.5 as default model. Ridiculous good and fast.

starkparker4mo ago

Would love to know what's going on with C++ and PHP benchmarks. No meaningful gain over Opus 4.1 for either, and Sonnet still seems to outperform Opus on PHP.

aliljet4mo ago

The real question I have after seeing the usage rug being pulled is what this costs and how usable this ACTUALLY is with a Claude Max 20x subscription. In practice, Opus is basically unusable by anyone paying enterprise-prices. And the modification of "usage" quotas has made the platform fundamentally unstable, and honestly, it left me personally feeling like I was cheated by Anthropic...

ximeng4mo ago

With less token usage, cheaper pricing, and enhanced usage limits for Opus, Anthropic are taking the fight to Gemini and OpenAI Codex. Coding agent performance leads to better general work and personal task performance, so if Anthropic continue to execute well on ergonomics they have a chance to overcome their distribution disadvantages versus the other top players.

syspec4mo ago

Thank you Claude.

GenerWork4mo ago

I wonder what this means for UX designers like myself who would love to take a screen from Figma and turn it into code with just a single call to the MCP. I've found that Gemini 3 in Figma Make works very well at one-shotting a page when it actually works (there's a lot of issues with it actually working, sadly), so hopefully Opus 4.5 is even better.

gsibble4mo ago

They lowered the price because this is a massive land grab and is basically winner take all.

I love that Antrhopic is focused on coding. I've found their models to be significantly better at producing code similar to what I would write, meaning it's easy to debug and grok.

Gemini does weird stuff and while Codex is good, I prefer Sonnet 4.5 and Claude code.

rw24mo ago

Gemini 3 in antigravity is significantly better than Claude code with either Opus or Sonnet that I struggle to see how they can compete. And I'm someone with the 100 dollar/month plan.

I can't even use Opus for a day before it runs out before. This will make it better but Antigravity has way better UI and also bug solving.

pingou4mo ago

What causes the improvements in new AI models recently? Is it just more training, or is it new, innovative techniques?

I_am_tiberius4mo ago

Some months back they changed their terms of service and by default users now allow Anthropic to use prompts for learning. As it's difficult to know if your prompts, or derivations of it, are part of a model, I would consider the possibility that they use everyone's prompt.

saaaaaam4mo ago

Anecdotally, I’ve been using opus 4.5 today via the chat interface to review several large and complex interdependent documents, fillet bits out of them and build a report. It’s very very good at this, and much better than opus 4.1. I actually didn’t realise that I was using opus 4.5 until I saw this thread.

jmward014mo ago

One thing I didn't see mentioned is raw token gen speed compared to the alternatives. I am using Haiku 4.5 because it is cheap (and so am I) but also because it is fast. Speed is pretty high up in my list of coding assistant features and I wish it was more prominent in release info.

adidoit4mo ago

Tested this building some PRs and issues that codex-5.1-max and gemini-3-pro were strugglig with

It planned way better in a much more granular way and then execute it better. I can't tell if the model is actually better or if it's just planning with more discipline

viraptor4mo ago

Has there been any announcement of a new programming benchmark? SWE looks like it's close to saturation already. At this point for SWE it may be more interesting to start looking at which types of issues consistently fail/work between model families.

Mkengin4mo ago

I like this one: https://swe-rebench.com/

agentifysh4mo ago

again the question of concern as codex user is usage

its hard to get any meaningful use out of claude pro

after you ship a few features you are pretty much out of weekly usage

compared to what codex-5.1-max offers on a plan that is 5x cheaper

the 4~5% improvement is welcome but honestly i question whether its possible to get meaningful usage out of it the way codex allows it

for most use cases medium or 4.5 handles things well but anthropic seems to have way less usage limits than what openai is subsidizing

until they can match what i can get out of codex it won't be enough to win me back

edit: I upgraded to claude max! read the blog carefully and seems like opus 4.5 is lifted in usage as well as sonnet 4.5!

jstummbillig4mo ago

Well, that's where the price reduction comes in handy, no?

2 more replies

mutewinter4mo ago

Some early visual evaluations: https://x.com/mutewinter/status/1993037630209192276

adastra224mo ago

Does it follow directions? I’ve found Sonnet 4.5 to be useless for automated workflows because it refuses to follow directions. I hope they didn’t take the same RLHF approach they did with that model.

carcabob4mo ago

I wish the article's graphs weren't distorted by skipping so much of the scale to make it look like a more significant difference than it is. But it does looks impressive.

maherbeg4mo ago

Ok, the victorian lock puzzle game is pretty damn cool way to showcase the capabilities of these models. I kinda want to start building similar puzzle games for models to solve.

PilotJeff4mo ago

More blowing up of the bubble with anthropic essentially offering compute/LLM for below cost. Eventually the laws of physics/market will take over and look out below.

jstummbillig4mo ago

How would you know what the cost is?

jedberg4mo ago

Up until today, the general advice was use Opus for deep research, use Haiku for everything else. Given the reduction in cost here, does that rule of thumb no longer apply?

mudkipdev4mo ago

In my opinion Haiku is capable but there is no reason to use anything lower than Sonnet unless you are hitting usage limits

sd94mo ago

After experimenting with Gemini 3, I still felt like Sonnet 4.5 had the edge. So I'm very excited to start playing with this in the wild.

ramon1564mo ago

I've almost ran out of Claude on the Web credits. If they announce that they're going to support Opus then I'm going to be sad :'(

undeveloper4mo ago

haven't they all expired by now?

adt4mo ago

https://lifearchitect.ai/models-table/

alvis4mo ago

What surprise me is that Opus 4.5 lost all reasoning scores to Gemini and GPT. I thought it’s the area the model will shine the most

thot_experiment4mo ago

It's really hard for me to take these benchmarks seriously at all, especially that first one where Sonnet 4.5 is better at software engineering than Opus 4.1.

It is emphatically not, it has never been, I have used both models extensively and I have never encountered a single situation where Sonnet did a better job than Opus. Any coding benchmark that has Sonnet above Opus is broken, or at the very least measuring things that are totally irrelevant to my usecases.

This in particular isn't my "oh the teachers lie to you moment" that makes you distrust everything they say, but it really hammers the point home. I'm glad there's a cost drop, but at this point my assumption is that there's also going to be a quality drop until I can prove otherwise in real world testing.

mirsadm4mo ago

These announcements and "upgrades" are becoming increasingly pointless. No one is going to notice this. The improvements are questionable and inconsistent. They could swap it out for an older model and no one would notice.

1 more reply

AbstractH244mo ago

Amazing how every company's newest model performs best in the benchmarks they share in the announcment....

johnnycombin4mo ago

the most overhyped model ever, not even close to Gemini3 or GPT5.1 after 8h of complex tasks.

winrid4mo ago

So far this seems like a huge downgrade from Opus 4.1. Please add back 4.1 as an option...

t0lo4mo ago

So are we in agreement that claude is the thinking persons model and openai is for the masses

whitepoplar4mo ago

Does the reduced price mean increased usage limits on Claude Code (with a Max subscription)?

skerit4mo ago

Yes. Opus is now the default model in Claude Code. And Opus 4.5 counts the same toward your usage limit as Sonnet 4.5 did.

Even better: Sonnet 4.5 now has its own separate limit.

rishabhaiover4mo ago

Is this available on claude-code?

elvin_d4mo ago

Yes, the first run was nice - feels faster than 4.1 and did what Sonnet 4.5 struggled to execute properly.

greenavocado4mo ago

What are you thinking of trying to use it for? It is generally a huge waste of money to unleash Opus on high content tasks ime

2 more replies

rishabhaiover4mo ago

damn, I need a MAX sub for this.

1 more reply

andai4mo ago

This one is different. IYKYK...

rutagandasalim4mo ago

claude opus 4.5 is an incredible model i just one-shoted https://aithings.dev with it

jstummbillig4mo ago

What was the prompt?

CuriouslyC4mo ago

I hate on Anthropic a fair bit, but the cost reduction, quota increases and solid "focused" model approach are real wins. If they can get their infrastructure game solid, improve claude code performance consistency and maintain high levels of transparency I will officially have to start saying nice things about them.

I_am_tiberius4mo ago

Still mad at them because they decided not to take their users' privacy serious. Would be interested how the new model behaves, but just have a mental lock and can't sign up again.

quantummagic4mo ago

I would look past their privacy issues and have wanted to sign up for over a year, but don't have a cellphone, which is required to register.

throwaway20274mo ago

Oh that's why there were only 2 usage bars.

xkbarkar4mo ago

This is great. Sonnet 4.5 has degraded terribly.

I can get some useful stuff from a clean context in the web ui but the cli is just useless.

Opus is far superiour.

Today sonnet 4.5 suggested to verify remote state file presence by creating an empty one locally and copy it to the remote backend. Da fuq? University level programmer my a$$.

And it seems like it has degraded this last month.

I keep getting braindead suggestions and code that looks like it came from a random word generator.

I swear it was not that awful a couple of months ago.

Opus cap has been an issue, happy to change and I really hope the nerf rumours are just that. Undounded rumours and the defradation has a valid root cause

But honestly sonnet 4.5 has started to act like a smoking pile of sh**t

idonotknowwhy4mo ago

>This is great. Sonnet 4.5 has degraded terribly. >I can get some useful stuff from a clean context in the web ui but the cli is just useless. >I swear it was not that awful a couple of months ago.

I agree on all 3 counts. And it still degrades after a few long turns in openwebui. You can test this by regenerating the last reply in chats from shortly after the model was released.

kachapopopow4mo ago

slightly better at react and spacial logic than gemini 3 pro, but slower and way more expensive.

synergy204mo ago

great, paying $100/m for claude code, this stops me from switching to gemini 3.0 for now.

clbrmbr4mo ago

Does anyone have a benchmark that clearly distinguishes the larger models? I would think that the high parameter count models would have capabilities distinct from the smaller ones, that would easily be read out. For example, Opus 4 has apparently memorized many books. If you ask it just right (to get around the infuriating copyright controls), it will complete a paragraph from The Wealth of Nations or Aristotle’s Nicomachean Ethics in Ancient Greek. That cannot be possible on a smaller model that needs to compress more.

GodelNumbering4mo ago

The fact that the post singled out SWE-bench at the top makes the opposite impression that they probably intended.

grantpitt4mo ago

do say more

1 more reply

cyrusradfar4mo ago

I'm curious if others are finding that there's a comfort in staying within the Claude ecosystem because when it makes a mistake, we get used to spotting the pattern. I'm finding that when I try new models, their "stupid" moments are more surprising and infuriating.

Given this tech is new, the experience of how we relate to their mistakes is something I think a bit about.

Am I alone here, are others finding themselves more forgiving of "their preferred" model provider?

irthomasthomas4mo ago

I guess you where not around a few months back when they over-optimized and served a degraded model for weeks.

1 more reply

tschellenbach4mo ago

Ok, but can it play Factorio?

fragmede4mo ago

Got the river crossing one:

https://claude.ai/chat/0c583303-6d3e-47ae-97c9-085cefe14c21

Still fucked up one about the boy and the surgeon though:

https://claude.ai/chat/d2c63190-059f-43ef-af3d-67e7ca1707a4

AJRF4mo ago

that chart at the start is egregious

tildef4mo ago

Feels like a tongue-in-cheek jab at the GPT-5 announcement chart.

robertwt74mo ago

this is very impressive! as much as I love Claude though, is it just me or their limit is much lower compared to others (Gemini and GPT)? At the moment I'm subscribed to Google One AI ($20) which gives me the most value with the 2tb google drive and Cursor ($20). I've subscribed to GPT and Claude as well in the past, I find that I was hitting the limit much faster in Claude compared to all the others, it made me reluctant to subscribe again. from the blog post it seems like they've been prioritising the Max users most of the time?

0x79de4mo ago

this is quite a good

lerp-io4mo ago

80% and 77% is not that much lol

zb34mo ago

The first chart is straight from "how to lie in charts"..

1 more reply

j / k navigate · click thread line to collapse

506 comments

llamasushi4mo ago

The "most aligned model" framing is doing a lot of heavy lifting though. Would love to see third-party red team results.

tekacs4mo ago

This is also super relevant for everyone who had ditched Claude Code due to limits:

tifik4mo ago

1 more reply

Aeolun4mo ago

1 more reply

js4ever4mo ago

Interesting. I totally stopped using opus on my max subscription because it was eating 40% of my week quota in less than 2h

TrueDuality4mo ago

Now THAT is great news

1 more reply

throwaway-aws93mo ago

brianjking4mo ago

They also reset limits today, which was also quite kind as I was already 11% into my weekly allocation.

astrange4mo ago

Just avoid using Claude Research, which I assume still instantly eats most of your token limits.

sqs4mo ago

What's super interesting is that Opus is cheaper all-in than Sonnet for many usage patterns.

Here are some early rough numbers from our own internal usage on the Amp team (avg cost $ per thread):

- Sonnet 4.5: $1.83

- Opus 4.5: $1.30 (earlier checkpoint last week was $1.55)

- Gemini 3 Pro: $1.21

Cost per token is not the right way to look at this. A bit more intelligence means mistakes (and wasted tokens) avoided.

localhost4mo ago

Much better to look at cost per task - and good to see some benchmarks reporting this now.

1 more reply

leo_e4mo ago

Hard agree. The hidden cost of 'cheap' models is the complexity of the retry logic you have to write around them.

andai4mo ago

Yeah, that's a great point.

ArtificialAnalysis has a "intelligence per token" metric on which all of Anthropic's models are outliers.

For some reason, they need way less output tokens than everyone else's models to pass the benchmarks.

(There are of course many issues with benchmarks, but I thought that was really interesting.)

tmaly4mo ago

what is the typical usage pattern that would result in these cost figures?

1 more reply

sharkjacobs4mo ago

3x price drop almost certainly means Opus 4.5 is a different and smaller base model than Opus 4.1, with more fine tuning to target the benchmarks.

I'll be curious to see how performance compares to Opus 4.1 on the kind of tasks and metrics they're not explicitly targeting, e.g. eqbench.com

nostrademons4mo ago

4 more replies

ACCount374mo ago

Probably more sparse (MoE) than Opus 4.1. Which isn't a performance killer by itself, but is a major concern. Easy to get it wrong.

cootsnuck4mo ago

We already know distillation works pretty well. So definitely would make sense Opus 4.5 is effectively smaller (like someone else said, could be via MoE or some other technique too).

We know the big labs are chasing efficiency cans where they can.

adgjlsfhk14mo ago

It seems plausible that it's a similar size model and that the 3x drop is just additional hardware efficiency/lowered margin.

2 more replies

losvedir4mo ago

shepherdjerred4mo ago

I thought AI safety was dumb/unimportant until I saw this dataset of dangerous prompts: https://github.com/mlcommons/ailuminate/blob/main/airr_offic...

I don't love the idea of knowledge being restricted... but I also think these tools could result in harm to others in the wrong hands

7 more replies

wkat42424mo ago

Jailbreaking is trivial though. If anything really bad could happen it would have happened already.

And the prudeness of American models in particular is awful. They're really hard to use in Europe because they keep closing up on what we consider normal.

NiloCK4mo ago

Waymos, LLMs, brain computer interfaces, dictation and tts, humanoid robots that are worth a damn.

Ye best start believing in silly sci-fi stories. Yer in one.

narrator4mo ago

Pliney the Liberator jailbroke it in no time. Not sure if this applies to prompt injection:

https://x.com/elder_plinius/status/1993089311995314564

cmrdporcupine4mo ago

Note the comment when you start claude code:

"To give you room to try out our new model, we've updated usage limits for Claude Code users."

That really implies non-permanence.

Xlr8head4mo ago

Still better than perma-nonce.

AtNightWeCode4mo ago

windexh8er4mo ago

This is spot on. The amount of wasteful output tokens from Claude is crazy. The actual output you're looking for might be better, but you're definitely going to pay for it in the long run.

1 more reply

Scene_Cast24mo ago

Still way pricier (>2x) than Gemini 3 and Grok 4. I've noticed that the latter two also perform better than Opus 4, so I've stopped using Opus.

pants24mo ago

Don't be so sure - while I haven't tested Opus 4.5 yet, Gemini 3 tends to use way more tokens than Sonnet 4.5. Like 5-10X more. So Gemini might end up being more expensive in practice.

2 more replies

wolttam4mo ago

It's 1/3 the old price ($15/$75)

brookst4mo ago

Not sure if that’s a joke about LLM math performance, but pedantry requires me to point out 15 / 75 = 1/5

4 more replies

llamasushi4mo ago

Just updated, thanks

tom_m4mo ago

It was already viable pricing before. You have to remember this is for business use. Many companies will pay 20% on top of an engineer's salary to have them be 200% as effective. Right?

resonious4mo ago

jstummbillig4mo ago

It's both kinda neat and irritating, how many parallels there are between this AI paradigm and what we do.

burgerone4mo ago

Using AI in production is no doubt an enormous security risk...

laterium4mo ago

Where's the argument? Or we're just asserting things?

delaminator4mo ago

Not all production processes untrusted input.

irthomasthomas4mo ago

It's about double the speed of 4.1, too. ~60t/s vs ~30t/s. I wish it where openweights so we could discuss the architectural changes.

RestartKernel4mo ago

> [...] that's legitimately significant for anyone deploying agents with tool access.

I disagree, even if only because your model shouldn't have more access than any other front-end.

antihero4mo ago

consumer4514mo ago

> Claude Opus 4.5 in Windsurf for 2x credits (instead of 20x for Opus 4.1)

https://old.reddit.com/r/windsurf/comments/1p5qcus/claude_op...

At the risk of sounding like a shill, in my personal experience, Windsurf is somehow still the best deal for an agentic VSCode fork.

zwnow4mo ago

unsupp0rted4mo ago

This is gonna be game-changing for the next 2-4 weeks before they nerf the model.

Then for the next 2-3 months people complaining about the degradation will be labeled “skill issue”.

Then a sacrificial Anthropic engineer will “discover” a couple obscure bugs that “in some cases” might have lead to less than optimal performance. Still largely a user skill issue though.

Then a couple months later they’ll release Opus 4.7 and go through the cycle again.

My allegiance to these companies is now measured in nerf cycles.

I’m a nerf cycle customer.

lukev4mo ago

There are two possible explanations for this behavior: the model nerf is real, or there's a perceptual/psychological shift.

However, benchmarks exist. And I haven't seen any empirical evidence that the performance of a given model version grows worse over time on benchmarks (in general.)

Therefore, some combination of two things are true:

1. The nerf is psychologial, not actual. 2. The nerf is real but in a way that is perceptual to humans, but not benchmarks.

davidsainez4mo ago

There are well documented cases of performance degradation: https://www.anthropic.com/engineering/a-postmortem-of-three-....

1 more reply

jaggs4mo ago

https://www.youtube.com/watch?v=DtePicx_kFY

Llion Jones, co-inventor of transformers architecture

1 more reply

data-ottawa4mo ago

The previous “nerf” was actually several bugs that dramatically decreased performance for weeks.

I do also suspect there’s a bit of mental adjustment that goes in too.

blurbleblurble4mo ago

1 more reply

imiric4mo ago

Or, 2b: the nerf is real, but benchmarks are gamed and models are trained to excel at them, yet fall flat in real world situations.

1 more reply

camdenreslink4mo ago

parineum4mo ago

That's case #2 for you but I think the explanation I've proposed is pretty likely.

conception4mo ago

The only time Ive seen benchmark nerfing is I saw one see a drop in performance between 2.5 march preview and release.

csomar4mo ago

They are nerfed and there is actually a very simple test to prove otherwise: 0 temperature. This is only allowed with the API where you are billed full token prices.

Conclusion: It is nerfed unless Claude can prove otherwise.

1 more reply

teruakohatu4mo ago

> 1. The nerf is psychologial, not actual. 2. The nerf is real but in a way that is perceptual to humans, but not benchmarks.

They could publish weekly benchmarks. To disprove. They almost certainly have internal benchmarking.

The shift is certainly real. It might not be model performance but contextual changes or token performance (tasks take longer even if the model stays the same).

1 more reply

yawnxyz4mo ago

zsoltkacsandi4mo ago

> The nerf is psychologial, not actual

Is this empirical evidence?

And this is not only my experience.

Calling this phychological is gaslighting.

5 more replies

film424mo ago

This is why I migrated my apps that need an LLM to Gemini. No model degradation so far all through the v2.5 model generation. What is Anthropic doing? Swapping for a quantized version of the model?

TIPSIO4mo ago

Hilarious sarcastic comment but actually true sentiment.

For all we know this is just the Opus 4.0 re-released

nullbio4mo ago

arresin4mo ago

And then Dario gives an interview on why open source models should be banned due to _____.

1 more reply

yesco4mo ago

unshavedyak4mo ago

They added a "How is claude doing?" rating a while back which backs this statement up imo. Tons of A/B tests going on i bet.

1 more reply

all24mo ago

blinding-streak4mo ago

Very intriguing, curious if others have seen this.

1 more reply

F7F7F74mo ago

Couldn’t have said it better myself. I’ve cancelled my x20 two times now and they keep pulling me back.

hacb4mo ago

it will be just enough time to finish my quarter roadmap and chill until january

vagab0nd4mo ago

I did not know this but it's consistent with the behaviors of the CEO.

idonotknowwhy4mo ago

100%. They've been nerfing the model periodically since at least Sonnet 3.5, but this time it's so bad I ended up swapping out to GLM4.6 just to finish off a simple feature.

Capricorn24814mo ago

Thank god people are noticing this. I'm pretty sick of companies putting a higher number next to models and programmers taking that at face value.

fHr4mo ago

haha couldn't have put this better, exactly this

WNWceAJ9R9Ezc44mo ago

Accurate.

stingraycharles4mo ago

I’m disappointed that this type of discourse has now entered HN. I expected a more evidence-based less “nerf cycle” discussion over here.

Etheryte4mo ago

This is nothing new and it's been discussed numerous times. Would you also say we need more evidence that Meta is tracking people?

1 more reply

blurbleblurble4mo ago

Maybe I should just start paying via tokens for a hopefully more consistent experience.

throwuxiytayq4mo ago

y’all hallucinating harder than GPT2 on DMT

2 more replies

827a4mo ago

mritchie7124mo ago

> only Claude Code has been able to really solve; Sonnet 4.5 in there consistently performs better than Sonnet 4.5 anywhere else.

I think part of it is this[0] and I expect it will become more of a problem.

Claude models have built-in tools (e.g. `str_replace_editor`) which they've been trained to use. These tools don't exist in Cursor, but claude really wants to use them.

0 - https://x.com/thisritchie/status/1944038132665454841?s=20

bgrainger4mo ago

This feels like a dumb question, but why doesn't Cursor implement that tool?

3 more replies

HugoDias4mo ago

At least I’m coding more again, lol

4 more replies

vunderba4mo ago

My workflow was usually to use Gemini 2.5 Pro (now 3.0) for high-level architecture and design. Then I would take the finished "spec" and have Sonnet 4.5 perform the actual implementation.

nevir4mo ago

Same here. Gemini really excels at all the "softer" parts of the development process (which, TBH, feels like most of the work). And Claude kicks ass at the actual code authoring.

It's a really nice workflow.

config_yml4mo ago

I use plan mode in claude code, then use gpt-5 in codex to review the plan and identify gaps and feed it back to claude. Results are amazing.

3 more replies

SkyPuncher4mo ago

This is how I do it. Though, I've been using Composer as my main driver more an more.

vessenes4mo ago

jeswin4mo ago

Same here. But with GPT 5.1 instead of Gemini.

UltraSane4mo ago

I've done this and it seems to work well. I ask Gemini to generate a prompt for Claude Code to accomplish X

joewhale4mo ago

What specific output would you ask Gemini to create for Sonnet? Thanks in advance!

lvl1554mo ago

int_19h4mo ago

Also, Gemini has that huge context window, which depending on the task can be a big boon.

1 more reply

egeozcan4mo ago

2 more replies

jdgoesmarching4mo ago

Personally my hype is for the price, especially for Flash. Before Sonnet 4.5 was competitive with Gemini 2.5 Pro, the latter was a much better value than Opus 4.1.

thousand_nights4mo ago

with gemini you have to spend 30 minutes deleting hundreds of useless comments littered in the code that just describe what the code itself does

2 more replies

chinathrow4mo ago

I gave Sonnet 4.5 a base64 encoded PHP serialize() json of an object dump and told him to extraxt the URL within.

It gave me the Youtube-URL to Rick Astley.

arghwhat4mo ago

If you're asking an LLM to compute something "off the top of its head", you're using it wrong. Ask it to write the code to perform the computation and it'll do better.

Same with asking a person to solve something in their head vs. giving them an editor and a random python interpreter, or whatever it is normal people use to solve problems.

2 more replies

hu34mo ago

> I gave Sonnet 4.5 a base64 encoded PHP serialize() json of an object dump and told him to extraxt the URL within.

This is what I imagine the LLM usage of people who tell me AI isn't helpful.

It's like telling me airplanes aren't useful because you can't use them in McDonald's drive-through.

astrojams4mo ago

I find it hilarious that it rick rolled you. I wonder if that is an easter egg of some sort?

mikestorrent4mo ago

You should probably tell AI to write you programs to do tasks that programs are better at than minds.

stavros4mo ago

Don't use LLMs for a task a human can't do, they won't do it well.

1 more reply

idonotknowwhy4mo ago

Almost any modern LLM can do this, even GPT-OSS

gregable4mo ago

it. Not him.

2 more replies

emodendroket4mo ago

Yeah I think Sonnet is still the best in my experience but the limits are so stingy I find it hard to recommend for personal use.

visioninmyblood4mo ago

consumer4514mo ago

I have a side-project prototype app that I tried to build on the Gemini 2.5 Pro API. I have not tried 3 yet, however the only improvements I would like to see is in Gemini's ability to:

1. Follow instructions consistently

2. API calls to not randomly result in "resource exhausted"

Can anyone share their experience with either of these issues?

I have built other projects accessing Azure GPT-4.1, Bedrock Sonnet 4, and even Perplexity, and those three were relatively rock solid compared to Gemini.

herrvogel-4mo ago

What you describe could also be the difference in the hallucination rate [0]. Opus 4.5 has the lead here and Gemini 3 Pro performs here quite bad compared to the other benchmarks.

[0] https://artificialanalysis.ai/?omniscience=omniscience-hallu...

rustystump4mo ago

Gemini 3 was awful when i gave it a spin. It was worse than cursor’s composer model.

Claude is still a go to but i have found that composer was “good enough” in practice.

arendtio3mo ago

I think the 'Agentic coding SWE-Bench Verified' [1] was actually the one benchmark where Google didn't even claim to beat Sonnet 4.5 ;-)

[1] https://deepmind.google/models/gemini/pro/

lossolo4mo ago

lxgr4mo ago

> I've played around with Gemini 3 Pro in Cursor, and honestly: I find it to be significantly worse than Sonnet 4.5.

1 more reply

verdverm4mo ago

> played around with

You'll never get an accurate comparison if you only play

We know by now that it takes time to "get to know a model and it's quirks"

So if you don't use a model and cannot get equivalent outputs to your daily driver, that's expected and uninteresting

827a4mo ago

1 more reply

rishabhaiover4mo ago

I suspect Cursor is not the right platform to write code on. IMO, humans are lazy and would never code on Cursor. They default to code generation via prompt which is sub-optimal.

viraptor4mo ago

> They default to writing code via prompt generation which is sub-optimal.

What do you mean?

1 more reply

Squarex4mo ago

I have heard that gemini 3 is not that great in cursor, but excellent in Antigravity. I don't have a time to personally verify all that though.

config_yml4mo ago

2 more replies

itsdrewmiller4mo ago

koakuma-chan4mo ago

Nothing is great in Cursor.

vanviegen4mo ago

It's just not great at coding, period. In Antigravity it takes insane amounts of time and tokens for tasks that copilot/sonnet would solve in 30 seconds.

It generates tokens pretty rapidly, but most of them are useless social niceties it is uttering to itself in it's thinking process.

incoming12114mo ago

I think gemini 3 is hot garbage in everything. Its great on a greenfield trying to 1 shot something, if you're working on a long term project it just sucks.

UltraSane4mo ago

I've had Gemini 3 Pro solve issues that Claude Code failed to solve after 10 tries. It even insulted some code that Sonnet 4.5 generated

victor90004mo ago

I'm also finding Gemini 3 (via Gemini CLI) to be far superior to Claude in both quality and availability. I was hitting Claude limits every single day, at that point it's literally useless.

1 more reply

typpilol4mo ago

Same here. Gemini just rips shit out and doesn't understand the flow well between event based components either

rw24mo ago

Gemini 3 in antigravity is amazing

screye4mo ago

Gemini being terrible in Cursor is a well known problem.

Unfortunately, for all its engineers, Google seems the most incompetent at product work.

arresin4mo ago

Gemini pro 3 was a let down for me too

poszlem4mo ago

jjcm4mo ago

I'm curious if this was a deliberate effort on their part, and if they found in testing it provided better output. It's still behind other models clearly, but nonetheless it's fascinating.

1 more reply

enraged_camel4mo ago

>> I'll execute.

>> Wait, what if...?

>> I'll execute.

Suffice it to say I've switched back to Sonnet as my daily driver. Excited to give Opus a try.

1 more reply

behnamoh4mo ago

i’ve tried Gemini in Google AI studio as well and was very disappointed by the superficial responses it provided. It seems like at the level of GPT-5-low or even lower.

On the other hand, it’s a truly multi modal model whereas Claude remains to be specifically targeted at coding tasks, and therefore is only a text model.

dave1010uk4mo ago

The Claude Opus 4.5 system card [0] is much more revealing than the marketing blog post. It's a 150 page PDF, with all sorts of info, not just the usual benchmarks.

There's a big section on deception. One example is Opus is fed news about Anthropic's safety team being disbanded but then hides that info from the user.

[0] https://www.anthropic.com/claude-opus-4-5-system-card

[1] https://dave.engineer/blog/2025/11/claude-opus-4.5-system-ca...

aurareturn4mo ago

  Pages 22–24 of Opus’s system card provide some evidence for this. Anthropic run a multi-agent search benchmark where Opus acts as an orchestrator and Haiku/Sonnet/Opus act as sub-agents with search access. Using cheap Haiku sub-agents gives a ~12-point boost over Opus alone.

Will this lead to another exponential in capabilities and token increase in the same order as thinking models?

1 more reply

bnchrch4mo ago

Seeing these benchmarks makes me so happy.

Not because I love Anthropic (I do like them) but because it's staving off me having to change my Coding Agent.

This world is changing fast, and both keeping up with State of the Art and/or the feeling of FOMO is exhausting.

Ive been holding onto Claude Code for the last little while since Ive built up a robust set of habits, slash commands, and sub agents that help me squeeze as much out of the platform as possible.

But with the last few releases of Gemini and Codex I've been getting closer and closer to throwing it all out to start fresh in a new ecosystem.

Thankfully Anthropic has come out swinging today and my own SOP's can remain in tact a little while longer.

hakanderyal4mo ago

I think we are at the point where you can reliably ignore the hype and not get left behind. Until the next breakthrough at least.

I've just ignored all the people pushing codex for the last weeks.

Don't fall into that trap and you'll be much more productive.

2 more replies

tordrt4mo ago

I tried codex due to the same reasoning you list. The grass is not greener on the other side.. I usually only opt for codex when my claude code rate limit hits.

bavell4mo ago

Same boat and same thoughts here! Hope it holds its own against the competition, I've become a bit of a fan of Anthropic and their focus on devs.

diego_sandoval4mo ago

1 more reply

adriand4mo ago

Don't throw away what's working for you just because some other company (temporarily) leapfrogs Anthropic a few percent on a benchmark. There's a lot to be said for what you're good at.

I also really want Anthropic to succeed because they are without question the most ethical of the frontier AI labs.

2 more replies

wahnfrieden4mo ago

You need much less of a robust set of habits, commands, sub agent type complexity with Codex. Not only because it lacks some of these features, it also doesn't need them as much.

sothatsit4mo ago

1 more reply

Stevvo4mo ago

With Cursor or Copilot+VSCode, you get all the models, can switch any time. When a new model is announced its available same day.

1 more reply

edf134mo ago

I’m threw a few hours at Codex the other day and was incredibly disappointed with the outcome…

I’m a heavy Claude code user and similar workloads just didn’t work out well for me on Codex.

2 more replies

futureshock4mo ago

https://arcprize.org/leaderboard

https://arcprize.org/blog/oai-o3-pub-breakthrough

energy1234mo ago

stavros4mo ago

On-topic, I love the fact that Opus is now three times cheaper. I hope it's available in Claude Code with the Pro subscription.

vunderba4mo ago

mscbuck4mo ago

bryanlarsen4mo ago

On Friday my Claude was particularly stupid. It's sometimes stupid, but I've never seen it been that consistently stupid. Just assumed it was a fluke, but maybe something was changing.

beydogan4mo ago

100% dumber, especially since last 3-4 days. I have two guesses:

- They make it dumber close to a new release to hype the new model

- They gave $1000 Claude Code Web credits to a lot of people, which increased the load a lot so they had to serve quantized version to handle the it.

I love Claude models but I hate this non transparency and instability.

kjgkjhfkjf4mo ago

My guess is that Claude's "bad days" are due to the service becoming overloaded and failing over to use cheaper models.

bamboozled4mo ago

Noticed it hard today, it's just "stupid" now.

simonw4mo ago

Notes and two pelicans: https://simonwillison.net/2025/Nov/24/claude-opus/

tkgally4mo ago

I added Opus 4.5 to my benchmark of 30 alternatives to your now-classic pelican-bicycle prompt (e.g., “Generate an SVG of a dragonfly balancing a chandelier”). Nine models are now represented:

https://gally.net/temp/20251107pelican-alternatives/index.ht...

2 more replies

dreis_sw4mo ago

1 more reply

jasonjmcghee4mo ago

Did you write the terminal -> html converter (how you display the claude code transcripts), or is that a library?

1 more reply

nojs4mo ago

> Thinking blocks from previous assistant turns are preserved in model context by default

This seems like a huge change no? I often use max thinking on the assumption that the only downside is time, but now there’s also a downside of context pollution

1 more reply

pjm3314mo ago

i think you have an error there about haiku pricing

> For comparison, Sonnet 4.5 is $3/$15 and Haiku 4.5 is $4/$20.

i think haiku should be $1/$5

1 more reply

throwaway20274mo ago

I wonder if at this point they read what people use to benchmark with and specifically train it to do well at this task.

diego_sandoval4mo ago

:%s/There model/Their model/g

jasonthorsness4mo ago

[1] https://github.com/jasonthorsness/tree-dangler

diego_sandoval4mo ago

What IDE/CLI tool do you use?

1 more reply

hu34mo ago

Gemini is indeed great for frontend HTML + CSS and even some light DOM manipulation in JS.

I have been using Gemini 2.5 and now 3 for frontend mockups.

When I'm happy with the result, after some prompt massage, I feed it to Sonnet 4.5 to build full stack code using the framework of the application.

redfloatplane4mo ago

Robdel124mo ago

Right! I thought this at the very bottom was super interesting

1 more reply

agentifysh4mo ago

wish they really bolded that part because i almost passed off on it until i read the blog carefully

instant upgrade to claude max 20x if they give opus 4.5 out like this

i still like codex-5.1 and will keep it.

gemini cli missed its opportunity again now money is hedged between codex and claude.

MaxLeiter4mo ago

We've added support for opus 4.5 to v0 and users are making some pretty impressive 1-shots:

https://x.com/mikegonz/status/1993045002306699704

https://x.com/MirAI_Newz/status/1993047036766396852

https://x.com/rauchg/status/1993054732781490412

pwython4mo ago

You're right to call that out. Looking back at what happened:

  1. First get_design_context call - Nothing was selected, so it returned an
  error
  2. Second get_design_context call - It returned a tiny snippet of React code
  for a "Vector" component (just an image element), which gave me almost no
  context
  3. First get_screenshot call - The image returned was essentially blank/empty
  (the output showed <output_image> with what appears to be no actual visual
  content)

  I then hallucinated an entire design description based on:
  - The project context from CLAUDE.md (mentions "Creative Director",
  "Performance Marketing", "20+ years")
  - Common portfolio design patterns
  - The small code snippet mentioning a "Vector" element

  That was a significant error on my part. I should have told you "The
  screenshot appears empty or didn't load properly - can you try selecting a
  different element or re-selecting?" instead of fabricating a description.

jumploops4mo ago

> Pricing is now $5/$25 per million [input/output] tokens

So it’s 1/3 the price of Opus 4.1…

> [..] matches Sonnet 4.5’s best score on SWE-bench Verified, but uses 76% fewer output tokens

…and potentially uses a lot less tokens?

Excited to stress test this in Claude Code, looks like a great model on paper!

alach114mo ago

This is the biggest news of the announcement. Prior Opus models were strong, but the cost was a big limiter of usage. This price point still makes it a "premium" option, but isn't prohibitive.

jmkni4mo ago

> Pricing is now $5/$25 per million tokens

For anyone else confused, it's input/output tokens

$5 for 1million tokens in $25 for 1million tokens out

2 more replies

dent94mo ago

bigmadshoe4mo ago

How could price per token not be a concern for any “multi-billion” or “multi-trillion dollar” business? Do they just burn money to remain profitable?

1 more reply

andai4mo ago

Why do they always cut off 70% of the y-axis? Sure it exaggerates the differences, but... it exaggerates the differences.

And they left Haiku out of most of the comparisons! That's the most interesting model for me. Because for some tasks it's fine. And it's still not clear to me which ones those are.

ximeng4mo ago

It’s a pretty arbitrary y axis - arguably the only thing that matters is the differences.

waynenilsen4mo ago

marketing.

obblekk4mo ago

this is the most interesting time for software tools since compilers and static typechecking was invented.

quantumHazer4mo ago

Last year’s model were at 50-60% on SWE bench-verified actually

1 more reply

keeeba4mo ago

Oh boy, if the benchmarks are this good and Opus feels like it usually does then this is insane.

I’ve always found Opus significantly better than the benchmarks suggested.

LFG

ddxv4mo ago

The LLMs rate of improvement has really slowed down. This looks like a minor improvement in terms of accuracy and big gains from efficiency.

energy1234mo ago

14 months ago we had GPT-4 and now we have models that can get a gold medal at the IMO.

But sure, if you curve fit to the last 3 months you could say things are slowing down, but that's hyper fixating on a very small amount of information.

2 more replies

irthomasthomas4mo ago

elvin_d4mo ago

Great seeing the price reduction. Opus historically was prices at 15/75, this one delivers at 5/25 which is close to Gemini 3 Pro. I hope Anthropic can afford increasing limits for the new Opus.

gigatexal4mo ago

Love the competition. Gemini 3 pro blew me away after being spoiled by Claude for coding things. Considered canceling my Anthropic sub but now I’m gonna hold on to it.

sbinnee4mo ago

They said that they have seen 134K tokens for tool definition alone. That is insane. I also really liked the puzzle game video.

[1] https://www.anthropic.com/engineering/advanced-tool-use

nickandbro4mo ago

"Create me a SVG of a PS4 controller"

Gemini 3.0 Pro: https://www.svgviewer.dev/s/CxLSTx2X

Opus 4.5: https://www.svgviewer.dev/s/dOSPSHC5

I think Opus 4.5 did a bit better overall, but I do think eventually frontier models will eventually converge to a point where the quality will be so good it will be hard to tell the winner.

esperent4mo ago

I can only see the svg code there on mobile. I don't see any way to view the output.

1 more reply

chaosprint4mo ago

SWE's results were actually very close, but they used a poor marketing visualization. I know this isn't a research paper, but for Anthropic, I expect more.

flakiness4mo ago

They should've used an error rate instead of the pass rate. Then it'll get the same visual appeal without cheating.

jaakkonen4mo ago

achierius4mo ago

Was this just 2 hours of the agent running on its own, or was there back-and-forth/any sort of interaction? How much did you have to set up scaffolding, e.g. tests?

ofermend4mo ago

Can't wait to try Opus 4.5

We just evaluated it for Vectara's grounded hallucination leaderboard: it scores at 10.9% hallucination rate, better than Gemini-3, GPT-5.1-high or Grok-4.

https://github.com/vectara/hallucination-leaderboard

andreybaskov4mo ago

threeducks4mo ago

Rough ballpark estimate:

- Amazon Bedrock serves Claude Opus 4.5 at 57.37 tokens per second: https://openrouter.ai/anthropic/claude-opus-4.5

- Amazon Bedrock serves gpt-oss-120b at 1748 tokens per second: https://openrouter.ai/openai/gpt-oss-120b

- gpt-oss-120b has 5.1B active parameters at approximately 4 bits per parameter: https://huggingface.co/openai/gpt-oss-120b

To generate one token, all active parameters must pass from memory to the processor (disregarding tricks like speculative decoding)

Multiplying 1748 tokens per second with the 5.1B parameters and 4 bits per parameter gives us a memory bandwidth of 4457 GB/sec (probably more, since small models are more difficult to optimize).

If we divide the memory bandwidth by the 57.37 tokens per second for Claude Opus 4.5, we get about 80 GB of active parameters.

With speculative decoding, the numbers might change by maybe a factor of two or so. One could test this by measuring whether it is faster to generate predictable text.

Of course, this does not tell us anything about the number of total parameters. The ratio of total parameters to active parameters can vary wildly from around 10 to over 30:

    120 : 5.1 for gpt-oss-120b
    30 : 3 for Qwen3-30B-A3B
    1000 : 32 for Kimi K2
    671 : 37 for DeepSeek V3

Even with the lower bound of 10, you'd have about 800 GB of total parameters, which does not fit into the 512 GB RAM of the M3 Ultra (you could chain multiple, at the cost of buying multiple).

2 more replies

docjay4mo ago

-or-

“ 62× RTX 4090 (24GB) or 16× H100 (80GB) or 13× M3 Max (128GB) “

So ~1.1TB. Of course it can be quantized down to as dumb as you can stand, even within ~250GB (https://docs.unsloth.ai/models/kimi-k2-thinking-how-to-run-l...).

But again, that’s for speed. You can run them more-or-less straight off the disk, but (~1TB / SSD_read_speed + computation_time_per_chunk_in_RAM) = a few minutes per ~word or punctuation.

1 more reply

nickandbro4mo ago

I use the following models like so nowadays:

Gemini is great, when you have gitingested the code of pypi package and want to use it as context. This comes in handy for tasks and repos outside the model's training data.

swapnilt4mo ago

sync4mo ago

Does anyone here understand "interleaved scratchpads" mentioned at the very bottom of the footnotes:

> All evals were run with a 64K thinking budget, interleaved scratchpads, 200K context window, default effort (high), and default sampling settings (temperature, top_p).

[0] https://arxiv.org/abs/2112.00114

dheerkt4mo ago

based on their past usage of "interleaved tool calling" it means that the tool can be used while the model is thinking.

https://aws.amazon.com/blogs/opensource/using-strands-agents...

1 more reply

alvis4mo ago

Havoc4mo ago

Interesting that the number of hn comments on big model announcements seems to be dropping. I recall previous ones easily surpassing 1k

Maybe models are starting to get good enough/ levelling off?

sd94mo ago

It's fatigue. This is the third major model announcement in the last week.

On the other hand, this is the one I'm most excited by. I wouldn't have commented at all if it wasn't for your comment. But I'm excited to start using this.

1 more reply

lm284694mo ago

"we gained 2.7% in these artificial benchmarks and here is a picture of a pelican on a bicycle, get excited and give us $7 trillion please"

morgengold4mo ago

I'm on a Claude Code Max subscription. Last days have been a struggle with Sonnet 4.5 - Now it switched to Claude Opus 4.5 as default model. Ridiculous good and fast.

starkparker4mo ago

Would love to know what's going on with C++ and PHP benchmarks. No meaningful gain over Opus 4.1 for either, and Sonnet still seems to outperform Opus on PHP.

aliljet4mo ago

ximeng4mo ago

syspec4mo ago

Thank you Claude.

GenerWork4mo ago

gsibble4mo ago

They lowered the price because this is a massive land grab and is basically winner take all.

I love that Antrhopic is focused on coding. I've found their models to be significantly better at producing code similar to what I would write, meaning it's easy to debug and grok.

Gemini does weird stuff and while Codex is good, I prefer Sonnet 4.5 and Claude code.

rw24mo ago

Gemini 3 in antigravity is significantly better than Claude code with either Opus or Sonnet that I struggle to see how they can compete. And I'm someone with the 100 dollar/month plan.

I can't even use Opus for a day before it runs out before. This will make it better but Antigravity has way better UI and also bug solving.

pingou4mo ago

What causes the improvements in new AI models recently? Is it just more training, or is it new, innovative techniques?

I_am_tiberius4mo ago

saaaaaam4mo ago

jmward014mo ago

adidoit4mo ago

Tested this building some PRs and issues that codex-5.1-max and gemini-3-pro were strugglig with

It planned way better in a much more granular way and then execute it better. I can't tell if the model is actually better or if it's just planning with more discipline

viraptor4mo ago

Mkengin4mo ago

I like this one: https://swe-rebench.com/

agentifysh4mo ago

again the question of concern as codex user is usage

its hard to get any meaningful use out of claude pro

after you ship a few features you are pretty much out of weekly usage

compared to what codex-5.1-max offers on a plan that is 5x cheaper

the 4~5% improvement is welcome but honestly i question whether its possible to get meaningful usage out of it the way codex allows it

for most use cases medium or 4.5 handles things well but anthropic seems to have way less usage limits than what openai is subsidizing

until they can match what i can get out of codex it won't be enough to win me back

edit: I upgraded to claude max! read the blog carefully and seems like opus 4.5 is lifted in usage as well as sonnet 4.5!

jstummbillig4mo ago

Well, that's where the price reduction comes in handy, no?

2 more replies

mutewinter4mo ago

Some early visual evaluations: https://x.com/mutewinter/status/1993037630209192276

adastra224mo ago

carcabob4mo ago

I wish the article's graphs weren't distorted by skipping so much of the scale to make it look like a more significant difference than it is. But it does looks impressive.

maherbeg4mo ago

Ok, the victorian lock puzzle game is pretty damn cool way to showcase the capabilities of these models. I kinda want to start building similar puzzle games for models to solve.

PilotJeff4mo ago

More blowing up of the bubble with anthropic essentially offering compute/LLM for below cost. Eventually the laws of physics/market will take over and look out below.

jstummbillig4mo ago

How would you know what the cost is?

jedberg4mo ago

Up until today, the general advice was use Opus for deep research, use Haiku for everything else. Given the reduction in cost here, does that rule of thumb no longer apply?

mudkipdev4mo ago

In my opinion Haiku is capable but there is no reason to use anything lower than Sonnet unless you are hitting usage limits

sd94mo ago

After experimenting with Gemini 3, I still felt like Sonnet 4.5 had the edge. So I'm very excited to start playing with this in the wild.

ramon1564mo ago

I've almost ran out of Claude on the Web credits. If they announce that they're going to support Opus then I'm going to be sad :'(

undeveloper4mo ago

haven't they all expired by now?

adt4mo ago

https://lifearchitect.ai/models-table/

alvis4mo ago

What surprise me is that Opus 4.5 lost all reasoning scores to Gemini and GPT. I thought it’s the area the model will shine the most

thot_experiment4mo ago

It's really hard for me to take these benchmarks seriously at all, especially that first one where Sonnet 4.5 is better at software engineering than Opus 4.1.

mirsadm4mo ago

1 more reply

AbstractH244mo ago

Amazing how every company's newest model performs best in the benchmarks they share in the announcment....

johnnycombin4mo ago

the most overhyped model ever, not even close to Gemini3 or GPT5.1 after 8h of complex tasks.

winrid4mo ago

So far this seems like a huge downgrade from Opus 4.1. Please add back 4.1 as an option...

t0lo4mo ago

So are we in agreement that claude is the thinking persons model and openai is for the masses

whitepoplar4mo ago

Does the reduced price mean increased usage limits on Claude Code (with a Max subscription)?

skerit4mo ago

Yes. Opus is now the default model in Claude Code. And Opus 4.5 counts the same toward your usage limit as Sonnet 4.5 did.

Even better: Sonnet 4.5 now has its own separate limit.

rishabhaiover4mo ago

Is this available on claude-code?

elvin_d4mo ago

Yes, the first run was nice - feels faster than 4.1 and did what Sonnet 4.5 struggled to execute properly.

greenavocado4mo ago

What are you thinking of trying to use it for? It is generally a huge waste of money to unleash Opus on high content tasks ime

2 more replies

rishabhaiover4mo ago

damn, I need a MAX sub for this.

1 more reply

andai4mo ago

This one is different. IYKYK...

rutagandasalim4mo ago

claude opus 4.5 is an incredible model i just one-shoted https://aithings.dev with it

jstummbillig4mo ago

What was the prompt?

CuriouslyC4mo ago

I_am_tiberius4mo ago

Still mad at them because they decided not to take their users' privacy serious. Would be interested how the new model behaves, but just have a mental lock and can't sign up again.

quantummagic4mo ago

I would look past their privacy issues and have wanted to sign up for over a year, but don't have a cellphone, which is required to register.

throwaway20274mo ago

Oh that's why there were only 2 usage bars.

xkbarkar4mo ago

This is great. Sonnet 4.5 has degraded terribly.

I can get some useful stuff from a clean context in the web ui but the cli is just useless.

Opus is far superiour.

Today sonnet 4.5 suggested to verify remote state file presence by creating an empty one locally and copy it to the remote backend. Da fuq? University level programmer my a$$.

And it seems like it has degraded this last month.

I keep getting braindead suggestions and code that looks like it came from a random word generator.

I swear it was not that awful a couple of months ago.

Opus cap has been an issue, happy to change and I really hope the nerf rumours are just that. Undounded rumours and the defradation has a valid root cause

But honestly sonnet 4.5 has started to act like a smoking pile of sh**t

idonotknowwhy4mo ago

>This is great. Sonnet 4.5 has degraded terribly. >I can get some useful stuff from a clean context in the web ui but the cli is just useless. >I swear it was not that awful a couple of months ago.

I agree on all 3 counts. And it still degrades after a few long turns in openwebui. You can test this by regenerating the last reply in chats from shortly after the model was released.

kachapopopow4mo ago

slightly better at react and spacial logic than gemini 3 pro, but slower and way more expensive.

synergy204mo ago

great, paying $100/m for claude code, this stops me from switching to gemini 3.0 for now.

clbrmbr4mo ago

GodelNumbering4mo ago

The fact that the post singled out SWE-bench at the top makes the opposite impression that they probably intended.

grantpitt4mo ago

do say more

1 more reply

cyrusradfar4mo ago

Given this tech is new, the experience of how we relate to their mistakes is something I think a bit about.

Am I alone here, are others finding themselves more forgiving of "their preferred" model provider?

irthomasthomas4mo ago

I guess you where not around a few months back when they over-optimized and served a degraded model for weeks.

1 more reply

tschellenbach4mo ago

Ok, but can it play Factorio?

fragmede4mo ago

Got the river crossing one:

https://claude.ai/chat/0c583303-6d3e-47ae-97c9-085cefe14c21

Still fucked up one about the boy and the surgeon though:

https://claude.ai/chat/d2c63190-059f-43ef-af3d-67e7ca1707a4

AJRF4mo ago

that chart at the start is egregious

tildef4mo ago

Feels like a tongue-in-cheek jab at the GPT-5 announcement chart.

robertwt74mo ago

0x79de4mo ago

this is quite a good

lerp-io4mo ago

80% and 77% is not that much lol

zb34mo ago

The first chart is straight from "how to lie in charts"..

1 more reply

j / k navigate · click thread line to collapse