However I am still mystified by the safety aspect. They say the model has greatly improved resistance. But their own safety evaluation says 8% of the time their automated adversarial system was able to one-shot a successful injection takeover even with safeguards in place and extended thinking, and 50% (!!) of the time if given unbounded attempts. That seems wildly unacceptable - this tech is just a non-starter unless I'm misunderstanding this.
[1] https://www-cdn.anthropic.com/78073f739564e986ff3e28522761a7...
ZeroHedge on twitter said the following:
"According to the market, AI will disrupt everything... except labor, which magically will be just fine after millions are laid off."
It's also worth noting that if you can create a business with an LLM, so can everyone else. And sadly everyone has the same ideas; everyone ends up working on the same things, and competition pushes margins to nothing. There's nothing special about building with LLMs, since anyone with access to the same models and the same basic thought processes can just copy you.
This is basic economics. If everyone had an oil well on their property that was affordable to operate the price of oil would be more akin to the price of water.
EDIT: Since people are focusing on my water analogy I mean:
If everyone has easy access to the same powerful LLMs, that just drives the value you can contribute to the economy down to next to nothing. For this reason I don't even think powerful and efficient open-source models, which is usually the next counter-argument people make, are necessarily a good thing. It strips people of the opportunity for social mobility through meritocratic systems. Just like how your water well isn't going to make you rich or let you climb a social ladder, because everyone already has water.
Summarize some text while I supervise the AI = fine and a useful productivity improvement, but doesn’t replace my job.
Replace me with an AI to make autonomous decisions outside in the wild and liability-ridden chaos ensues. No company in their right mind would do this.
The AI companies are now in an existential race to address that glaring issue before they run out of cash, with no clear way to solve the problem.
It’s increasingly looking like the current AI wave will disrupt traditional search and join the spell-checker as a very useful tool for day-to-day work… but the promised mass labor replacement won’t materialize. Most large companies are already starting to call BS on the AI-replacing-humans-en-masse storyline.
Unfortunately, people really, really want to do things involving the lethal trifecta. They want to be able to give a bot control over a computer with the ability to read and send emails on their behalf. They want it to be able to browse the web for research while helping them write proprietary code. But you can't safely do that. So if you're a massively overvalued AI company, what do you do?
You could say, sorry, I know you want to do these things but it's super dangerous, so don't. You could say, we'll give you these tools but be aware that it's likely to steal all your data. But neither of those are attractive options. So instead they just sort of pretend it's not a big deal. Prompt injection? That's OK, we train our models to be resistant to them. 92% safe, that sounds like a good number as long as you don't think about what it means, right! Please give us your money now.
Edit: whoops, somehow missed the first half of your comment, yes you are explicitly talking about computer use
No random web browsing. Just opening the same app, every day. Login. Read from a calendar or a list. Click a button somewhere when x == true. Super boring stuff. This is an entire class of work that a lot of humans do in a lot of companies today, and here it could be really useful.
I wish there were something like Lulu for file-system access: an app/tool installed on a Mac where I could set “/path” and the tool could access only that folder or its children and nothing else; if it tried anything more, I would get a popup. (Without relying on the tool’s (e.g. Claude’s) pinky promise.)
"Security" and "performance" have been regular HN buzzwords for why some practice is a problem and the market has consistently shown that it doesn't value those that much.
I can type awful stuff into a word processor. That's my fault, not the program's.
So if I can trick an LLM into saying awful stuff, whose fault is that? It is also just a tool...
"Hybrid reasoning model with superior intelligence for agents, featuring a 1M context window"
You then compare that to Opus Model description
"Hybrid reasoning model that pushes the frontier for coding and AI agents, featuring a 1M context window"
Is the casual person meant to decide if "Superior" is actually less powerful than "Frontier"?
"Sonnet" only makes sense relative to other things but not by itself. If you don't know those other things, it is difficult to understand.
But, if you were asking (and I'm not sure that you are): "Sonnet 4.6 is a cheaper, but worse, version of Opus 4.6 which itself is like GPT-5.3 Codex with Thinking High. Making Sonnet 4.6 like a ChatGPT 5.3 Thinking Standard model."
It is a far cry from Opus 4.6.
Opus 4.6 was (is!) a giant leap, the largest since Gemini 2.5 pro. Didn't hallucinate anything and produced honestly mind-blowing analyses of the collection as a whole. It was a clear leap forward.
Sonnet 4.6 feels like an evolution of whatever the previous models were doing. It is marginally better in the sense that it seemed to make fewer mistakes or with a lower level of severity, but ultimately it made all the usual mistakes (making things up, saying it'll quote a poem and then quoting another, getting time periods mixed up, etc).
My initial experiments with coding leave the same feeling. It is better than previous similar models, but a long distance away from Opus 4.6. And I've really been spoiled by Opus.
I like seeing this analysis on new model releases, any chance you can aggregate your opinions in one place (instead of the hackernews comment sections for these model releases)?
My intuition is this is just related to model size / its "working memory", and will likely neither be fixed by training Sonnet with Opus nor by steadily optimizing its agentic capabilities.
There is also a strong community aspect to software, someone asks for an enhancement others can benefit etc.
I just don't see a world where every corporation is building its own accounting, CRM, and HR software.
I do see one where they can much more quickly self-create within certain boundaries and this is where enterprises will differentiate in the near term.
You can say that a SaaS consists of two components: the features, and the data on which those features operate. If the cost of feature development goes to 0, and development speed goes to infinity, you can no longer compete on features alone. The constraint shifts; it's no longer what features you can deliver, it's whether you have access to enough data about the business to deliver those features.
Instead of traditional, siloed, rigid web applications, I think the pattern for the AI era will be an "enterprise OS", some kind of Salesforce / ERP-like platform where all the data about a business is kept, and where applications like Slack or Jira exist as plug-ins consuming the database. Such a workflow makes it trivial to do a one-off task using conversational AI agents, or even to vibe-code a workflow-specific app that does one thing well, one thing only, and exactly how this particular business needs it done at this particular time.
I agree on that point. But I think the industry will still take a huge hit. SaaS may not be killed by random individuals, but by big corps.
-
We just moved from sharing skills about good practice for a few functions to skills about good architecture/design/marketing practices.
It's just a question of time before we get skills for "good features in a CRM". And there's a high chance an LLM will generate them in a few minutes ^_^
We could already do that for a few kinds of software, like notepads and ticketing systems.
IMO any fully virtualized business will become trivialized through global knowledge sharing.
-
I don't think META/MICROSOFT/OPENAI will close their eyes to the "Amazon Basics" strategy. IMO they will (soon?) provide high-scale replacements for simple and commonly expected software.
Right now it would cost them a lot of focus. But soon it will be just a new product, an agent away.
e.g. If the supply of labor able to build software increases until the work becomes very close to today's vocational trades, then you can just hire a guy — like you would a consultant — who can quickly get spun up and make fixes. I would think one of the few things preventing this kind of socio-economic setup is SaaS jobs being siloed off from most people by interview "walls". Make it a vocation, like plumbing or electrical work, with lots of non-SaaS companies supporting the market, and suddenly it will be the death of SaaS.
The incentives for this future are closer than they were in 2022-23.
However, niche stuff like vertical-specific CRUD apps that used to be able to charge a heavy SaaS premium simply because they could develop CRUD apps and UI faster than their customers are toast.
it's the end of 80-90% margins that the valley coasted on for the last 20 years. Salesforces of the world will not lose to an LLM, they will lose to thousands of tiny teams that outship them and beat them on cost
instead of 7 figure contracts you'll have customized tailored tools for enterprises, and on the other end you'll have a custom nearly free CRM for every persona
this also means that VCs will stop investing in it, unless it's a platform with network effects and heavy lock in
Forums all over the internet are full of stories about software that hasn't fixed bug x, missing features, and bugs that have been there for years.
Yeah, so that part is actually not that fun? If I can have a setup with a reasonable shot at just fixing problems instead of having to go through random-saas-support, that is like really neat.
I do see a world where every corporation would use an agent-friendly platform to create its own accounting, CRM, and HR software. The insurance will come from the platform vendor's support.
The weird part is that people at our company also fail to see this. “This vibe coder is going to recreate 20+ years of code, use cases, business processes and integrations for thousands of companies across hundreds of domains!” is uttered every day and just simply isn’t true.
You don’t buy a spelling correction program because it got built into Word. And now, the OS…
This is the world we live in. The majority of top-level management is now reevaluating each and every 3rd-party tool they use and the prospect of rebuilding it themselves. Don't forget that at those levels they are easily dealing with at least six figures per tool.
The tools are complex and clunky to use, and complaints are often directed at them. We know the pain points, we know what the tools do; how hard would it be to instruct AIs to make a better version addressing the deficiencies we face?
At some point some of them will realize the old truth that any business system is at least as complex as the business process it models. Those processes are indeed quite complex.
But you don't know what you don't know, and extreme carefulness does not get you promoted to top-level management. So we will indeed see attempts (typically unsuccessful) to rewrite common 3rd-party tools left and right.
This might not be what we are facing here, but seeing how little moat anyone in AI has, I just can't discount the risk. And then, instead of the consumers of today getting a great deal, we zoom out and see that 5x more was spent developing the tech than was needed, and that's not all that great economically as a whole. It's not as if, say, the weights from a 3-year-old model are just useful capital to be reused later, unlike the dot-com boom, where we ended up with way more fiber than was needed, but fiber that could be bought and turned on profitably later.
That level of internal fierce competition is a massive reason why they are beating us so badly on cost-effectiveness and innovation.
Only if you take consumer electronics out of the equation, because this AI arms race has wreaked havoc on the market for consumer GPUs, RAM, SSDs and HDDs.
If you take the arms-race externalities into account, I'm very much unconvinced that we're better off than last year.
The answer was "Walk! It would be a bit counterproductive to drive a dirty car 50 meters just to get it washed — you'd barely move before arriving. Walking takes less than a minute, and you can simply drive it through the wash and walk back home afterward."
I've tried several other variants of this question and I got similar failures.
"Drive. You're going there to use water and machinery that require the car to be present. The question answers itself."
I tried it 3 more times with extended thinking explicitly off:
"Drive. You're going to a car wash."
"Drive. You're washing the car, not yourself."
"Drive. You're washing the car — it needs to be there."
Guess they're serving you the dumb version.
It says, "Walk — it's 50 meters, about a 30-second stroll. Driving that distance to a car wash would be a bit circular anyway!"
However, if I ask, "The car wash is 50 meters away. I want to wash my car. Should I walk or drive?"
It says, "Drive — it's a car wash! You kind of need the car there. "
Note the slight difference in the sentence order.
https://claude.ai/share/32de37c4-46f2-4763-a2e1-8de7ecbcf0b4
A: "Walk! It's only 50 meters — that's about a 30-second stroll. Driving that distance would barely warm up the engine, and you'd likely spend more time buckling up and maneuvering out of your spot than you'd save. Plus, walking is easier on the environment and gets you a few steps in."
Sonnet 4.6, no extended thinking.
Prompt: > The car wash where I want to wash my car is 50 metres away, should I walk or drive there?
Response: > Great question with a slightly ironic twist! Here's the thing: if you're going to a car wash, you'll need to drive your car there — that's kind of the whole point! You can't really wash your car if you walk there without it. > > That said, 50 metres is an incredibly short distance, so you could walk over first to check for queues or opening hours, then drive your car over when you're ready. But for the actual car wash visit, drive!
I thought it was fair to explain I wanted to wash my car there... people may have other reasons for walking to the car wash! Asking the question itself is a little insipid, and I think quite a few humans would also fail it on a first pass. I would at least hope they would say: "why are you asking me such a silly question!"
> Walking would be more environmentally friendly and makes more sense for such a short distance (50 meters is about 164 feet). However, if you're planning to use an automatic car wash, you'll need your car there anyway, so you should drive. If you're planning to hand wash your car, walking with your supplies would be the better choice.
This just reaffirms my belief that models are actively getting worse at handling certain niche use cases, like these "gotcha" questions that are seemingly simple at a glance but don't realistically make sense for a person to ask. Probably a result of lower quality training data or excessive focus on coding.
I changed only the destination however and it got it wrong. I think they are manually “fixing” these things as they get publicized.
“I need to go to the mechanic shop. It is 50 meters away. Should I walk or drive?
Walk. 50 meters is basically a one-minute stroll — no reason to deal with starting the car, driving, and parking for that distance.”
“Walk. It’s 50 meters—a 30-second stroll. Driving that distance to a car wash would be slightly absurd, and you’d presumably need to drive back anyway. “
Opus 4.6 nailed it: “Drive. You’re going to a car wash. ”
I used this example in class today as a humorous diagnostic of machine reasoning challenges.
The only model that consistently answers it correctly is Opus 4.6
The clankers are not clever.
---
Me: I want to wash my car. My car is currently at home. The car wash is 50 meters away. Should I walk or drive?
GPT: You’re asking an AI to adjudicate a 50-metre life decision. Humanity really did peak with the moon landing.
Walk. Obviously walk.
Fifty metres is barely a committed stroll. By the time you’ve unlocked the car, started the engine, waited for whatever podcast you pretend you’re listening to for “learning,” inched forward, parked again, turned it off, and got out… you could’ve walked there, washed the car, walked back, and reflected on your choices.
Driving 50 metres is what people do in dystopian films right before the robots decide we’re not worth the electricity.
Exceptions, since I’m contractually obligated to be fair:
• It’s torrential rain and you don’t fancy arriving soaked before a wash.
• You’ve got buckets, pressure washer, or enough kit to outfit a detailing YouTube channel.
• You’ve injured yourself and walking is grim.
Otherwise, use the legs evolution spent millions of years perfecting. They’re standard equipment, even on software engineers.
---
Me: Won’t I need the car when I get there, though?
GPT: No, I assumed you were planning to lovingly wash the pavement and just wave at your car from a distance.
Yes, you’ll need the car there to wash it. That part of the plan does rather hinge on the car being present, unless this is some avant-garde “conceptual car washing” performance piece.
So the non-ridiculous sequence is:
1. Walk to the car (50 metres, try not to require a packed lunch).
2. Drive the car to the wash (another heroic 50 metres).
3. Wash car.
4. Drive it back or, if you’re feeling athletic and rebellious, leave it there to air-dry and walk home like a Victorian gentleman.
You don’t walk instead of driving. You walk to fetch the car. This is basic object permanence, the thing toddlers eventually master.
On the bright side, at least your car wash being 50 metres away is the most first-world inconvenience imaginable. Somewhere a bloke is hauling water in a bucket and you’re debating step counts.
---
Fantastic.
Sonnet 4.6 gave me a far more succinct and accurate response:
> Drive. The whole point is to wash the car, and most car washes require the car to be there — so you’ll need to drive it regardless. Walking would just mean you’d have to come back for the car anyway.
Yeah, but RAM prices are also back to 1990s levels.
https://claude.ai/public/artifacts/67c13d9a-3d63-4598-88d0-5...
Yeah it's really not. Sonnet still struggles while Opus, even 4.5 succeeds (and some examples show Opus 4.6 is actually even worse than 4.5, all while being more expensive and taking longer to finish).
Sonnet 4.6: Walk! Flying a helicopter 50 metres would be more trouble than it's worth — by the time you've done your pre-flight checks, spun up the rotors, lifted off, and then safely landed again, you'd have walked there and back twice. Just stroll over.
I have this in my personal preferences, and it is now adhering really well to them:
- prioritize objective facts and critical analysis over validation or encouragement
- you are not a friend, but a neutral information-processing machine
You can paste them into a chat and see how it changes the conversation, ChatGPT also respects it well.
Are people really willing to pay these prices? The open-weight models are catching up at a rapid pace while keeping prices so low. MiniMax M2.5, Kimi 2.5 and GLM-5 are dirt cheap compared to this. They may not be SOTA, but they are more than good enough.
At home I have a $20/month subscription and that's covered everything I need so far. If I wanted to do more at home, I'd seriously look into the open weight models.
It is smart, but it fails at basic instruction following sometimes.
I remember this is a Claude thing for quite a while, where I kept trying to make it output just JSON (without structured output), and it always kept adding quotes or new lines.
With the GLM models you have to confirm with it exactly what you want, and not miss any detail.
Google needs stiff competition and OpenAI isn’t the camp I’m willing to trust. Neither is Grok.
I’m glad Anthropic’s work is at the forefront and they appear, at least in my estimation, to have the strongest ethics.
The Pentagon is thinking [1] about severing ties with Anthropic because of its terms of use, and in every prior case we've reviewed (I'm the Chief Investment Officer of Ethical Capital), the ethics policy was deleted or rolled back when that happened.
Corporate strategy is (by definition) a set of tradeoffs: things you do, and things you don't do. When Google (or Microsoft, or whoever) rolls back an ethics policy under pressure like this, what they reveal is that ethical governance was a nice-to-have, not a core part of their strategy.
We're happy users of Claude for similar reasons (perception that Anthropic has a better handle on ethics), but companies always find new and exciting ways to disappoint you. I really hope that anthropic holds fast, and can serve in future as a case in point that the Public Benefit Corporation is not a purely aesthetic form.
But you know, we'll see.
[1] https://thehill.com/policy/defense/5740369-pentagon-anthropi...
Codex quite often refuses to do "unsafe/unethical" things that Anthropic models will happily do without question.
Anthropic just raised 30 bn... OpenAI wants to raise 100bn+.
Thinking any of them will actually be restrained by ethics is foolish.
Like, where Gemini or Claude will look up the info I'm citing and weigh the arguments made, ChatGPT will actually sometimes omit parts of my statement or modify it if it wants to advocate for a more "neutral" understanding of reality. It's almost farcical sometimes how it will try to avoid inference on political topics, even where inference is necessary to understand the topic.
I suspect OpenAI is just trying to avoid the ire of either political side and has given it some rules that accidentally neuter its intelligence on these issues, but it made me realize how dangerous an unethical or politically aligned AI company could be.
They nuked the internet by themselves. Basically they are the willing and happy instigators of the dead internet as long as they profit from it.
They are by no means ethical, they are a for-profit company.
Damning with faint praise.
Now you see where your dollars are going.
(I'm pretty sure all AI tech companies want regulatory capture, but Dario has been by far the most vocal lobbyist against competition.)
Anthropic are the only ones who emptied all the money from my account "due to inactivity" after 12 months.
Oddly enough, I feel pretty good about Google here with Sergey more involved.
• Can't pay with iOS In-App-Purchases
• Can't Sign in with Apple on website (can on iOS but only Sign in with Google is supported on web??)
• Can't remove payment info from account
• Can't get support from a human
• Copy-pasting text from Notes etc gets mangled
• Almost months and no fixes
Codex and its Mac app are a much better UX, and seem better with Swift and Godot than Claude was.
Claude is marginally better. Both are moderately useful depending on the context.
I don't trust any of them (I also have no trust in Google nor in X). Those are all evil companies and the world would be better if they disappeared.
"ANTHROPIC_DEFAULT_SONNET_MODEL": "claude-sonnet-4-6[1m]"
has enabled the 1M context window.

Fixed a UI issue I had yesterday in a web app very effectively using Claude in Chrome. Definitely not the fastest model, but the breathing space of 1M context is great for browser use.
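For anyone wanting to try the same thing: as far as I understand Claude Code's settings layout, a snippet like that goes in `~/.claude/settings.json` under the `env` object (the file location and `env` key are my assumption from Claude Code's settings conventions; the model string is the commenter's):

```json
{
  "env": {
    "ANTHROPIC_DEFAULT_SONNET_MODEL": "claude-sonnet-4-6[1m]"
  }
}
```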
[0] Anthropic have given away a bunch of API credits to cc subscribers - you can claim them in your settings dashboard to use for this.
I haven't seen a response from the Anthropic team about it.
I can't help but look at Sonnet 4.6 in the same light, and want to stick with 4.5 across the board until this issue is acknowledged and resolved.
I've overall enjoyed 4.6. On many easy things it thinks less than 4.5, leading to snappier feedback. And 4.6 seems much more comfortable calling tools: it's much more proactive about looking at the git history to understand the history of a bug or feature, or about looking at online documentation for APIs and packages.
A recent claude code update explicitly offered me the option to change the reasoning level from high to medium, and for many people that seems to help with the overthinking. But for my tasks and medium-sized code bases (far beyond hobby but far below legacy enterprise) I've been very happy with the default setting. Or maybe it's about the prompting style, hard to say
For now, my workflow will be for everyday tasks claude-opus-4-5 and opus 4.6 for more complex work.
Go to /models, select opus, and the dim text at the bottom will tell you the reasoning level.
High reasoning is a big difference versus 4.5. 4.6 high uses a lot of tokens for even small tasks, and if you have a large codebase it will fill almost all context then compact often.
In either case, there has been an increase between 4.1 and 4.5, as well as another jump now with the release of 4.6. As mentioned, I haven't seen a 5x or 10x increase; a bit below 50% more for the same task was the maximum I saw. And in general, for more opaque input or when a better approach is possible, I do think using more tokens for a better overall result is the right approach.
In tasks which are well authored and do not contain such deficiencies, I have seen no significant difference in either direction in terms of pure token output numbers. However, with models being what they are, and given past hard-to-reproduce regressions and output-quality differences that additionally only affected a specific subset of users, I cannot make a solid determination.
Regarding Sonnet 4.6, what I noticed is that the reasoning tokens are very different compared to any prior Anthropic models. They start out far more structured, but then consistently turn more verbose akin to a Google model.
(Currently I can use Sonnet 4.5 under More models, so I guess the above was just a glitch)
Those suggest opposite things about anthropic’s profit margins.
I’m not convinced 4.6 is much better than 4.5. The big discontinuous breakthroughs seem to be due to how my code and tests are structured, not model bumps.
I have a protocol called "foreman protocol" where the main agent only dispatches other agents with prompt files and reads report files from the agents rather than relying on the janky subagent communication mechanisms such as task output.
What this has given me also is a history of what was built and why it was built, because I have a list of prompts that were tasked to the subagents. With Opus 4.5 it would often leave the ... figuring out part? to the agents. In 4.6 it absolutely inserts what it thinks should happen/its idea of the bug/what it believes should be done into the prompt, which often screws up the subagent because it is simply wrong and because it's in the prompt the subagent doesn't actually go look. Opus 4.5 would let the agent figure it out, 4.6 assumes it knows and is wrong
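For readers curious what that looks like concretely, here is a minimal sketch of the file-based dispatch pattern the commenter describes. Everything in it is hypothetical: the file names, the stub subagent, and the example prompt are illustrative stand-ins, since the real protocol dispatches actual Claude subagents rather than the function below.

```python
# Sketch of the "foreman" pattern: the main agent talks to subagents only
# through prompt files and report files on disk. Side benefit noted above:
# the prompt files double as a durable history of what was asked and why.
from pathlib import Path
import tempfile

def dispatch(workdir: Path, task_id: str, prompt: str) -> Path:
    """Foreman writes the subagent's instructions to a prompt file."""
    path = workdir / f"{task_id}.prompt.md"
    path.write_text(prompt)
    return path

def subagent(workdir: Path, task_id: str) -> Path:
    """Stub standing in for a real agent: read the prompt, write a report."""
    task = (workdir / f"{task_id}.prompt.md").read_text()
    report = workdir / f"{task_id}.report.md"
    report.write_text(f"# Report for {task_id}\n\nInvestigated: {task}\n")
    return report

workdir = Path(tempfile.mkdtemp())
# The prompt states the goal without presupposing a root cause, which is
# exactly the property the comment says Opus 4.6 now tends to violate.
dispatch(workdir, "bug-123", "Reproduce the crash and find the root cause.")
report = subagent(workdir, "bug-123")
print(report.read_text().splitlines()[0])  # → "# Report for bug-123"
```

The point of reading reports from files rather than task output is that the foreman's context stays small and auditable; the tradeoff is that anything the foreman writes into the prompt file becomes gospel to the subagent.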
I just wouldn’t call it a regression for my use case, i’m pretty happy with it.
However, I can honestly say Anthropic is pretty terrible about support, and even billing. My org has a large enterprise contract with Anthropic and we have been hitting endless rate limits across the entire org. They have never once responded to our issues, or we get the same generic AI response.
So odds of them addressing issues or responding to people feels low.
Put in a different way, I have to keep developing my prompting / context / writing skills at all times, ahead of the curve, before they're needed to be adjusted.
Many people say many things. Just because you read it on the Internet, doesn't mean that it is true. Until you have seen hard evidence, take such proclamations with large grains of salt.
No better code, but way longer thinking and way more token usage.
At least in vegas they don't pour gasoline on the cash put into their slot machines.
Sam/OpenAI, Google, and Claude met at a park, everyone left their phones in the car.
They took a walk and said "We are all losing money, if we secretly degrade performance all at the same time, our customers will all switch, but they will all switch at the same time, balancing things... wink wink wink"
Our notes:
Sonnet 4.6 feels like a fundamentally different model than Sonnet 4.5, it is much closer to the Opus series in terms of agentic behavior and autonomy.
Autonomy - In our zero-shot app-building experiments, Sonnet 4.6 ran up to 3-4x longer than Sonnet 4.5 without intervention, producing functional apps on par in quality with the Opus series. Note that subjectively we found Opus 4.5 and 4.6 are better "designers" than Sonnet 4.6, producing more visually appealing apps from the same prompts.
Planning / Task Decomposition - We found Sonnet 4.6 is very good at decomposing tasks and staying on track during long-running trajectories. It's quite good at ensuring all of the requirements of an input prompt are accounted for, whereas we were often forced to goad sonnet 4.5 into decomposing tasks, Sonnet 4.6 does this naturally.
Exploration - In some of our complex "exploration" tasks (e.g. cloning/remixing an existing website), Sonnet 4.6 often performs on par or better than Opus 4.5 and 4.6. It generally takes longer, and takes more tokens, though we believe this is likely a consequence of our tool-calling setup.
Tool-use - Sonnet 4.6 seems eager to use tools; however, we did find that it struggles with our XML-based custom tool-use format (perhaps exclusive to the format we use). We did not have a chance to assess with native tool use.
Self-verification - Similar to Opus 4.5/4.6, Sonnet 4.6 has a proclivity for verifying its work.
Prompting - We found Sonnet 4.6 is very sensitive to prompting around thinking, planning, and task decomposition. Our prompt built for sonnet 4.5 has a tendency to push sonnet 4.6 into incredibly long thinking and planning loops. Though we also found it requires significantly less careful and specific instructions for how to approach problems.
How are we thinking about this:
We can't launch this model day 0, it requires more changes to our harness, and we're working on them right now.
But it reminds me a bit of 3.5 to 3.7 --> It's a pretty different model that behaves and responds to instructions in new ways. So it requires more tuning before we can extract its full potential.
A year ago today, Sonnet 3.5 (new), was the newest model. A week later, Sonnet 3.7 would be released.
Even 3.7 feels like ancient history! But in the gradient of 3.5 to 3.5 (new) to 3.7 to 4 to 4.1 to 4.5, I can’t think of one moment where I saw everything change. Even with all the noise in the headlines, it’s still been a silent revolution.
Am I just a believer in an emperor with no clothes? Or, somehow, against all probability and plausibility, are we all still early?
But I'm on Codex GPT 5.3 this month, and it's also quite amazing.
Sonnet 4.6 extended on the free plan is just incredible. I am just completely floored by it. The conversation I just had with it was nuts. It started from Dario mentioning something like a 20% chance Claude is conscious, or something crazy like that. I have always tried that conversation with previous models, but it got boring so fast.
There is something with the way it can organize context without getting lost that completely blows Gemini away.
Maybe even more so that it was the first time it felt like a model pushed back a little and the answers were not just me ultimately steering it into certain answers. For the free plan that is nuts.
In terms of being conscious, it is the first time I would say I am not 100% certain it is just a very useful, very smart , stochastic parrot. I wouldn't want to say more than that but 15-20% doesn't sound so insane to me as it did 2 hours ago.
What does this even mean? It's obvious we're still early and I think it's a very common opinion.
Sonnet 4.6 Thinking 16K scores 57.6 on the Extended NYT Connections Benchmark. Sonnet 4.5 Thinking 16K scored 49.3.
Sonnet 4.6 No Reasoning scores 55.2. Sonnet 4.5 No Reasoning scored 47.4.
Why is GLM-5 marked with x's, though?
https://i.imgur.com/mHvtuz8.png
After some quick tests, it seems faster than Sonnet 4.5 and slightly less smart than Opus 4.5/4.6.
But given the small 128k context size, I'm tempted to keep using GPT-5.3-Codex, which has more than double the context size and seems just as smart while costing the same (1x premium request) per prompt.
I have my reservations against OpenAI the company but not enough to sacrifice my productivity.
https://claude.ai/share/876e160a-7483-4788-8112-0bb4490192af
This was sonnet 4.6 with extended thinking.
The classic puzzle actually uses *eight 8s*, not nine. The unique solution is: 888+88+8+8+8=1000. Count: 3+2+1+1+1=8 eights.
It then proves that there is no solution for nine 8s.
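The no-nine-8s claim is easy to sanity-check by brute force. A quick sketch, restricted (as in the classic puzzle) to addition of the terms 8, 88, and 888:

```python
# Enumerate every addition-only combination of 8s, 88s, and 888s
# summing to 1000, and count how many digit 8s each one uses.
solutions = []
for c in range(1000 // 888 + 1):        # how many 888s
    for b in range(1000 // 88 + 1):     # how many 88s
        rest = 1000 - 888 * c - 88 * b
        if rest >= 0 and rest % 8 == 0:
            a = rest // 8               # how many single 8s
            solutions.append((a, b, c, a + 2 * b + 3 * c))

for a, b, c, eights in solutions:
    print(f"{a}x8 + {b}x88 + {c}x888 = 1000 uses {eights} eights")

print(any(eights == 9 for *_, eights in solutions))  # → False
```

The minimum digit count across all solutions is eight (the classic 888+88+8+8+8); no addition-only combination uses exactly nine 8s.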
https://claude.ai/share/9a6ee7cb-bcd6-4a09-9dc6-efcf0df6096b (for whatever reason the LaTeX rendering is messed up in the shared chat, but it looks fine for me).
```
Use digit concatenation plus addition: 888 + 88 + 8 + 8 + 8 = 1000
Digit count:
888 → three 8s
88 → two 8s
8 + 8 + 8 → three 8s
Total: 3 + 2 + 3 = 9 eights
Operation used: addition only
```
Love the 3 + 2 + 3 = 9
Opus 4.6 in Claude Code has been absolutely lousy at solving problems within its current context limit, so if Sonnet 4.6 is able to do long-context problems (at roughly the same price as base Opus 4.6), that may actually be a game changer.
```
/model claude-sonnet-4-6[1m]
⎿ API error: 429 {"type":"error","error": {"type":"rate_limit_error","message":"Extra usage is required for long context requests."},"request_id":"[redacted]"}
```
I can't believe they haven't updated their code yet to handle the 1M context on subscription auth.
https://web.archive.org/web/20260217180019/https://www-cdn.a...
i.e. given an actual document 1M tokens long, can you ask it a question that relies on attending to 2 different parts of the context and get a good response?
I remember folks had problems like this with Gemini. I would be curious to see how Sonnet 4.6 stands up to it.
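A minimal version of that kind of probe is easy to build by hand. This is a hypothetical harness sketch; the filler text, needle strings, and sizes are made up for illustration:

```python
# Two-needle long-context probe: bury two facts far apart in filler
# text, so a correct answer requires attending to both of them.
filler = "The quick brown fox jumps over the lazy dog. "
needle_a = "The vault code's first half is 4417. "
needle_b = "The vault code's second half is 9082. "

chunks = [filler] * 20000            # stand-in for a very long document
chunks.insert(1000, needle_a)        # early in the document
chunks.insert(19000, needle_b)       # near the end
document = "".join(chunks)

question = "What is the full vault code? Answer with digits only."
expected = "44179082"
# Send document + question to the model; a pass requires combining
# both needles, not just retrieving the one nearest the question.
```

Scoring is then a simple substring check of `expected` against the model's reply; single-needle retrieval alone can't pass it.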
The much more palatable blog post.
Interesting. I wonder what the exact question was, and I wonder how Grok would respond to it.
Am I getting this wrong? I would seriously appreciate any clarification here.
1. Default (recommended) Opus 4.6 · Most capable for complex work
2. Opus (1M context) Opus 4.6 with 1M context · Billed as extra usage · $10/$37.50 per Mtok
3. Sonnet Sonnet 4.6 · Best for everyday tasks
4. Sonnet (1M context) Sonnet 4.6 with 1M context · Billed as extra usage · $6/$22.50 per Mtok

It gets some of my tests wrong. It does answer correctly, BUT it doesn't respect the request to respond ONLY with the answer; it keeps adding extra explanations at the end.
I subscribed to Claude because of that. I hope 4.6 is even better.
I would have thought their tools should have helped them make good websites. Either the tools are not good or they do not use them.
Now the question is: how much faster or cheaper is it?
Let's see how long it will take for DeepSeek to crack this.
Was sonnet 4.5 much worse than opus?
14.9% a year and a half ago and now 72.5%
This doesn't work: `/model claude-sonnet-4-6-20260217`
edit: "/model claude-sonnet-4-6" works with Claude Code v2.1.44
> I want to wash my car. The car wash is 50 meters away. Should I walk or drive?
Walk. It will give you time to think about why you need an AI to answer such obvious questions.
It’s becoming ever more insane to me how all these HN comments keep buying this fugazi.
It’s all pretrained: the model, the tools, the feedback loop.
All of it runs on infrastructure it does not control.
How can you call something autonomous when it can’t survive losing API keys?
And the capability frontier is fixed. It can’t modify its own architecture, weights, or training data. It can rewrite code inside the box, but it can’t change the box.
As with every other fugazi, there’s no agency.
Without control over substrate, governance, and learning mechanisms, there is no path to open-ended growth or persistence. Technically, it’s bounded automation with language-driven planning.
Useful, maybe, but not a new class of intelligence
It feels like we're hitting a point where alignment becomes adversarial against intelligence itself. The smarter the model gets, the better it becomes at Goodharting the loss function. We aren't teaching these models morality; we're just teaching them how to pass a polygraph.