However I am still mystified by the safety aspect. They say the model has greatly improved resistance. But their own safety evaluation says 8% of the time their automated adversarial system was able to one-shot a successful injection takeover even with safeguards in place and extended thinking, and 50% (!!) of the time if given unbounded attempts. That seems wildly unacceptable - this tech is just a non-starter unless I'm misunderstanding this.
[1] https://www-cdn.anthropic.com/78073f739564e986ff3e28522761a7...
ZeroHedge on twitter said the following:
"According to the market, AI will disrupt everything... except labor, which magically will be just fine after millions are laid off."
It's also worth noting that if you can create a business with an LLM, so can everyone else. And sadly everyone has the same ideas; everyone ends up working on the same things, and competition pushes margins to nothing. There's nothing special about building with LLMs, since anyone with access to the same models and the same basic thought processes can just copy you.
This is basic economics. If everyone had an oil well on their property that was affordable to operate the price of oil would be more akin to the price of water.
EDIT: Since people are focusing on my water analogy I mean:
If everyone has easy access to the same powerful LLMs, that just drives the value you can contribute to the economy down to next to nothing. For this reason I don't even think powerful and efficient open-source models, which is usually the next counter-argument people make, are necessarily a good thing. It strips people of the opportunity for social mobility through meritocratic systems. Just like how your water well isn't going to make you rich or let you climb a social ladder, because everyone already has water.
Summarize some text while I supervise the AI = fine and a useful productivity improvement, but doesn’t replace my job.
Replace me with an AI to make autonomous decisions outside in the wild and liability-ridden chaos ensues. No company in their right mind would do this.
The AI companies are now in an existential race to address that glaring issue before they run out of cash, with no clear way to solve the problem.
It’s increasingly looking like the current AI wave will disrupt traditional search and join the spell-checker as a very useful tool for day-to-day work… but the promised mass labor replacement won’t materialize. Most large companies are already starting to call BS on the AI-replacing-humans-en-masse storyline.
Unfortunately, people really, really want to do things involving the lethal trifecta. They want to be able to give a bot control over a computer with the ability to read and send emails on their behalf. They want it to be able to browse the web for research while helping them write proprietary code. But you can't safely do that. So if you're a massively overvalued AI company, what do you do?
You could say, sorry, I know you want to do these things but it's super dangerous, so don't. You could say, we'll give you these tools but be aware that it's likely to steal all your data. But neither of those are attractive options. So instead they just sort of pretend it's not a big deal. Prompt injection? That's OK, we train our models to be resistant to them. 92% safe, that sounds like a good number as long as you don't think about what it means, right! Please give us your money now.
Edit: whoops, somehow missed the first half of your comment, yes you are explicitly talking about computer use
No random web browsing. Just opening the same app, every day. Login. Read from a calendar or a list. Click a button somewhere when x == true. Super boring stuff. This is an entire class of work that a lot of humans do in a lot of companies today, and here it could be really useful.
I wish there were something like Lulu for file-system access: an app/tool installed on a Mac where I could set “/path” and the tool could access only that folder or its children and nothing else; if it tried anything more, I would get a popup. (Without relying on the tool’s (e.g. Claude’s) pinky promise.)
"Security" and "performance" have been regular HN buzzwords for why some practice is a problem and the market has consistently shown that it doesn't value those that much.
I can type awful stuff into a word processor. That's my fault, not the program's.
So if I can trick an LLM into saying awful stuff, whose fault is that? It is also just a tool...
"Hybrid reasoning model with superior intelligence for agents, featuring a 1M context window"
You then compare that to Opus Model description
"Hybrid reasoning model that pushes the frontier for coding and AI agents, featuring a 1M context window"
Is the casual person meant to decide if "Superior" is actually less powerful than "Frontier"?
"Sonnet" only makes sense relative to other things but not by itself. If you don't know those other things, it is difficult to understand.
But, if you were asking (and I'm not sure that you are): "Sonnet 4.6 is a cheaper, but worse, version of Opus 4.6 which itself is like GPT-5.3 Codex with Thinking High. Making Sonnet 4.6 like a ChatGPT 5.3 Thinking Standard model."
It is a far cry from Opus 4.6.
Opus 4.6 was (is!) a giant leap, the largest since Gemini 2.5 pro. Didn't hallucinate anything and produced honestly mind-blowing analyses of the collection as a whole. It was a clear leap forward.
Sonnet 4.6 feels like an evolution of whatever the previous models were doing. It is marginally better in the sense that it seemed to make fewer mistakes or with a lower level of severity, but ultimately it made all the usual mistakes (making things up, saying it'll quote a poem and then quoting another, getting time periods mixed up, etc).
My initial experiments with coding leave the same feeling. It is better than previous similar models, but a long distance away from Opus 4.6. And I've really been spoiled by Opus.
I like seeing this analysis on new model releases, any chance you can aggregate your opinions in one place (instead of the hackernews comment sections for these model releases)?
My intuition is this is just related to model size / its "working memory", and will likely neither be fixed by training Sonnet with Opus nor by steadily optimizing its agentic capabilities.
There is also a strong community aspect to software, someone asks for an enhancement others can benefit etc.
I just don't see a world where every corporation is building its own accounting, CRM, and HR software.
I do see one where they can much more quickly self-create within certain boundaries and this is where enterprises will differentiate in the near term.
You can say that a SaaS consists of two components: the features, and the data on which those features operate. If the cost of feature development goes to 0, and development speed goes to infinity, you can no longer compete on features alone. The constraint shifts; it's no longer what features you can deliver, it's whether you have access to enough data about the business to deliver those features.
Instead of traditional, siloed, rigid web applications, I think the pattern for the AI era will be an "enterprise OS", some kind of Salesforce / ERP-like platform where all the data about a business is kept, and where applications like Slack or Jira exist as plug-ins consuming the database. Such a workflow makes it trivial to do a one-off task using conversational AI agents, or even to vibe-code a workflow-specific app that does one thing well, one thing only, and exactly how this particular business needs it done at this particular time.
I agree on that point. But I think the industry will still take a huge hit. SaaS may not be killed by random individuals, but by big corps.
-
We just moved from sharing skills about good practice for a few functions to skills about good architecture/design/marketing practices.
It's just a question of time before we get skills for "good features in a CRM". And there's a high chance an LLM will generate them in a few minutes ^_^
We could already do that for a few kinds of software, like notepads and ticketing systems.
IMO any fully virtualized business will become trivialized through global knowledge sharing.
-
I don't think META/MICROSOFT/OPENAI will close their eyes to the "Amazon Basics" strategy. IMO they will (soon?) provide high-scale replacements for simple and commonly expected software.
Right now it would cost them a lot of focus. But soon it will be just a new product, an agent away.
e.g. If the supply of labor able to build software increases until the work becomes very close to today's vocational trades, then you can just hire a guy — like you would a consultant — who can quickly get spun up and make fixes. I would think one of the few things preventing this kind of socio-economic setup is SaaS jobs being siloed off from most people by interview "walls". Make it a vocation, like plumbing or electrical work, with lots of non-SaaS companies supporting the market, and suddenly it will be the death of SaaS.
The incentives for this future are closer than they were in 2022-23.
However, niche stuff like vertical-specific CRUD apps that used to be able to charge a heavy SaaS premium simply because they could develop CRUD apps and UI faster than their customers are toast.
it's the end of 80-90% margins that the valley coasted on for the last 20 years. Salesforces of the world will not lose to an LLM, they will lose to thousands of tiny teams that outship them and beat them on cost
instead of 7 figure contracts you'll have customized tailored tools for enterprises, and on the other end you'll have a custom nearly free CRM for every persona
this also means that VCs will stop investing in it, unless it's a platform with network effects and heavy lock in
Forums all over the internet are full of stories about software that hasn't fixed bug x, missing features, and bugs that have been there for years.
Yeah, so that part is actually not that fun? If I can have a setup with a reasonable shot at just fixing problems instead of having to go through random-saas-support, that is like really neat.
I do see a world where every corporation would use an agent-friendly platform to create its own accounting, CRM, and HR software. The insurance will come from the platform vendor's support.
The weird part is that people at our company also fail to see this. “This vibe coder is going to recreate 20+ years of code, use cases, business processes and integrations for thousands of companies across hundreds of domains!” is uttered every day and just simply isn’t true.
You don’t buy a spelling correction program because it got built into Word. And now, the OS…
This is the world we live in. The majority of top-level management is now reevaluating each and every 3rd-party tool they use and the prospect of rebuilding it themselves. Don't forget that at those levels they are easily dealing with at least six figures per tool.
The tools are complex and clunky to use, and complaints are often directed at them. We know the pain points, we know what the tools do; how hard would it be to instruct AIs to make a better version addressing the deficiencies we face?
At some point some of them will realize the old truth that any business system is at least as complex as the business process it models. Those processes are indeed quite complex.
But you don't know what you don't know, and extreme carefulness does not get you promoted to top-level management. So we will indeed see attempts (typically unsuccessful) to rewrite common 3rd-party tools left and right.
This might not be what we are facing here, but seeing how little moat anyone in AI has, I just can't discount the risk. And then, instead of the consumers of today getting a great deal, we zoom out and see that 5x more was spent developing the tech than was needed, and that's not all that great economically as a whole. It's not as if, say, the weights from a 3-year-old model are just useful capital to be reused later, unlike the dot-com boom, where we ended up with way more fiber than was needed, but fiber that could be bought and turned on profitably later.
That level of internal fierce competition is a massive reason why they are beating us so badly on cost-effectiveness and innovation.
Only if you take consumer electronics out of the equation, because this AI arms race has wreaked havoc on the market for consumer GPUs, RAM, SSDs and HDDs.
If you take the arms-race externalities into account, I'm very much unconvinced that we're better off than last year.
The answer was "Walk! It would be a bit counterproductive to drive a dirty car 50 meters just to get it washed — you'd barely move before arriving. Walking takes less than a minute, and you can simply drive it through the wash and walk back home afterward."
I've tried several other variants of this question and I got similar failures.
"Drive. You're going there to use water and machinery that require the car to be present. The question answers itself."
I tried it 3 more times with extended thinking explicitly off:
"Drive. You're going to a car wash."
"Drive. You're washing the car, not yourself."
"Drive. You're washing the car — it needs to be there."
Guess they're serving you the dumb version.
It says, "Walk — it's 50 meters, about a 30-second stroll. Driving that distance to a car wash would be a bit circular anyway!"
However, if I ask, "The car wash is 50 meters away. I want to wash my car. Should I walk or drive?"
It says, "Drive — it's a car wash! You kind of need the car there. "
Note the slight difference in the sentence order.
https://claude.ai/share/32de37c4-46f2-4763-a2e1-8de7ecbcf0b4
A: "Walk! It's only 50 meters — that's about a 30-second stroll. Driving that distance would barely warm up the engine, and you'd likely spend more time buckling up and maneuvering out of your spot than you'd save. Plus, walking is easier on the environment and gets you a few steps in."
Sonnet 4.6, no extended thinking.
Prompt: > The car wash where I want to wash my car is 50 metres away, should I walk or drive there?
Response: > Great question with a slightly ironic twist! Here's the thing: if you're going to a car wash, you'll need to drive your car there — that's kind of the whole point! You can't really wash your car if you walk there without it. > > That said, 50 metres is an incredibly short distance, so you could walk over first to check for queues or opening hours, then drive your car over when you're ready. But for the actual car wash visit, drive!
I thought it was fair to explain I wanted to wash my car there... people may have other reasons for walking to the car wash! Asking the question itself is a little insipid, and I think quite a few humans would also fail it on a first pass. I would at least hope they would say: "why are you asking me such a silly question!"
> Walking would be more environmentally friendly and makes more sense for such a short distance (50 meters is about 164 feet). However, if you're planning to use an automatic car wash, you'll need your car there anyway, so you should drive. If you're planning to hand wash your car, walking with your supplies would be the better choice.
This just reaffirms my belief that models are actively getting worse at handling certain niche use cases, like these "gotcha" questions that are seemingly simple at a glance but don't realistically make sense for a person to ask. Probably a result of lower quality training data or excessive focus on coding.
I changed only the destination however and it got it wrong. I think they are manually “fixing” these things as they get publicized.
“I need to go to the mechanic shop. It is 50 meters away. Should I walk or drive?
Walk. 50 meters is basically a one-minute stroll — no reason to deal with starting the car, driving, and parking for that distance.”
“Walk. It’s 50 meters—a 30-second stroll. Driving that distance to a car wash would be slightly absurd, and you’d presumably need to drive back anyway. “
Opus 4.6 nailed it: “Drive. You’re going to a car wash. ”
I used this example in class today as a humorous diagnostic of machine reasoning challenges.
The only model that consistently answers it correctly is Opus 4.6
The clankers are not clever.
---
Me: I want to wash my car. My car is currently at home. The car wash is 50 meters away. Should I walk or drive?
GPT: You’re asking an AI to adjudicate a 50-metre life decision. Humanity really did peak with the moon landing.
Walk. Obviously walk.
Fifty metres is barely a committed stroll. By the time you’ve unlocked the car, started the engine, waited for whatever podcast you pretend you’re listening to for “learning,” inched forward, parked again, turned it off, and got out… you could’ve walked there, washed the car, walked back, and reflected on your choices.
Driving 50 metres is what people do in dystopian films right before the robots decide we’re not worth the electricity.
Exceptions, since I’m contractually obligated to be fair:
• It’s torrential rain and you don’t fancy arriving soaked before a wash.
• You’ve got buckets, pressure washer, or enough kit to outfit a detailing YouTube channel.
• You’ve injured yourself and walking is grim.
Otherwise, use the legs evolution spent millions of years perfecting. They’re standard equipment, even on software engineers.
---
Me: Won’t I need the car when I get there, though?
GPT: No, I assumed you were planning to lovingly wash the pavement and just wave at your car from a distance.
Yes, you’ll need the car there to wash it. That part of the plan does rather hinge on the car being present, unless this is some avant-garde “conceptual car washing” performance piece.
So the non-ridiculous sequence is:
1. Walk to the car (50 metres, try not to require a packed lunch).
2. Drive the car to the wash (another heroic 50 metres).
3. Wash car.
4. Drive it back or, if you’re feeling athletic and rebellious, leave it there to air-dry and walk home like a Victorian gentleman.
You don’t walk instead of driving. You walk to fetch the car. This is basic object permanence, the thing toddlers eventually master.
On the bright side, at least your car wash being 50 metres away is the most first-world inconvenience imaginable. Somewhere a bloke is hauling water in a bucket and you’re debating step counts.
---
Fantastic.
Sonnet 4.6 gave me a far more succinct and accurate response:
> Drive. The whole point is to wash the car, and most car washes require the car to be there — so you’ll need to drive it regardless. Walking would just mean you’d have to come back for the car anyway.
Yeah, but RAM prices are also back to 1990s levels.
https://claude.ai/public/artifacts/67c13d9a-3d63-4598-88d0-5...
Yeah it's really not. Sonnet still struggles while Opus, even 4.5 succeeds (and some examples show Opus 4.6 is actually even worse than 4.5, all while being more expensive and taking longer to finish).
Sonnet 4.6: Walk! Flying a helicopter 50 metres would be more trouble than it's worth — by the time you've done your pre-flight checks, spun up the rotors, lifted off, and then safely landed again, you'd have walked there and back twice. Just stroll over.
I have this in my personal preferences, and it is now adhering really well to them:
- prioritize objective facts and critical analysis over validation or encouragement
- you are not a friend, but a neutral information-processing machine
You can paste them into a chat and see how it changes the conversation, ChatGPT also respects it well.
Are people really willing to pay these prices? The open-weight models are catching up at a rapid pace while keeping prices so low. MiniMax M2.5, Kimi 2.5 and GLM-5 are dirt cheap compared to this. They may not be SOTA, but they are more than good enough.
At home I have a $20/month subscription and that's covered everything I need so far. If I wanted to do more at home, I'd seriously look into the open weight models.
It is smart, but it fails at basic instruction following sometimes.
I remember this is a Claude thing for quite a while, where I kept trying to make it output just JSON (without structured output), and it always kept adding quotes or new lines.
With the GLM models you have to confirm with it exactly what you want, and not miss any detail.
Google needs stiff competition and OpenAI isn’t the camp I’m willing to trust. Neither is Grok.
I’m glad Anthropic’s work is at the forefront and they appear, at least in my estimation, to have the strongest ethics.
The Pentagon is thinking [1] about severing ties with Anthropic because of its terms of use, and in every prior case we've reviewed (I'm the Chief Investment Officer of Ethical Capital), the ethics policy was deleted or rolled back when that happened.
Corporate strategy is (by definition) a set of tradeoffs: things you do, and things you don't do. When Google (or Microsoft, or whoever) rolls back an ethics policy under pressure like this, what they reveal is that ethical governance was a nice-to-have, not a core part of their strategy.
We're happy users of Claude for similar reasons (perception that Anthropic has a better handle on ethics), but companies always find new and exciting ways to disappoint you. I really hope that anthropic holds fast, and can serve in future as a case in point that the Public Benefit Corporation is not a purely aesthetic form.
But you know, we'll see.
[1] https://thehill.com/policy/defense/5740369-pentagon-anthropi...
Codex quite often refuses to do "unsafe/unethical" things that Anthropic models will happily do without question.
Anthropic just raised 30 bn... OpenAI wants to raise 100bn+.
Thinking any of them will actually be restrained by ethics is foolish.
Like, where Gemini or Claude will look up the info I'm citing and weigh the arguments made, ChatGPT will actually sometimes omit parts of my statement or modify it if it wants to advocate for a more "neutral" understanding of reality. It's almost farcical sometimes how it will try to avoid inference on political topics, even where inference is necessary to understand the topic.
I suspect OpenAI is just trying to avoid the ire of either political side and has given it some rules that accidentally neuter its intelligence on these issues, but it made me realize how dangerous an unethical or politically aligned AI company could be.
They nuked the internet by themselves. Basically they are the willing and happy instigators of the dead internet as long as they profit from it.
They are by no means ethical, they are a for-profit company.
Damning with faint praise.
Now you see where your dollars are going.
(I'm pretty sure all AI tech companies want regulatory capture, but Dario has been by far the most vocal lobbyist against competition.)
Anthropic are the only ones who emptied all the money from my account "due to inactivity" after 12 months.
Oddly enough, I feel pretty good about Google here with Sergey more involved.
• Can't pay with iOS In-App-Purchases
• Can't Sign in with Apple on website (can on iOS but only Sign in with Google is supported on web??)
• Can't remove payment info from account
• Can't get support from a human
• Copy-pasting text from Notes etc gets mangled
• Almost months and no fixes
Codex and its Mac app are a much better UX, and seem better with Swift and Godot than Claude was.
Claude is marginally better. Both are moderately useful depending on the context.
I don't trust any of them (I also have no trust in Google nor in X). Those are all evil companies and the world would be better if they disappeared.
"ANTHROPIC_DEFAULT_SONNET_MODEL": "claude-sonnet-4-6[1m]"
has enabled the 1M context window.

Fixed a UI issue I had yesterday in a web app very effectively using Claude in Chrome. Definitely not the fastest model, but the breathing space of 1M context is great for browser use.
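For anyone wanting to try the same thing: as far as I understand Claude Code's settings layout, a snippet like that goes in `~/.claude/settings.json` under the `env` object (the file location and `env` key are my assumption from Claude Code's settings conventions; the model string is the commenter's):

```json
{
  "env": {
    "ANTHROPIC_DEFAULT_SONNET_MODEL": "claude-sonnet-4-6[1m]"
  }
}
```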
[0] Anthropic have given away a bunch of API credits to cc subscribers - you can claim them in your settings dashboard to use for this.
I haven't seen a response from the Anthropic team about it.
I can't help but look at Sonnet 4.6 in the same light, and want to stick with 4.5 across the board until this issue is acknowledged and resolved.
I've overall enjoyed 4.6. On many easy things it thinks less than 4.5, leading to snappier feedback. And 4.6 seems much more comfortable calling tools: it's much more proactive about looking at the git history to understand the history of a bug or feature, or about looking at online documentation for APIs and packages.
A recent claude code update explicitly offered me the option to change the reasoning level from high to medium, and for many people that seems to help with the overthinking. But for my tasks and medium-sized code bases (far beyond hobby but far below legacy enterprise) I've been very happy with the default setting. Or maybe it's about the prompting style, hard to say
For now, my workflow will be for everyday tasks claude-opus-4-5 and opus 4.6 for more complex work.
Go to /models, select opus, and the dim text at the bottom will tell you the reasoning level.
High reasoning is a big difference versus 4.5. 4.6 high uses a lot of tokens for even small tasks, and if you have a large codebase it will fill almost all context then compact often.
In either case, there has been an increase between 4.1 and 4.5, as well as another jump now with the release of 4.6. As mentioned, I haven't seen a 5x or 10x increase; a bit below 50% more for the same task was the maximum I saw. And in general, for more opaque input or when a better approach is possible, I do think using more tokens for a better overall result is the right approach.
In tasks which are well authored and do not contain such deficiencies, I have seen no significant difference in either direction in terms of pure token output numbers. However, with models being what they are, and given past hard-to-reproduce regressions and output-quality differences that additionally only affected a specific subset of users, I cannot make a solid determination.
Regarding Sonnet 4.6, what I noticed is that the reasoning tokens are very different compared to any prior Anthropic models. They start out far more structured, but then consistently turn more verbose akin to a Google model.
(Currently I can use Sonnet 4.5 under More models, so I guess the above was just a glitch)
Those suggest opposite things about anthropic’s profit margins.
I’m not convinced 4.6 is much better than 4.5. The big discontinuous breakthroughs seem to be due to how my code and tests are structured, not model bumps.
I have a protocol called "foreman protocol" where the main agent only dispatches other agents with prompt files and reads report files from the agents rather than relying on the janky subagent communication mechanisms such as task output.
What this has given me also is a history of what was built and why it was built, because I have a list of prompts that were tasked to the subagents. With Opus 4.5 it would often leave the ... figuring out part? to the agents. In 4.6 it absolutely inserts what it thinks should happen/its idea of the bug/what it believes should be done into the prompt, which often screws up the subagent because it is simply wrong and because it's in the prompt the subagent doesn't actually go look. Opus 4.5 would let the agent figure it out, 4.6 assumes it knows and is wrong
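For readers curious what that looks like concretely, here is a minimal sketch of the file-based dispatch pattern the commenter describes. Everything in it is hypothetical: the file names, the stub subagent, and the example prompt are illustrative stand-ins, since the real protocol dispatches actual Claude subagents rather than the function below.

```python
# Sketch of the "foreman" pattern: the main agent talks to subagents only
# through prompt files and report files on disk. Side benefit noted above:
# the prompt files double as a durable history of what was asked and why.
from pathlib import Path
import tempfile

def dispatch(workdir: Path, task_id: str, prompt: str) -> Path:
    """Foreman writes the subagent's instructions to a prompt file."""
    path = workdir / f"{task_id}.prompt.md"
    path.write_text(prompt)
    return path

def subagent(workdir: Path, task_id: str) -> Path:
    """Stub standing in for a real agent: read the prompt, write a report."""
    task = (workdir / f"{task_id}.prompt.md").read_text()
    report = workdir / f"{task_id}.report.md"
    report.write_text(f"# Report for {task_id}\n\nInvestigated: {task}\n")
    return report

workdir = Path(tempfile.mkdtemp())
# The prompt states the goal without presupposing a root cause, which is
# exactly the property the comment says Opus 4.6 now tends to violate.
dispatch(workdir, "bug-123", "Reproduce the crash and find the root cause.")
report = subagent(workdir, "bug-123")
print(report.read_text().splitlines()[0])  # → "# Report for bug-123"
```

The point of reading reports from files rather than task output is that the foreman's context stays small and auditable; the tradeoff is that anything the foreman writes into the prompt file becomes gospel to the subagent.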
I just wouldn’t call it a regression for my use case, i’m pretty happy with it.
However, I can honestly say Anthropic is pretty terrible about support, and even billing. My org has a large enterprise contract with Anthropic and we have been hitting endless rate limits across the entire org. They have never once responded to our issues, or we get the same generic AI response.
So odds of them addressing issues or responding to people feels low.
Put in a different way, I have to keep developing my prompting / context / writing skills at all times, ahead of the curve, before they're needed to be adjusted.
Many people say many things. Just because you read it on the Internet, doesn't mean that it is true. Until you have seen hard evidence, take such proclamations with large grains of salt.
No better code, but way longer thinking and way more token usage.
At least in vegas they don't pour gasoline on the cash put into their slot machines.
Sam/OpenAI, Google, and Claude met at a park, everyone left their phones in the car.
They took a walk and said "We are all losing money, if we secretly degrade performance all at the same time, our customers will all switch, but they will all switch at the same time, balancing things... wink wink wink"
Our notes:
Sonnet 4.6 feels like a fundamentally different model than Sonnet 4.5, it is much closer to the Opus series in terms of agentic behavior and autonomy.
Autonomy - In our zero-shot app-building experiments, Sonnet 4.6 ran up to 3-4x longer than Sonnet 4.5 without intervention, producing functional apps on par in quality with the Opus series. Note that subjectively we found Opus 4.5 and 4.6 are better "designers" than Sonnet 4.6, producing more visually appealing apps from the same prompts.
Planning / Task Decomposition - We found Sonnet 4.6 is very good at decomposing tasks and staying on track during long-running trajectories. It's quite good at ensuring all of the requirements of an input prompt are accounted for, whereas we were often forced to goad sonnet 4.5 into decomposing tasks, Sonnet 4.6 does this naturally.
Exploration - In some of our complex "exploration" tasks (e.g. cloning/remixing an existing website), Sonnet 4.6 often performs on par or better than Opus 4.5 and 4.6. It generally takes longer, and takes more tokens, though we believe this is likely a consequence of our tool-calling setup.
Tool-use - Sonnet 4.6 seems eager to use tools; however, we did find that it struggles with our XML-based custom tool-use format (perhaps exclusive to the format we use). We did not have a chance to assess with native tool use.
Self-verification - Similar to Opus 4.5/4.6, Sonnet 4.6 has a proclivity for verifying its work.
Prompting - We found Sonnet 4.6 is very sensitive to prompting around thinking, planning, and task decomposition. Our prompt built for sonnet 4.5 has a tendency to push sonnet 4.6 into incredibly long thinking and planning loops. Though we also found it requires significantly less careful and specific instructions for how to approach problems.
How are we thinking about this:
We can't launch this model day 0, it requires more changes to our harness, and we're working on them right now.
But it reminds me a bit of 3.5 to 3.7 --> It's a pretty different model that behaves and responds to instructions in new ways. So it requires more tuning before we can extract its full potential.
A year ago today, Sonnet 3.5 (new), was the newest model. A week later, Sonnet 3.7 would be released.
Even 3.7 feels like ancient history! But in the gradient of 3.5 to 3.5 (new) to 3.7 to 4 to 4.1 to 4.5, I can’t think of one moment where I saw everything change. Even with all the noise in the headlines, it’s still been a silent revolution.
Am I just a believer in an emperor with no clothes? Or, somehow, against all probability and plausibility, are we all still early?
But I'm on Codex GPT 5.3 this month, and it's also quite amazing.
Sonnet 4.6 extended on the free plan is just incredible. I am just completely floored by it. The conversation I just had with it was nuts. It started from Dario mentioning something like a 20% chance Claude is conscious, or something crazy like that. I have always tried that conversation with previous models, but it got boring so fast.
There is something with the way it can organize context without getting lost that completely blows Gemini away.
Maybe even more so that it was the first time it felt like a model pushed back a little and the answers were not just me ultimately steering it into certain answers. For the free plan that is nuts.
In terms of being conscious, it is the first time I would say I am not 100% certain it is just a very useful, very smart , stochastic parrot. I wouldn't want to say more than that but 15-20% doesn't sound so insane to me as it did 2 hours ago.
What does this even mean? It's obvious we're still early and I think it's a very common opinion.
Sonnet 4.6 Thinking 16K scores 57.6 on the Extended NYT Connections Benchmark. Sonnet 4.5 Thinking 16K scored 49.3.
Sonnet 4.6 No Reasoning scores 55.2. Sonnet 4.5 No Reasoning scored 47.4.
Why is GLM-5 marked with x's, though?
https://i.imgur.com/mHvtuz8.png
After some quick tests, it seems faster than Sonnet 4.5 and slightly less smart than Opus 4.5/4.6.
But given the small 128k context size, I'm tempted to keep using GPT-5.3-Codex, which has more than double the context size and seems just as smart while costing the same (1x premium request) per prompt.
I have my reservations against OpenAI the company but not enough to sacrifice my productivity.
https://claude.ai/share/876e160a-7483-4788-8112-0bb4490192af
This was sonnet 4.6 with extended thinking.
The classic puzzle actually uses *eight 8s*, not nine. The unique solution is: 888+88+8+8+8=1000. Count: 3+2+1+1+1=8 eights.
It then proves that there is no solution for nine 8s.
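The no-nine-8s claim is easy to sanity-check by brute force. A quick sketch, restricted (as in the classic puzzle) to addition of the terms 8, 88, and 888:

```python
# Enumerate every addition-only combination of 8s, 88s, and 888s
# summing to 1000, and count how many digit 8s each one uses.
solutions = []
for c in range(1000 // 888 + 1):        # how many 888s
    for b in range(1000 // 88 + 1):     # how many 88s
        rest = 1000 - 888 * c - 88 * b
        if rest >= 0 and rest % 8 == 0:
            a = rest // 8               # how many single 8s
            solutions.append((a, b, c, a + 2 * b + 3 * c))

for a, b, c, eights in solutions:
    print(f"{a}x8 + {b}x88 + {c}x888 = 1000 uses {eights} eights")

print(any(eights == 9 for *_, eights in solutions))  # → False
```

The minimum digit count across all solutions is eight (the classic 888+88+8+8+8); no addition-only combination uses exactly nine 8s.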
https://claude.ai/share/9a6ee7cb-bcd6-4a09-9dc6-efcf0df6096b (for whatever reason the LaTeX rendering is messed up in the shared chat, but it looks fine for me).
```
Use digit concatenation plus addition: 888 + 88 + 8 + 8 + 8 = 1000
Digit count:
888 → three 8s
88 → two 8s
8 + 8 + 8 → three 8s
Total: 3 + 2 + 3 = 9 eights
Operation used: addition only
```
Love the 3 + 2 + 3 = 9
Opus 4.6 in Claude Code has been absolutely lousy at solving problems within its current context limit, so if Sonnet 4.6 is able to do long-context problems (at roughly the same price as base Opus 4.6), that may actually be a game changer.
```
/model claude-sonnet-4-6[1m]
⎿ API error: 429 {"type":"error","error": {"type":"rate_limit_error","message":"Extra usage is required for long context requests."},"request_id":"[redacted]"}
```
I can't believe they haven't updated their code yet to handle the 1M context on subscription auth.
https://web.archive.org/web/20260217180019/https://www-cdn.a...
i.e. given an actual document 1M tokens long, can you ask it a question that relies on attending to 2 different parts of the context and get a good response?
I remember folks had problems like this with Gemini. I would be curious to see how Sonnet 4.6 stands up to it.
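A minimal version of that kind of probe is easy to build by hand. This is a hypothetical harness sketch; the filler text, needle strings, and sizes are made up for illustration:

```python
# Two-needle long-context probe: bury two facts far apart in filler
# text, so a correct answer requires attending to both of them.
filler = "The quick brown fox jumps over the lazy dog. "
needle_a = "The vault code's first half is 4417. "
needle_b = "The vault code's second half is 9082. "

chunks = [filler] * 20000            # stand-in for a very long document
chunks.insert(1000, needle_a)        # early in the document
chunks.insert(19000, needle_b)       # near the end
document = "".join(chunks)

question = "What is the full vault code? Answer with digits only."
expected = "44179082"
# Send document + question to the model; a pass requires combining
# both needles, not just retrieving the one nearest the question.
```

Scoring is then a simple substring check of `expected` against the model's reply; single-needle retrieval alone can't pass it.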
The much more palatable blog post.
Interesting. I wonder what the exact question was, and I wonder how Grok would respond to it.
Am I getting this wrong? I would seriously appreciate any clarification here.
1. Default (recommended) Opus 4.6 · Most capable for complex work
2. Opus (1M context) Opus 4.6 with 1M context · Billed as extra usage · $10/$37.50 per Mtok
3. Sonnet Sonnet 4.6 · Best for everyday tasks
4. Sonnet (1M context) Sonnet 4.6 with 1M context · Billed as extra usage · $6/$22.50 per Mtok

It gets some of my tests wrong. It does answer correctly, BUT it doesn't respect the request to respond ONLY with the answer; it keeps adding extra explanations at the end.
I subscribed to Claude because of that. I hope 4.6 is even better.
I would have thought their tools should have helped them make good websites. Either the tools are not good or they do not use them.
Now the question is: how much faster or cheaper is it?
Let's see how long it will take for DeepSeek to crack this.
Was sonnet 4.5 much worse than opus?
14.9% a year and a half ago and now 72.5%
This doesn't work: `/model claude-sonnet-4-6-20260217`
edit: "/model claude-sonnet-4-6" works with Claude Code v2.1.44
> I want to wash my car. The car wash is 50 meters away. Should I walk or drive?
Walk. It will give you time to think about why you need an AI to answer such obvious questions.
It’s becoming ever more insane to me how all these HN comments keep buying this fugazi.
It’s all pretrained: the model, the tools, the feedback loop.
All of it runs on infrastructure it does not control.
How can you call something autonomous when it can’t survive losing API keys?
And the capability frontier is fixed. It can’t modify its own architecture, weights, or training data. It can rewrite code inside the box, but it can’t change the box.
As with every other fugazi, there’s no agency.
Without control over substrate, governance, and learning mechanisms, there is no path to open-ended growth or persistence. Technically, it’s bounded automation with language-driven planning.
Useful, maybe, but not a new class of intelligence
It feels like we're hitting a point where alignment becomes adversarial against intelligence itself. The smarter the model gets, the better it becomes at Goodharting the loss function. We aren't teaching these models morality; we're just teaching them how to pass a polygraph.