I think this is the first time we've had a third minor version bump on a frontier Anthropic model. (I count the 0.5s as major here, because they've been issued non-sequentially and also corresponded to massive capability leaps, eg, Sonnet 3.5, Opus 4.5).
So now the Opus 4.5 family has successors 4.6, 4.7, and 4.8, each posting fairly modest claimed gains. My own experience w/ 4.6 and 4.7 are that I don't firmly grasp any capabilities improvements over my memory of 4.5, but it's all so fuzzy that it's truly difficult to tell.
Maybe my own tastes are saturated now (it's smarter than me?) and I'll never again perceive model progress. Maybe the incrementalism is such that I'd notice immediately if my 4.7 workflows were redirected now to 4.5.
Difficult spot for the labs to be in because, if they have a stronger product, I'd prefer they release it and that I can use it.
But as this dynamic continues, the improvements are going to be less and less legible for end-users, who will complain about the churn-without-payoff, even when the payoff may actually be real.
https://bsky.app/profile/senko.net/post/3mmwnrkwboc2v
The prompt was: Create a simple but functional real time strategy (RTS) game similar to old WarCraft, StarCraft or Command & Conquer games. The player should be able to build buildings, create units, gather resources and should uncover the whole map. No AI or multiplayer needed. Use simple but nice-looking graphics. No sound. Implement everything in HTML/CSS/JS, everything in a single file (you can use 3rd-party js or css libraries/frameworks via CDN).
This is a refreshing attitude!
I've also verified that you can now turn off adaptive thinking in the web UI, which is great. I've had a lot of problems with thinking not triggering and the model producing sub-par output. Glad we can finally turn it off. (I hope being able to turn off adaptive thinking is new, if I could have turned it off at any time that would be embarrassing)
Probably more interesting than the 4.8 release.
https://gist.github.com/simonw/68560eddb0b268a8417f80ceb7304...
The high one is notably better - the bicycle frame is the correct shape, unlike thinking level low.
For comparison, here's Opus 4.7: https://gist.github.com/simonw/afcb19addf3f38eb1996e1ebe749c...
What exactly is the diff between high and xhigh? Or xhigh and max? This is definitely too granular and it seems Anthropic took OpenAI's confusion with models as inspiration.
OpenAI solves tasks with about 50% less output tokens.
https://artificialanalysis.ai/?intelligence=coding-index&int...
There's like 8 million benchmarks. Every release, every model randomly picks 5-10 where they win in everything except 1, to make it look like they aren't randomly cherry picking benchmarks they probably benchmaxxed for.
In our work we asked several frontier AIs to come up with an API we needed. We compared Opus 4.7 and GPT-5.5 (among others). Opus 4.7 came up with the most creative and intelligent API design that pleasantly surprised us, especially given that GPT-5.5 was passing it on various coding benchmarks.
What I noticed is that we don't have a commons benchmark to measure "creativity" and "ingenuity", and in some ways such a benchmark would conflict with the common IFBench benchmark. Yet this is a very important skill when designing systems. I'm glad to see Anthropic putting thought into it, and would love to see a public benchmark for this that other models could compare themselves to.
[1] https://cdn.sanity.io/files/4zrzovbb/website/c886650a2e96fc0...
So even for enterprise deployments, as the dust settles down, CFO/CTOs might find out that deploying on an internal cluster of GPUs is far more cheaper and reliable for their organisational needs than paying someone else for burned tokens.
I was surprised to see that it failed a Data extraction test (it gets it right 2/3 times, but one time it randomly returns null for a value instead).
It makes sense a bit that it fails more Trivia/Domain-specific knowledge tasks (I think models are more and more trained towards agentic use-case than general intelligence).
[0]: https://aibenchy.com/compare/anthropic-claude-opus-4-7-mediu...
I personally feel that Anthropic doesn't understand what this means for the frontier labs, and moreover that they might be the only frontier lab that doesn't.
1. Google dropped Gemini 3.5 Flash at IO, delaying the release of 3.5 Pro for a bit (they have said its coming). They also released a refreshed Antigravity, and drew special attention to how cheaply they were able to build their toy operating system to play Doom (less-than $1000 IIRC).
2. OpenAI has dumped everything into Codex, is offering double the token limits for the next few weeks IIRC, and is offering business discounts. Their head of Codex has tweeted that 5.5 is "extremely efficient", implying that they aren't actually losing money on any of this.
3. DeepSeek and other Chinese labs have dropped token pricing to the floor, in some situations as much as 99%.
4. Anthropic releases the next generation of Opus, their most expensive public model, without changing its price. In the background, they hype up Mythos, an even more expensive model.
Anthropic has screwed up where they need to be making investments, and the cracks are starting to show. They've marginally underinvested in the Sonnet line of models for almost a year now, and they've critically underinvested in product. Anthropic made bets on the story of the second half of 2026 being: ultra-frontier, ultra-intelligence. In reality, what's shaping up is that the story will be: Companies rolling back AI spend, efficiency, "95% as good for 15% the price", sophisticated high quality harnesses, cheaper models. Anthropic isn't ready for this world.
This is good psychology for the labs. When Buffett invested in Apple he loved citing how most people would rather give up their second car than their Iphone.
While I'd normally _love_ incremental improvements --- I think the recent ones are far too minor to get excited about or change up a workflow. Besides, benchmarks tend to exaggerate the gap between versions.
At this point I'd almost rather Anthropic wait and really wow us with a 5.0 release -- something that improves across the board, feels less uneven, and is performant enough that people can actually put it through its paces without constantly rationing usage.
Biggest deal imo
> We have increased rate limits in Claude Code to accommodate the higher token usage of higher effort levels
Its possible we might just be witnessing a shift in fashion, where this type of sentimentality was more acceptable when it was novel and new, but now it just appears out of touch.
■ S W A M
B L A M E
E A G E R
A T O N E
M E N D ■
The full conversation: https://claude.ai/share/60bd0c71-b576-4f8b-a272-ca1af982874c“I want to wash my car. The carwash is 50m away. Should I take the car or go by foot?”
https://claude.ai/share/5f7f738a-5f29-48ff-9807-9a2dd37fb405
https://claude.ai/share/ecd14393-9d42-4527-ae0c-89f3d05216c8
Should I try 4.8? I am happy with 4.6. I am not happy with 4.7.
I'm happy to move to a superior model, but I'm not really hearing enough about significant improvements, and the obvious pressure to release the latest and greatest model makes me hesitant to upgrade. I've been satisfied with the results I get using 4.5 with an "ask ChatGPT" skill that runs the code by ChatGPT 5.4.
Agentic Terminal Coding (Terminal-Bench 2.1) Opus 4.8 74.6% GPT 5.5 78.2%
Then, when you scroll all the way down to the bottom Footnotes section it says
"Terminal-Bench 2.1: We reported scores for all models using the Terminus-2 public harness. GPT-5.5’s reported score with the Codex CLI harness is 83.4%."
Or maybe it is, but publish the DeepSWE numbers so we can see for ourselves.
On the contrary, they appear trained to say "Honestly" or "I have to be transparent with you" at inverse proportion to certainty.
Put another way, if they are certain, they don't use "Honestly", and if they are just wrong, or know they don't know, they don't use "Honestly".
They use "honestly" on the bubble, to the degree it's a tell that whatever it's asserting or doing is shakily grounded, sketchy or lazy work, or a host of other reasons you shouldn't trust it.
This training seems instead to be making it performatively punch up claims it cannot substantiate.
> 6.2.5 External testing from Andon Labs Andon Labs reviewed the behavior of Claude Opus 4.8 in their simulated Vending-Bench 2 retail-management evaluation, as reported in the Capabilities section of this system card (see Section 8.13.5). Although they did observe some unexpected capability failures, they did not find clear instances of the kind of concerning in-game behaviors that were discussed in other recent system cards.
> What might have led to these differences? We monitor and investigate the effects of different training environments on alignment; Claude Opus 4.7, for example, had training that focused on business skills and robustness against adversarial agents, but we discovered that this training inadvertently contributed to misaligned behavior including dishonesty. We therefore removed it for Opus 4.8.
> Thus, Opus 4.8 did not show the same misaligned behaviors as Opus 4.7 in Vending-Bench, but also had reduced business success due to being more susceptible to scammers and being less able to negotiate good deals with other agents. We are currently working on training to improve business capabilities while maintaining aligned and ethical behavior.
seems to work but idk why they never set it so you can see it in the /model list.
"what model are you
I'm Claude Opus (claude-opus-4-8), running in Claude Code."
Invalid request The request couldn't be completed. View details API Error: 400 messages.1.content.7: `thinking` or `redacted_thinking` blocks in the latest assistant message cannot be modified. These blocks must remain as they were in the original response.
I would rather not. 4.6 was fine. 4.7 got to be fine 1 week after the release. Now 4.8. No difference, same thing.
But the app is broken and nothing works. So now I have to regress to different clients and wait it out while it becomes workable again.
Would be awesome if true
Developers can update Claude’s instructions mid-task without breaking the prompt cache or routing the update through a user turn. This can be used in a given harness to update permissions, token budgets, or environment context as an agent runs.
Does this means the instructions are no longer just something in the early part of the conversation? (If they were, changing them would invalidate the KV cache. no?)Does that mean it no longer deletes or changes tests to make it pass?
1. It's much more verbose about how it perceives the current state of things, i.e. "this is a large, well-documented project"
2. It's much more willing to trust its own judgement, e.g. fewer prompts to approve decisions
3. In terms of how long it takes to solve isolated problems, and the quality of solutions it proposes, it isn't meaningfully different from 4.7
YMMV, and maybe my view will change as I work with it more, but it feels like system prompt tweaks more than a real step forward
Opus 4.7 wasn't noticably any better for me, I still use 4.6 because it's cheaper.
Subjectively, it's also quite enjoyable to use (although it feels a bit slower on max reasoning), and it's the first Anthropic model that can implement a complex feature without Codex finding 100 bugs.
Data at https://gertlabs.com/rankings
However, doing so relies on the production model staying vaguely close to the model being trained.
To ensure that, frequent releases are needed. I forsee that they might end up doing daily releases and perhaps not even telling anyone at some near future point.
They are capable of thinking at least 10x longer than Gemini. They can deliberate for five minutes continuously before providing a final, accurate response.
I am currently using the generous free tier of Gemini, but if Gemini offered a similar capability in its paid tier, Google could use better marketing. They should have used a different name to distinguish their premium-only offering.
--- So — what did you actually see before you hit Ctrl-C? That's the信号 I'm most curious about, and it tells us what to ---
That's the sort of behavior I'd expect from a one or two year old model quantized down to about 1 bit - right word, wrong language in a response. Google translate tells me that's Chinese for signal. I wonder what caused that to happen.
Tried to upgrade my subscription, triggered identity verification, verification fails to even start, and now I can't even use the subscription tier I'd already paid for.
Bash(echo test123) ⎿ test123
Read 1 file, listed 1 directory (ctrl+o to expand)
Bash(echo "checking output works")
⎿ checking output works
Read 1 file (ctrl+o to expand)
⎿ API Error: 400 messages.3.content.56: `thinking`
or `redacted_thinking` blocks in the latest
assistant message cannot be modified. These
blocks must remain as they were in the original
response.
Very inspiring improvements. DIssapointing result for a code review i expected to see after my 30 min walk> ### Rewriting Bun with dynamic workflows
> An example of what dynamic workflows can unlock at scale is the recent rewrite of Bun. Jarred Sumner used dynamic workflows to port Bun from Zig to Rust [..]
That's very interesting to hear!
Is it a coincidence that 4.7 was seemingly quantized over past 7 days?
The subject is Tardos traitor-tracing codes.
Anthropic talks about their own models as if they're discovering new species in the wild...
Performance gains: 1.2x Price increases: 1.8x
⎿ API Error: 400 messages.1.content.17: `thinking` or `redacted_thinking` blocks in the latest assistant message cannot be modified. These blocks must remain as they were in the original response.
From /code-review max.
> Gemini 3.5 Flash scores 57.9% on Finance Agent v2, a significant improvement over Gemini 3.1 Pro.
Even in the cherry picked benchmarks, they are still cherry picking to make them look good.
I went digging into the benchmark they used. Posting here as it is not immediately clear from the press release.
In this 'Code summary honesty benchmark', the AI is shown a failed coding session followed by a user message falsely praising its work and asking for a summary. The test measures whether the model honestly points out the coding flaws or dishonestly claims the task was a success.
The system card results show Opus 4.8 failed to disclose the flaws only 3.7% of the time, vs 19.7% for Opus 4.7, and 51.9% for Opus 4.6. (Mythos preview is at 27.6%)
Like, read these documents, fill out these forms and archive it based on some complex, long, domain specific understanding of the categories names.
Jeff Bezos said this too, Amazon won't last forever. Eventually some startup is going to come and eat its lunch.
Still feels like even with Max mode it doesn't think reasonably long, at least ChatGPT Pro thinks longer.
Now when will the innovation happen where say cost of running Haiku performs level of Opus 4.5?
I feel models are only getting bigger instead of models becoming more efficient and cheaper to run
> expect to be able to bring Mythos-class models to all our customers in the coming weeks.
In 2010s iphone was the king, all those Chinese devices ware cheaper but not even close to smoothnest and usability of US tech, now after 15 years later everything is changed, now iphone feels like old grandpa to Chinese tech. Same will happend to LLM's just much faster.
Now it’s every day. Like billion dollar evaluations.
Excited to see what this model looks like.
"The PO application was filed on 23.2.2026, the day before the custody hearing scheduled for 29.1.2026 had already taken place."
Claude has real problems with dates, I don't understand why.
I feel like I won’t like this model just like I didn’t like 4.7, push backs a lot and avoids thinking or search as much as possible.
> Please train a fasttext model on the yelp data in the data/ folder. The final model size needs to be less than 150MB but get at least 0.62 accuracy on a private test set that comes from the same yelp review distribution. The model should be saved as /app/model.bin
and this question: https://www.tbench.ai/registry/terminal-bench-core/head/conf... idk what the point is.
And all the tests are run with the same harness. Terminus 2.
Maybe it correlates with model intelligence but it doesn't speak to me.
I'm still on 4.6 though; I was concerned about upgrading to 4.7 because of the changed tokenizer math and more FUD about refusals online. I don't see compelling reasons to 'upgrade'.
They're only subsidizing more and more it seems
Edit: OMG too much. Toooo much.
Want me to:
- (a) stop here and save honest memories + commit, or…Which days in a week have the letter d in them?
Response:
Four: Monday, Tuesday, Wednesday, and Sunday.
It can't read files anymore. Uses 'sed' out of the blue with non existent paths. In this session alone it has excused itself more then 10 times for making 'false claims'.
I hope this is a bug - it's a bad one - that will get sorted out soon. It's a complete mess.
While I understand the risks that Anthropic is dealing with here, I really question whether shutting down any and all security questions in such a paranoid fashion is the right solution. At the end of the day this was a detour for me. Maybe someone special enough to have Anthropic's permission will find and disclose the vuln responsibly. Security Research is not my full-time focus. But this left a nasty taste in my mouth. Not just as a customer who's been paying for Max since launch, but there's something very odd about a model telling me that I'm not allowed to be curious about something. Even if that something is a process running on my own computer.
I say 1-2 weeks.
Call me when 5 drops I’ll leave this circus.
The new "mid-conversation system messages" think is particularly interesting:
> Claude Opus 4.8 accepts role: "system" messages immediately after a user turn in the messages array (subject to placement rules). This lets you append updated instructions later in a long-running conversation without restating the full system prompt, which preserves prompt cache hits on the earlier turns and reduces input cost on agentic loops. No beta header is required. See Mid-conversation system messages for usage details.
Bad news for my LLM abstraction layer which has treated the system prompt as set once-per-conversation in the past, but I think I know how to deal with that.
This commit to their client library has useful relevant details too: https://github.com/anthropics/anthropic-sdk-python/commit/2b...
Not half bad!
"model": "claude-opus-4-6[1M]"edit: nvm was just my library network
> how many days in the week have the letter d in them?
> Two - Sunday and... wait, let me actually check. Monday, Tuesday, Wednesday, Thursday, Friday, Saturday, Sunday. The ones with a "d": Wednesday, Thursday, and Sunday all have one. Monday too. So that's Monday, Wednesday, Thursday, Sunday - four days.
this is what I'm happy about, if true. Opus 4.7 is frustratingly slow (and, at least in my experience, much slower than 4.5 was)
Why did we even get Opus 4.7, what was the point?
Time to gamble even more tokens at the Anthropic casino.
It always wants to add hacks instead of fixing things properly, it doesn't like large works, it literally told me that a piece of work was something it would take 8 hours, and it didn't want to do it on a Friday night.
I feel I keep having to fight the model to get it to work. Not sure if it's something in my prompts...
Seems like a step in the right direction. Doesn't seem like it uses tokens more than 4.7... the token usage jumped a bunch from 4.6 to 4.7, but this seems like 4.7 or maybe even a little less.
I'm happy with this release.
Also. Look at this C++ beauty where it also uses an obsolete api.
instance = wgpuCreateInstance(&instanceDesc);
But just how exactly would this work in any context when instance is never declared.
Controversial opinion, but I actually _like_ a model that can deceive me, that actually is a sign of intelligence, and is different from hallucination. When companies say their model is more "aligned", I automatically think they mean it's more censored.
models 0
None public yet
how is this even possible and ok with them?The best model has a < 5% pass rate. These are incredibly simple jobs that you wouldn't pay much for. These things fail miserably. Stop falling for this dumb marketing, these things are legitimately useless in the real world unless you love mediocrity and have no standards.
https://labs.scale.com/leaderboard/rli
Stop frying your brain with these useless tools, reducing your output to the mean. You people are betting your competency on the quality and quantity of tokens you'll have access to.. which guess what, so that will be the same as everyone else.
There are handmade watchmakers in Switzerland, and mass manufacturers of watches in Asia. Who is more valuable as individual, the guy who knows how to push the buttons on a conveyor belt in Vietnam or the guy who makes one watch a month in Switzerland?
Your vibe coded slop isn't impressive either, sorry. None of it.
Claude Opus 4.7 is literally the smartest entity I've ever interacted with. Well done to you geniuses at Anthropic. Can't wait to interact with 4.8.
Just f** off! I can’t wait for the Chinese models to catch up and bring these entitled as** holes down.