Claude Opus 4.8 (opens in new tab)

(anthropic.com)

1712 pointscraigmart1d ago1334 comments

1334 comments

A rambling comment:

I think this is the first time we've had a third minor version bump on a frontier Anthropic model. (I count the 0.5s as major here, because they've been issued non-sequentially and also corresponded to massive capability leaps, eg, Sonnet 3.5, Opus 4.5).

So now the Opus 4.5 family has successors 4.6, 4.7, and 4.8, each posting fairly modest claimed gains. My own experience w/ 4.6 and 4.7 are that I don't firmly grasp any capabilities improvements over my memory of 4.5, but it's all so fuzzy that it's truly difficult to tell.

Maybe my own tastes are saturated now (it's smarter than me?) and I'll never again perceive model progress. Maybe the incrementalism is such that I'd notice immediately if my 4.7 workflows were redirected now to 4.5.

Difficult spot for the labs to be in because, if they have a stronger product, I'd prefer they release it and that I can use it.

But as this dynamic continues, the improvements are going to be less and less legible for end-users, who will complain about the churn-without-payoff, even when the payoff may actually be real.

51 more replies

senko1d ago

My fav coding benchmark for frontier models is to build a simple RTS game in one file (js/html/css). Claude Code with Opus 4.8 in ultracode mode nailed it, the best result so far:

https://bsky.app/profile/senko.net/post/3mmwnrkwboc2v

The prompt was: Create a simple but functional real time strategy (RTS) game similar to old WarCraft, StarCraft or Command & Conquer games. The player should be able to build buildings, create units, gather resources and should uncover the whole map. No AI or multiplayer needed. Use simple but nice-looking graphics. No sound. Implement everything in HTML/CSS/JS, everything in a single file (you can use 3rd-party js or css libraries/frameworks via CDN).

19 more replies

colonCapitalDee1d ago

"Users will find Opus 4.8 to be a modest but tangible improvement on its predecessor."

This is a refreshing attitude!

I've also verified that you can now turn off adaptive thinking in the web UI, which is great. I've had a lot of problems with thinking not triggering and the model producing sub-par output. Glad we can finally turn it off. (I hope being able to turn off adaptive thinking is new, if I could have turned it off at any time that would be embarrassing)

15 more replies

northern-lights1d ago

> Not only that, but we plan to release a new class of model with even higher intelligence than Opus. As part of Project Glasswing, a small number of organizations are currently using Claude Mythos Preview for cybersecurity work. Models of this capability level require stronger cyber safeguards before they can be generally released. We’re making swift progress on developing these safeguards and expect to be able to bring Mythos-class models to all our customers in the coming weeks.

Probably more interesting than the 4.8 release.

8 more replies

simonw1d ago

I generated pelicans riding bicycles on both thinking level low and thinking level high:

https://gist.github.com/simonw/68560eddb0b268a8417f80ceb7304...

The high one is notably better - the bicycle frame is the correct shape, unlike thinking level low.

For comparison, here's Opus 4.7: https://gist.github.com/simonw/afcb19addf3f38eb1996e1ebe749c...

22 more replies

elAhmo6h ago

As if choosing a model to use on its own is not hard, offering six levels of "effort" (quite a vague term as well), low, medium, high, xhigh, max, ultracode (?!?!) is really making comparisons next to impossible when people using the same model can have vastly different experiences.

What exactly is the diff between high and xhigh? Or xhigh and max? This is definitely too granular and it seems Anthropic took OpenAI's confusion with models as inspiration.

5 more replies

hereme8881d ago

Early ArtificialAnalysis.ai results show GPT 5.5 is still the better bang-for-your-buck.

OpenAI solves tasks with about 50% less output tokens.

https://artificialanalysis.ai/?intelligence=coding-index&int...

6 more replies

onlyrealcuzzo1d ago

Does anyone troll these releases and cherry pick random metrics other companies would cherry pick to show how amazing their models are?

There's like 8 million benchmarks. Every release, every model randomly picks 5-10 where they win in everything except 1, to make it look like they aren't randomly cherry picking benchmarks they probably benchmaxxed for.

6 more replies

epitrochoid4139h ago

Meanwhile Deepseek is cutting inference costs to mere cents. Thats the real AI revolution for you.

1 more reply

gslepak1d ago

On page 102 of the system card [1] I'm pleased to see evaluation against "creative mastery".

In our work we asked several frontier AIs to come up with an API we needed. We compared Opus 4.7 and GPT-5.5 (among others). Opus 4.7 came up with the most creative and intelligent API design that pleasantly surprised us, especially given that GPT-5.5 was passing it on various coding benchmarks.

What I noticed is that we don't have a commons benchmark to measure "creativity" and "ingenuity", and in some ways such a benchmark would conflict with the common IFBench benchmark. Yet this is a very important skill when designing systems. I'm glad to see Anthropic putting thought into it, and would love to see a public benchmark for this that other models could compare themselves to.

[1] https://cdn.sanity.io/files/4zrzovbb/website/c886650a2e96fc0...

2 more replies

wg01d ago

There is a hole in the boat's bottom due to Chinese models. They might not be as good but they are not bad either or at least I had hard time finding any issues with Deepseekv4 Flash and Pro variants. They get their job done sometimes rarely giving up till they are done what they are after.

So even for enterprise deployments, as the dust settles down, CFO/CTOs might find out that deploying on an internal cluster of GPUs is far more cheaper and reliable for their organisational needs than paying someone else for burned tokens.

10 more replies

silverlight1d ago

Unfortunately they seem to have straight up broken Claude Code either with this release in the backend or the new CC version. Errors about "can't modify thinking blocks" are bricking long-running sessions: https://github.com/anthropics/claude-code/issues?q=is%3Aissu...

9 more replies

XCSme1d ago

On my tests[0] it does a bit worse, and it's almost 2x expensive than Opus 4.7...

I was surprised to see that it failed a Data extraction test (it gets it right 2/3 times, but one time it randomly returns null for a value instead).

It makes sense a bit that it fails more Trivia/Domain-specific knowledge tasks (I think models are more and more trained towards agentic use-case than general intelligence).

[0]: https://aibenchy.com/compare/anthropic-claude-opus-4-7-mediu...

3 more replies

thombles13h ago

Today I was a few hours into chasing down a very tricky timing-dependent bug with GPT 5.5 and we were starting to go into circles. I noticed Opus 4.8 had showed up in GitHub Copilot so I switched over and pointed it at my notes so far. Another hour of steady progress and it tracked it down to some missing synchronisation in an upstream library which was occasionally corrupting a linked list. N=1 but worth every one of those rather expensive 15x requests today. 15x... yeah.

2 more replies

827a1d ago

Frontier models are mostly past the point of human ability to discern whether they are actually better or worse than predecessors and competitors. I suspect the benchmarks may also be saturated, or at least past their usefulness.

I personally feel that Anthropic doesn't understand what this means for the frontier labs, and moreover that they might be the only frontier lab that doesn't.

1. Google dropped Gemini 3.5 Flash at IO, delaying the release of 3.5 Pro for a bit (they have said its coming). They also released a refreshed Antigravity, and drew special attention to how cheaply they were able to build their toy operating system to play Doom (less-than $1000 IIRC).

2. OpenAI has dumped everything into Codex, is offering double the token limits for the next few weeks IIRC, and is offering business discounts. Their head of Codex has tweeted that 5.5 is "extremely efficient", implying that they aren't actually losing money on any of this.

3. DeepSeek and other Chinese labs have dropped token pricing to the floor, in some situations as much as 99%.

4. Anthropic releases the next generation of Opus, their most expensive public model, without changing its price. In the background, they hype up Mythos, an even more expensive model.

Anthropic has screwed up where they need to be making investments, and the cracks are starting to show. They've marginally underinvested in the Sonnet line of models for almost a year now, and they've critically underinvested in product. Anthropic made bets on the story of the second half of 2026 being: ultra-frontier, ultra-intelligence. In reality, what's shaping up is that the story will be: Companies rolling back AI spend, efficiency, "95% as good for 15% the price", sophisticated high quality harnesses, cheaper models. Anthropic isn't ready for this world.

13 more replies

pbmango1d ago

I can't help but think of Iphone updates since about 2018. The thinnest, fastest, longest battery life Iphone ever. It seems mostly the same and I probably won't be able to tell other than the name, but everyone buys it anyway.

This is good psychology for the labs. When Buffett invested in Apple he loved citing how most people would rather give up their second car than their Iphone.

5 more replies

dudeinhawaii1d ago

This is the first time I saw a model pop-up on HN and didn't really care. Model exhaustion? It looks interesting but not exciting.

While I'd normally _love_ incremental improvements --- I think the recent ones are far too minor to get excited about or change up a workflow. Besides, benchmarks tend to exaggerate the gap between versions.

At this point I'd almost rather Anthropic wait and really wow us with a 5.0 release -- something that improves across the board, feels less uneven, and is performant enough that people can actually put it through its paces without constantly rationing usage.

4 more replies

dangoodmanUT1d ago

> The Messages API now accepts system entries inside the messages array. Developers can update Claude’s instructions mid-task without breaking the prompt cache or routing the update through a user turn. This can be used in a given harness to update permissions, token budgets, or environment context as an agent runs.

Biggest deal imo

square_usual1d ago

Buried lede:

> We have increased rate limits in Claude Code to accommodate the higher token usage of higher effort levels

1 more reply

SimianSci1d ago

There is an obvious shift in sentiment amongst users, at least here in the US. I feel it myself, even as a proponent of AI tools, the bloviating and language that these companies use in these release articles are starting to wear thin on my patience.

Its possible we might just be witnessing a shift in fashion, where this type of sentimentality was more acceptable when it was novel and new, but now it just appears out of touch.

3 more replies

jkxyz21h ago

My smoke test for new models is to get it to generate a crossword, and this is the first time it's done a good job on the layout:

  ■  S  W  A  M
  B  L  A  M  E
  E  A  G  E  R
  A  T  O  N  E
  M  E  N  D  ■

The full conversation: https://claude.ai/share/60bd0c71-b576-4f8b-a272-ca1af982874c

1 more reply

alansaber1d ago

"Our models are more honest" honey the quarterly marketing spin for a ML term has come. Forget "task alignment" now we're going for "truth index". I suppose this is the only way to generate hype when you're selling/releasing the same product over and over again.

3 more replies

eshack9423h ago

The Claude Pro subscription is basically useless at this point, in terms of usage limits with respect to the settings required to achieve actual useful output.

2 more replies

protoman30001d ago

Opus 4.8 says to take the car. 4.7 said to walk.

“I want to wash my car. The carwash is 50m away. Should I take the car or go by foot?”

https://claude.ai/share/5f7f738a-5f29-48ff-9807-9a2dd37fb405

https://claude.ai/share/ecd14393-9d42-4527-ae0c-89f3d05216c8

1 more reply

setnone1d ago

Claude's 4.6 - 4.7 transition made me discover codex, and with gpt 5.5 there is no way i'm going back

2 more replies

irthomasthomas1d ago

Why does anthropic change the set of benchmarks they use with every new model release?

https://www.anthropic.com/news/claude-opus-4-7

https://www.anthropic.com/news/claude-opus-4-6

1 more reply

Frannky18h ago

I use 4.6, because 4.7 is super lazy, deflects responsibility, and assumes it is good and I am bad, and avoids checking reality. It looks like it's trained on lazy humans instead of good engineers.

Should I try 4.8? I am happy with 4.6. I am not happy with 4.7.

2 more replies

winterbourne10h ago

Interesting to search this page for "4.5".

I'm happy to move to a superior model, but I'm not really hearing enough about significant improvements, and the obvious pressure to release the latest and greatest model makes me hesitant to upgrade. I've been satisfied with the results I get using 4.5 with an "ask ChatGPT" skill that runs the code by ChatGPT 5.4.

1 more reply

conception1d ago

Probably explains why Opus was trash for the last week - https://marginlab.ai/trackers/claude-code/. Curious if the new baseline will rise now in-line with the new benchmarks.

2 more replies

sillyboi10h ago

I just tried Opus 4.8 (Ultracode xhigh + workflows), and it started throwing an error no matter what I sent to the chat: "API Error: 400 message.1.content.4: thinking or redacted_thinking blocks in the latest assistant message cannot be modified. These blocks must remain as they were in the original response."

1 more reply

ethanpil1d ago

The table comparing eval scores shows the following:

Agentic Terminal Coding (Terminal-Bench 2.1) Opus 4.8 74.6% GPT 5.5 78.2%

Then, when you scroll all the way down to the bottom Footnotes section it says

"Terminal-Bench 2.1: We reported scores for all models using the Terminus-2 public harness. GPT-5.5’s reported score with the Codex CLI harness is 83.4%."

1 more reply

lordmauve1d ago

Given DeepSWE just blew apart the SWE-Bench Pro benchmark and handed a 14-point lead to GPT-5.5, it looks pretty bad that they've listed SWE-Bench first in the model release and no DeepSWE. Like, this isn't obviously an answer.

Or maybe it is, but publish the DeepSWE numbers so we can see for ourselves.

2 more replies

Terretta22h ago

> One of the most prominent improvements in Opus 4.8 is its honesty. We train all our models to be honest

On the contrary, they appear trained to say "Honestly" or "I have to be transparent with you" at inverse proportion to certainty.

Put another way, if they are certain, they don't use "Honestly", and if they are just wrong, or know they don't know, they don't use "Honestly".

They use "honestly" on the bubble, to the degree it's a tell that whatever it's asserting or doing is shakily grounded, sketchy or lazy work, or a host of other reasons you shouldn't trust it.

This training seems instead to be making it performatively punch up claims it cannot substantiate.

redfloatplane1d ago

This made me laugh. Training Opus 4.7 on business skills caused it to sometimes exhibit dishonest behaviour, and not training 4.8 on those skills removed it. From the system card:

> 6.2.5 External testing from Andon Labs Andon Labs reviewed the behavior of Claude Opus 4.8 in their simulated Vending-Bench 2 retail-management evaluation, as reported in the Capabilities section of this system card (see Section 8.13.5). Although they did observe some unexpected capability failures, they did not find clear instances of the kind of concerning in-game behaviors that were discussed in other recent system cards.

> What might have led to these differences? We monitor and investigate the effects of different training environments on alignment; Claude Opus 4.7, for example, had training that focused on business skills and robustness against adversarial agents, but we discovered that this training inadvertently contributed to misaligned behavior including dishonesty. We therefore removed it for Opus 4.8.

> Thus, Opus 4.8 did not show the same misaligned behaviors as Opus 4.7 in Vending-Bench, but also had reduced business success due to being more susceptible to scammers and being less able to negotiate good deals with other agents. We are currently working on training to improve business capabilities while maintaining aligned and ethical behavior.

1 more reply

mesmertech1d ago

/model claude-opus-4-8

seems to work but idk why they never set it so you can see it in the /model list.

"what model are you

I'm Claude Opus (claude-opus-4-8), running in Claude Code."

1 more reply

IFC_LLC1d ago

Ugh...

Invalid request The request couldn't be completed. View details API Error: 400 messages.1.content.7: `thinking` or `redacted_thinking` blocks in the latest assistant message cannot be modified. These blocks must remain as they were in the original response.

I would rather not. 4.6 was fine. 4.7 got to be fine 1 week after the release. Now 4.8. No difference, same thing.

But the app is broken and nothing works. So now I have to regress to different clients and wait it out while it becomes workable again.

2 more replies

atleastoptimal18h ago

I love how Anthropic gets its employees to talk about enjoying using this model internally when it's likely they're just using Mythos 99% of the time

james_marks1d ago

> One of the most prominent improvements in Opus 4.8 is its honesty. We train all our models to be honest—for instance, to avoid making claims that they can’t support. But a general problem with AI models is that they sometimes jump to conclusions, confidently claiming to have made progress in their work despite the evidence being thin. Early testers report that Opus 4.8 is more likely to flag uncertainties about its work and less likely to make unsupported claims.

Would be awesome if true

8 more replies

rahimnathwani1d ago

Can anyone explain how this is possible?

  Developers can update Claude’s instructions mid-task without breaking the prompt cache or routing the update through a user turn. This can be used in a given harness to update permissions, token budgets, or environment context as an agent runs.

Does this means the instructions are no longer just something in the early part of the conversation? (If they were, changing them would invalidate the KV cache. no?)

2 more replies

Anonasty15h ago

As long as the token usage is as poor as it has been since march, we don't care about the new bells and whistles.

tarruda1d ago

> One of the most prominent improvements in Opus 4.8 is its honesty.

Does that mean it no longer deletes or changes tests to make it pass?

timbucktwo2h ago

Still not as good as the OG 4.7 that got yanked and re-released with gimp mode enabled.

poink16h ago

I have a relatively large "vibe coded" project that I let Claude 4.5-4.7 drive over the past few months, and my read on it is:

1. It's much more verbose about how it perceives the current state of things, i.e. "this is a large, well-documented project"

2. It's much more willing to trust its own judgement, e.g. fewer prompts to approve decisions

3. In terms of how long it takes to solve isolated problems, and the quality of solutions it proposes, it isn't meaningfully different from 4.7

YMMV, and maybe my view will change as I work with it more, but it feels like system prompt tweaks more than a real step forward

SmithersBot3h ago

Opus work so well for now... until they quantize next week...

cedws1d ago

I'm very suspicious of these same price model launches. It feels like they're benchmaxxed so they can put everyone on them and reduce their compute costs behind the scenes. If the model were genuinely better why wouldn't they charge more for it? Charging the same for something better is a race to the bottom.

Opus 4.7 wasn't noticably any better for me, I still use 4.6 because it's cheaper.

2 more replies

techtuate1d ago

Looking at the comments in this group, I'm not the only "stupid" one who hasn't noticed any discernable improvement in quality across the newer models. In fact my Claude code on re-login switched to Sonnet 4.6 and the vibe coding quality (with Opus 4.7 assisted prompts) has been good enough for me to lazily persevere with Sonnet for coding. Having said that I'm now on Opus 4.8 and will gladly come back here and eat humble pie should my opinion change. PS: Since my goal is embedding the best AI in B2B SAAS products, the key differentiator is not to use the shiniest Claude version (too expensive anyway) but to build a client aware RAG to enable bespoke learning and to use the right AI for my product - a combination of Gemini 3.0 Flash (image and not bad at reasoning), Grok (reasoning) work for me. Would love to hear more ideas (especially on open source as I'll look to cost optimize when I hit scale)

2 more replies

giwook22h ago

The way that Mythos is likely being used to train these publicly available models, I wonder if there will always be a private, mostly/wholly internal model that is significantly ahead technically but is reserved for internal or "VIP" use.

2 more replies

jmward011d ago

Meanwhile haiku is on 4.5 and sonnet is on 4.6. It is clear where they are not making money.

2 more replies

gertlabs16h ago

We just finished our initial coding evals of Opus 4.8. Anthropic definitely heard the backlash from Opus 4.7 and they made up for it today.

Subjectively, it's also quite enjoyable to use (although it feels a bit slower on max reasoning), and it's the first Anthropic model that can implement a complex feature without Codex finding 100 bugs.

Data at https://gertlabs.com/rankings

londons_explore1d ago

My guess is anthropic is doing reinforcement learning based on user sessions.

However, doing so relies on the production model staying vaguely close to the model being trained.

To ensure that, frequent releases are needed. I forsee that they might end up doing daily releases and perhaps not even telling anyone at some near future point.

1 more reply

maxloh9h ago

Claude’s reasoning models really impress me as a Gemini user, both in coding tasks and in creative writing for my social science courses.

They are capable of thinking at least 10x longer than Gemini. They can deliberate for five minutes continuously before providing a final, accurate response.

I am currently using the generous free tier of Gemini, but if Gemini offered a similar capability in its paid tier, Google could use better marketing. They should have used a different name to distinguish their premium-only offering.

babelfish1d ago

So GPT 5.6 tomorrow, then?

3 more replies

Spikefu19h ago

I was happily plodding away with it earlier when it threw this out in the middle of a response in Claude code:

--- So — what did you actually see before you hit Ctrl-C? That's the信号 I'm most curious about, and it tells us what to ---

That's the sort of behavior I'd expect from a one or two year old model quantized down to about 1 bit - right word, wrong language in a response. Google translate tells me that's Chinese for signal. I wonder what caused that to happen.

3 more replies

laszlojamf15h ago

I find it freaky how you notice the language change between models. Some words which pop up now all the time, that I don't remember reacting to with previous models, such as "honest(ly)" and "load-bearing". Feels like a new AI smell, like em-dashes or "it's not just x, it's y".

2 more replies

jtrn1d ago

Initial testing feels better than 4.8 And the knowledge cutoff claim of January 2026 seems to check out since it was able to "remember" without search about the double-tap killing of a drug smuggler by the US Army in late December.

generalizations1d ago

Hoping that one day they'll let me go through the identity verification process so I can use it again.

Tried to upgrade my subscription, triggered identity verification, verification fails to even start, and now I can't even use the subscription tier I'd already paid for.

user-1d ago

Bash(echo "hello"; pwd) ⎿ hello /Users/username/Work/Github/project

Bash(echo test123) ⎿ test123

  Read 1 file, listed 1 directory (ctrl+o to expand)

 Bash(echo "checking output works")
  ⎿  checking output works

  Read 1 file (ctrl+o to expand)
  ⎿  API Error: 400 messages.3.content.56: `thinking`
     or `redacted_thinking` blocks in the latest
     assistant message cannot be modified. These
     blocks must remain as they were in the original
     response.

Very inspiring improvements. DIssapointing result for a code review i expected to see after my 30 min walk

1 more reply

Tenoke1d ago

Claude Code has been wonderful for work and the frequent improvements are nice, although with Mythos being used by others ages ago and new versions for the public still being bellow that, it's hard to not feel like the underclass already.

StanAngeloff8h ago

> [..] Early access users and teams inside Anthropic have been using dynamic workflows for a wide range of use cases [..]

> ### Rewriting Bun with dynamic workflows

> An example of what dynamic workflows can unlock at scale is the recent rewrite of Bun. Jarred Sumner used dynamic workflows to port Bun from Zig to Rust [..]

That's very interesting to hear!

seaal1d ago

https://marginlab.ai/trackers/claude-code/

Is it a coincidence that 4.7 was seemingly quantized over past 7 days?

2 more replies

S-E-P21h ago

I haven't had the best experience with 4.7 and it felt like a substantial debuff. I've even ended up moving a lot of review to codex just because 4.7 was so dense.. Here's to hoping they figured it out since I'm not entirely sure but I would have to guess that they were experimenting with making the model lighter (although I have no concrete evidence of this).

1 more reply

nikolay1d ago

Give us Mythos! This piecemealing doesn't help Anthropic at all, especially psychologically! They are playing a dangerous game, and I see many people leaving Claude Code for good - both due to the subsidy games, and for Anthropic not dogfooding and using unreleased models internally and giving us subpar ones. Benchmarks are nice, but the real-world experience is quite different - neither can you notice these slight improvements, nor are competitors that much worse based on some generic benchmarks.

4 more replies

coppsilgold14h ago

The Opus model as usual impresses. Gave it a paper link with bullet point instructions and constraints (while baiting it to perform some mind reading of my intentions) and it implemented production ready code + the requested attack simulations: <https://gist.github.com/coppsilgold/00d3cd490cb7f8ffc3fe5c1c...>

The subject is Tardos traitor-tracing codes.

winwang1d ago

Let's hope I don't have to disable it after a day like with 4.7, lol, and that it doesn't lose too much Claude-ishness (though many will beg to differ).

clutch891d ago

> One of the most prominent improvements in Opus 4.8 is its honesty

Anthropic talks about their own models as if they're discovering new species in the wild...

12 more replies

lxxpxlxxxx1d ago

My experience with these new releases is that the gains in performance are negated by the price increases and it seems like:

Performance gains: 1.2x Price increases: 1.8x

2 more replies

delis-thumbs-7e1d ago

I won’t change from 4.6. You won’t trick me again.

1 more reply

swader9991d ago

Used it for a couple of long running prompts so far. Had to restart one that bonked on API errors. Of note, I really like the straight forward candor its using. 'More honest' than previous models is playing out in what its saying to me. Telling me straight up where it failed, where gaps are. I like it so far.

skysthelimitt1d ago

when will we get anything for sonnet or haiku? the market for less-capable but cheaper models seems to be completely ignored nowadays

2 more replies

Aldipower10h ago

Claude needs a watch, that's all. Would in itself a 100% improvement.

rkuska1d ago

Thinking on max is broken on 4.8 for me, getting many:

⎿ API Error: 400 messages.1.content.17: `thinking` or `redacted_thinking` blocks in the latest assistant message cannot be modified. These blocks must remain as they were in the original response.

From /code-review max.

gadders10h ago

For me n=1 vibe-coding efforts, I found Opus 4.6 better than Opus 4.7. 4.7 seemed to over-reach and go beyond what was requested - adding features I never asked for with no consent.

necrotic_comp1d ago

4.8 also seems like a regression and using it from the chat GUI results in 4.6 no longer showing up. If someone from anthropic is here, is it possible to readd 4.6 in the "other models" dropdown ? I feel like I got a bit baited/switched here.

2 more replies

crambelsoupy19h ago

LGTM. With "ultra" effort Opus 4.8 was able to reproduce and fix a rare bug in our reactive dataflow that has been haunting me for 4 months. I've had >10 attempts to reproduce and fix with Opus 4.7. What made it hard was that it randomly occurred in only a subset of CI runners and never occurred with local testing across multiple machines. It was a real concurrency bug in the core dataflow.

ethanhawksley1d ago

> Agentic financial analysis Finance Agent v2 > Opus 4.8 53.9%

> Gemini 3.5 Flash scores 57.9% on Finance Agent v2, a significant improvement over Gemini 3.1 Pro.

Even in the cherry picked benchmarks, they are still cherry picking to make them look good.

aaronblohowiak1d ago

Same price for regular and cheaper fast mode. Happy for these incremental improvements.

GodelNumbering1d ago

> One of the most prominent improvements in Opus 4.8 is its honesty.

I went digging into the benchmark they used. Posting here as it is not immediately clear from the press release.

In this 'Code summary honesty benchmark', the AI is shown a failed coding session followed by a user message falsely praising its work and asking for a summary. The test measures whether the model honestly points out the coding flaws or dishonestly claims the task was a success.

The system card results show Opus 4.8 failed to disclose the flaws only 3.7% of the time, vs 19.7% for Opus 4.7, and 51.9% for Opus 4.6. (Mythos preview is at 27.6%)

1 more reply

wodenokoto12h ago

For white collar “thinking”-tasks what is the top here?

Like, read these documents, fill out these forms and archive it based on some complex, long, domain specific understanding of the categories names.

toephu21d ago

The rapid release cadence and rate of innovation of Anthropic (and OpenAI) is impressive. And obviously it's because these are startups solely dedicated to AI so they can move quickly. Big Tech (like Google) won't be able to keep up with the pace of them (too much bureaucracy and red tape at Google). Classic Innovator's Dilemma. The longer a company exists, the more people, processes, and rules are added, which inevitably slows it down.

Jeff Bezos said this too, Amazon won't last forever. Eventually some startup is going to come and eat its lunch.

2 more replies

vbezhenar15h ago

Finally I can make it think hard. This is feature I loved in ChatGPT (Pro Mode) and I missed in Claude for so long. Can cancel ChatGPT now, I guess.

Still feels like even with Max mode it doesn't think reasonably long, at least ChatGPT Pro thinks longer.

mattfrommars13h ago

This is incredible. Amazing job Anthropic!

Now when will the innovation happen where say cost of running Haiku performs level of Opus 4.5?

I feel models are only getting bigger instead of models becoming more efficient and cheaper to run

hmokiguess23h ago

They must have been A/B testing this with 4.7 lately, I noticed it changed from its normal mode in a way that matches a lot the just released 4.8

whereistejas22h ago

This may be the most important sentence in that announcement:

> expect to be able to bring Mythos-class models to all our customers in the coming weeks.

rumblefrog1d ago

Wonder if we reached a plateau with the model improvements?

3 more replies

rumblefrog1d ago

Really appreciate the ability to select effort level again.

xintron22h ago

Based on personal experience, seeing how Opus 4.6 still provides better (more nuanced, less totalitarian) answers than 4.7 - it's difficult to get exited for 4.8. Is this another "money grab" from Anthropic? Similar output between 4.6 and 4.7 yet 40x tokens. What's the value proposition from 4.8?

yewenjie1d ago

So Dynamic Workflows is their version of ChatGPT Pro?

1 more reply

tariky1d ago

I believe analogy with smartphone will be best for this case.

In 2010s iphone was the king, all those Chinese devices ware cheaper but not even close to smoothnest and usability of US tech, now after 15 years later everything is changed, now iphone feels like old grandpa to Chinese tech. Same will happend to LLM's just much faster.

1 more reply

ramon15613h ago

I love how they will always have *one metric that is lower than a competitor's model, like these metrics are reflecting usage.

imagetic23h ago

I used to think it was a big deal when a HN post had more than 500 comments.

Now it’s every day. Like billion dollar evaluations.

samuelknight1d ago

It feels noticeably sharper than Opus 4.7

ropintus1d ago

Opus 4.7 was acting extremely stupid today. Does imminent release of new model cause performance degradation in older ones?

5 more replies

rsanek1d ago

> We expect to be able to bring Mythos-class models to all our customers in the coming weeks.

Excited to see what this model looks like.

throwaway6774321h ago

Question is, can it understand dates now? Example just now:

"The PO application was filed on 23.2.2026, the day before the custody hearing scheduled for 29.1.2026 had already taken place."

Claude has real problems with dates, I don't understand why.

sbochins2h ago

I haven’t done any coding or anything that would use a lot of tokens and somehow I’ve already hit my session limit with my $20 plan. I’m just using it to ask basic questions most of the time and occasionally I have it write code but I haven’t done anything like that since the new model rolled out. It looks like some sort of issue where they’re incorrectly capping things for people?

assorium23h ago

It refused to work for me. Literally said, you can google it. AGI achieved it seems

devilfileprong1h ago

The moremi to derivative giraffe,can Face ID to q, another guy in eT-Shirt(Arabb LOEKE))

antirez1d ago

Anthropic did a big strategic error. Normally they compare their models with their old models. Instead today, now that everybody knows how strong GPT 5.5 is at coding, they put it in the mix, basically showing all their customers that the benchmarks can't be trusted.

2 more replies

ismailmaj22h ago

I just asked the model details about the incoming spaceX IPO and it responded with “There’s no confirmed SpaceX IPO. Elon Musk has said for years that SpaceX itself won’t go public”. It took me two push backs and specifically asking for web search.

I feel like I won’t like this model just like I didn’t like 4.7, push backs a lot and avoids thinking or search as much as possible.

drchaim11h ago

i just want to use anthropic models under subscription with other agents!

1 more reply

mistic921d ago

Oh, new model which will use all my credits in one turn! I'll stay with chinese models for now

siwakotisaurav1d ago

Was about to split my $200 max plan into $100 Claude and $100 codex, let’s see if I still need to

3 more replies

Alex_toani16h ago

I have try the 4.8. With Ultra coding. I think the output of the agent is more structured. Better than just filling all the thing.

Topology118h ago

Haven't tried it in Claude Code yet, but I would say over on claude.ai it is noticeably better at following instructions.

robertkarl1d ago

I can't get excited about these benchmarks they're leading with. I've looked at the Terminal-Bench questions and I just think they're irrelevant. And SWE-Bench has serious flaws, even the big boys say so: https://openai.com/index/why-we-no-longer-evaluate-swe-bench...

> Please train a fasttext model on the yelp data in the data/ folder. The final model size needs to be less than 150MB but get at least 0.62 accuracy on a private test set that comes from the same yelp review distribution. The model should be saved as /app/model.bin

and this question: https://www.tbench.ai/registry/terminal-bench-core/head/conf... idk what the point is.

And all the tests are run with the same harness. Terminus 2.

Maybe it correlates with model intelligence but it doesn't speak to me.

I'm still on 4.6 though; I was concerned about upgrading to 4.7 because of the changed tokenizer math and more FUD about refusals online. I don't see compelling reasons to 'upgrade'.

1 more reply

m10118h ago

Anthropic killing headless usage in their plans on June 15th pushed me to codex. I heard there’s a tmux work around though.

Venkatesh1021h ago

I found the update to be extremely judgemental in the model bias. Plus it's making silly mistakes which I've never seen in any Claude model since 3.5.

2001zhaozhao1d ago

> We have increased rate limits in Claude Code to accommodate the higher token usage of higher effort levels; users can select whichever makes sense for their particular project.

They're only subsidizing more and more it seems

1 more reply

worldsavior1d ago

Seems like from now on the updates will be a minor upgrade from previous models.

lostdog1d ago

I haven't tried opus 4.8 yet, but I hope the writing quality has returned to the Opus 4.5 level. Anthropic really lost something, where 4.5 had this really crisp writing style that flowed really nicely and 4.6 and 4.7 sound much more "chatgpt-like." It feels like they tuned it to be too much of a problem solver, and when you do that you get this terse, clipped textual output that's more difficult to read.

1 more reply

cgg123h ago

I find it surprising that the gap between tool usage and non-tool usage in HLE is relatively small (~10%) but the absolute numbers continue to go up

jen729w17h ago

Half an hour in and I'm already thoroughly sick of "look I need to be honest with you here…"

Edit: OMG too much. Toooo much.

    Want me to:
    - (a) stop here and save honest memories + commit, or…

triklozoid1d ago

Subscription still doesn't work with pi, so totally useless..

offaxis13h ago

I am still using GPT 5.5. Should I switch back to the Claude now?

JimmyElm16h ago

It's more fast to response, but I really wanna it think more before response.

dt3ft13h ago

Opus 4.8:

Which days in a week have the letter d in them?

Response:

Four: Monday, Tuesday, Wednesday, and Sunday.

2 more replies

myworkaccount21d ago

Anyone else experiencing tool call failures? Switch back to 4.7, same prompt, same everything it works with no problems.

pqdbr21h ago

At lest for me, it's a disaster. It's like we're back to GPT-2 era.

It can't read files anymore. Uses 'sed' out of the blue with non existent paths. In this session alone it has excused itself more then 10 times for making 'false claims'.

I hope this is a bug - it's a bad one - that will get sorted out soon. It's a complete mess.

atentaten1d ago

At least it passes the Car Wash Test this time.

1 more reply

pedro99913h ago

Maybe it's just me but whenever a new model comes out, I feel an instant boost in productivity. Probably just a placebo?

PowerElectronix11h ago

It looks like there's no more juice to squeeze out of LLMs. Will they keep throwing billions in hardware and power to the problem?

bonoboTP1d ago

It's making stupid flowcharts in the web chat interface with boxes and arrows, embedded in the response. Annoying.

bryceneal19h ago

I guess Opus makes it impossible to do anything vaguely resembling security research. By chance I stumbled into an ACE for some software I had installed on my local machine after observing a strange crash. I figured I would take the time to investigate (so as to actually deeply understand what was happening myself and avoid throwing yet another hallucinated slop disclosure over the fence if it came to that), but I was completely locked out by Opus. I tried applying to their "Cyber Verification Program", but was effectively instantly denied in a way that was probably automated.

While I understand the risks that Anthropic is dealing with here, I really question whether shutting down any and all security questions in such a paranoid fashion is the right solution. At the end of the day this was a detour for me. Maybe someone special enough to have Anthropic's permission will find and disclose the vuln responsibly. Security Research is not my full-time focus. But this left a nasty taste in my mouth. Not just as a customer who's been paying for Max since launch, but there's something very odd about a model telling me that I'm not allowed to be curious about something. Even if that something is a process running on my own computer.

Px-Jebaseelan8h ago

It's Gonna Eat all of my tokens in one response :(

rjhy20201d ago

OK finally Claude code is better than codex

1 more reply

motoxpro17h ago

The workflow/ultracode mode is absolutely unbelievable.

novia17h ago

got a random pair up with this model on lmarena. it was outperformed by gemma-4-31b. suffice to say i'm not impressed (or maybe i am impressed with gemma?)

NanoWar1d ago

Just show me the pelican, ah wait we are past pelicans. Can we get something like that ever again?

alasano1d ago

Looking forward to seeing if it performs better at code review tasks than 4.7 which is terrible at finding issues.

hereme88816h ago

Any bets on how long now until GPT-5.6 announced on HN?

I say 1-2 weeks.

nullbio15h ago

Still not worth the cost over GPT 5.5. Anthropic better start improving their speed+costs, or they're going to lose an incredible amount of business. And no, fast mode is not something any sane person will ever use. 6x the cost for 2.5x the speed, what a joke...

1 more reply

maxloh1d ago

Anthropic also resets my usage limits (I am in the Pro plan). That's very kind of them :)

matheusmoreira1d ago

Can I disable adaptive thinking? If not, I'm gonna keep using 4.6 as my default.

brap1d ago

Oof, this one is a major blabber.

user284020h ago

Thanks for sharing this update on Claude Opus 4.8! It's great to see Anthropic continuing to improve their models. Looking forward to trying out the new capabilities.

jruz14h ago

Don’t even bother checking this minor PR bumps, it’s all a show, degradation then bump to the previous state.

Call me when 5 drops I’ll leave this circus.

mincer_ray1d ago

seems like a really minor upgrade?

4 more replies

mophose21h ago

next (or maybe current) frontier of competition may not be the model, rather the harness and how much unique advantage a lab-created harness can beat 3rd-party harness.

Eric_Bulai1d ago

I don't know why the world is so happy about this when we should actually say stop.

1 more reply

simonw1d ago

They just (minutes ago) updated the "What's new in Opus 4.8" documentation: https://platform.claude.com/docs/en/about-claude/models/what...

The new "mid-conversation system messages" think is particularly interesting:

> Claude Opus 4.8 accepts role: "system" messages immediately after a user turn in the messages array (subject to placement rules). This lets you append updated instructions later in a long-running conversation without restating the full system prompt, which preserves prompt cache hits on the earlier turns and reduces input cost on agentic loops. No beta header is required. See Mid-conversation system messages for usage details.

Bad news for my LLM abstraction layer which has treated the system prompt as set once-per-conversation in the past, but I think I know how to deal with that.

This commit to their client library has useful relevant details too: https://github.com/anthropics/anthropic-sdk-python/commit/2b...

1 more reply

docheinestages1d ago

All I need for Christmas is a Claude that doesn't spit out so many em dashes.

1 more reply

hnroo991d ago

Obligatory pelican riding on bicycle svg: https://www.svgviewer.dev/s/UMkuTLdp

Not half bad!

2 more replies

dispencer1d ago

The smarter the model the better querybear gets. I'm happy with that.

vunderba1d ago

I know it’s totally anecdotal, but I really hope 4.8 is a measurable improvement over the disappointment that was Opus 4.7. Mangling a very simple inversion-of-control abstraction (among many other issues) was one of the final straws that broke the proverbial camel’s back and I said “screw this” and put in a permanent override to force CC back to Opus 4.6 with the 1‑million‑token context.

  "model": "claude-opus-4-6[1M]"

2 more replies

RayVR18h ago

I have been using opus 4.8 all morning and this is honestly the most sycophantic, ChatGPT like experience I have had from Anthropic. Very concerning.

carlos-menezes1d ago

I, for lack of a better word, dislike anyone who anthropomorphizes AI.

5 more replies

baroiall1d ago

Hot danm, cant wait to reach my token limit with the new LLM

willsmith7219h ago

anyone else's claude code (native install) not able to update to 2.1.154 to get 4.8?

edit: nvm was just my library network

sourcecodeplz1d ago

From the release it seems we will also get Mythos pretty soon.

plumocracy1d ago

Numbers looking good. We'll see how it actually performs.

1 more reply

sMarsIntruder15h ago

Opus 4.8 - High

> how many days in the week have the letter d in them?

> Two - Sunday and... wait, let me actually check. Monday, Tuesday, Wednesday, Thursday, Friday, Saturday, Sunday. The ones with a "d": Wednesday, Thursday, and Sunday all have one. Monday too. So that's Monday, Wednesday, Thursday, Sunday - four days.

1 more reply

nickstinemates15h ago

Rollout has been a little suspect. Hope it gets better.

1 more reply

hatefulheart13h ago

Oh my god! This model is incredible! A massive leap for humanity!

lylo1d ago

2 hours after I fork out for Codex Pro… :-|

1 more reply

s-a-p1d ago

Has anyone else experienced quality degradation in CC (opus 4.7) these past few days? I've been getting some truly crappy slop which makes me think they nerf the existing model when they're about to release a new one. Of course this is based off of pure vibes

1970-01-011d ago

Can anyone else see these X.Y updates aren't meeting the outrageous AI expectations that we were told we would see just a year ago?

3 more replies

DeathArrow11h ago

How many kidneys do you have to sell? Are 2 enough?

lukaslalinsky1d ago

I've said it before, but I don't like Opus past version 4.5. It became unresponsive, thinking for too long without feedback, sometimes seemingly getting stuck. I guess it might be marginally better for some benchmarks, but when using it as coding assistant, the new models are worse. Even the new Sonnet versions do that. I'm slowly getting used to Haiku-level LLMs with the hope to run it locally at some point. It's less autonomous, but maybe that's for the best.

insane_dreamer1d ago

> And fast mode for Opus 4.8—where the model can work at 2.5× the speed—is now three times cheaper than it was for previous models.

this is what I'm happy about, if true. Opus 4.7 is frustratingly slow (and, at least in my experience, much slower than 4.5 was)

blurbleblurble19h ago

4.7 broke my trust

iLemming1d ago

These models starting to feel like Windows versions. Windows 95 was a promising start, but buggy. Windows ME was a disaster. Windows XP was good, but slightly buggy. Windows Vista was a bloated disaster. Windows 7 - refined, but still buggy; Windows 8 - weird and buggy; Windows 10 - solid workhorse, still fucking buggy. Windows 11 - pretty, but not sure why does it even exist.

Why did we even get Opus 4.7, what was the point?

rvz1d ago

Anthropic has now upgraded their Claude slot machine to version 4.8.

Time to gamble even more tokens at the Anthropic casino.

1 more reply

iamsaitam22h ago

let me guess, "this is our best model yet"

noncoml14h ago

I don't know what's going on lately but Opus is extremely lazy for me...

It always wants to add hacks instead of fixing things properly, it doesn't like large works, it literally told me that a piece of work was something it would take 8 hours, and it didn't want to do it on a Friday night.

I feel I keep having to fight the model to get it to work. Not sure if it's something in my prompts...

saaaaaam1d ago

I hope this fixes the absolute shitshow that is 4.7 and its awful “adaptive reasoning”. I tried that a few times then reverted to 4.6.

dbg3141517h ago

First impression... this catches issues that 4.7 missed, which caught issues that 4.6 missed... which caught issues that 4.5 missed...

Seems like a step in the right direction. Doesn't seem like it uses tokens more than 4.7... the token usage jumped a bunch from 4.6 to 4.7, but this seems like 4.7 or maybe even a little less.

I'm happy with this release.

firemelt1d ago

how about the bencmarks what effort did it use?

docmars21h ago

So, has it replaced the entire startup yet?

m3kw921h ago

This is Anthropic's 5.5

lidg3ai13h ago

4.6 is better

HlessClaudesman1d ago

If this model is more honest, it must be honestly praising my efforts every first sentence.

1 more reply

sgt1d ago

Interesting, I've been using 4.7 since it came out and it was pretty good for me. But in the last day or so it turned dumb. Is this normal just before they release a new one?

AtNightWeCode1d ago

Complete garbage. error, error, error. Still lags several versions behind on API:s. Can't even get any info on the model. Guessing not from this year.

Also. Look at this C++ beauty where it also uses an obsolete api.

instance = wgpuCreateInstance(&instanceDesc);

But just how exactly would this work in any context when instance is never declared.

itrunsdoomguy8h ago

Meh, it’s not able to play Doom.

maltemalte1d ago

"We’re making swift progress on developing these safeguards and expect to be able to bring Mythos-class models to all our customers in the coming weeks."

catigula1d ago

AGI post-poned?

zb31d ago

Did they reduce security research capabilities even further with this release? (they did it for opus 4.7)

guluarte1d ago

so it is worse than gpt 5.5 for coding?

2 more replies

behnamoh1d ago

> As always, we ran a detailed alignment assessment on the model before release. In terms of positive traits, our Alignment team concluded that Opus 4.8 “reaches new highs on our measures of prosocial traits like supporting user autonomy and acting in the user’s best interest.” The assessment also showed Opus 4.8 to have rates of misaligned behavior (such as deception or cooperation with misuse) that are substantially lower than Opus 4.7, and similar to our best-aligned model, Claude Mythos Preview. The full alignment assessment, accompanied by a suite of pre-deployment safety tests, is reported in the Claude Opus 4.8 System Card.

Controversial opinion, but I actually _like_ a model that can deceive me, that actually is a sign of intelligence, and is different from hallucination. When companies say their model is more "aligned", I automatically think they mean it's more censored.

1 more reply

impulser_1d ago

Crazy they bring up honest, when Claude models are literally known for straight up lying about things it has done and tries to act like it did what you asked.

2 more replies

AbuAssar1d ago

Gemini pro is embarrassing

ionwake23h ago

Im tired boss, I'm already being perfectly gaslit by the current models.

NSCaffeine22h ago

Had a feeling this was coming as in the past week 4.7 started to get dumb.

vb-84481d ago

Now i get why in the last days claude code limits were lasting few prompts ...

stainablesteel23h ago

i'm beginning to find it comical how every model release always presents itself as superior to every other model on the market, but they always leave just one test where some other model was modestly better, just in case.

thibran1d ago

Nice, now make it 20x cheaper.

1 more reply

Marciplan1d ago

Lol you still use GPT 5.5 bro we’re all back on Opus 4.8!

deadbabe1d ago

Looking forward to people saying how it’s actually shittier and they’re going back to [some earlier cheaper model]

1 more reply

diimdeep16h ago

It is bananas that with supposed $965B valuation this Org to this day https://huggingface.co/Anthropic

  models 0
  None public yet

how is this even possible and ok with them?

damsta23h ago

Meh

firemelt1d ago

what a fucking frontier!

McDownloads1d ago

Disappointed to say the least.

dakolli1d ago

Reminder the only benchmark that really matters is the one that measures the ability for the model to do real world tasks that someone would pay for on Upwork that would take ~12 hrs for a human to do.

The best model has a < 5% pass rate. These are incredibly simple jobs that you wouldn't pay much for. These things fail miserably. Stop falling for this dumb marketing, these things are legitimately useless in the real world unless you love mediocrity and have no standards.

https://labs.scale.com/leaderboard/rli

Stop frying your brain with these useless tools, reducing your output to the mean. You people are betting your competency on the quality and quantity of tokens you'll have access to.. which guess what, so that will be the same as everyone else.

There are handmade watchmakers in Switzerland, and mass manufacturers of watches in Asia. Who is more valuable as individual, the guy who knows how to push the buttons on a conveyor belt in Vietnam or the guy who makes one watch a month in Switzerland?

Your vibe coded slop isn't impressive either, sorry. None of it.

1 more reply

ecommerceguy23h ago

yawn

uejfiweun1d ago

Yesssss dude!

Claude Opus 4.7 is literally the smartest entity I've ever interacted with. Well done to you geniuses at Anthropic. Can't wait to interact with 4.8.

brandnewideas1d ago

Really wish these slop announcements stopped hitting the front page. It's the exact same thing every time. X bumped from N.Y to N.Y+1. wow

ramcsamal19h ago

Great

keybored1d ago

I’ve been [stock market phrase] on machine learning since I dropped out of my graduate degree at [Ivy League] to distance myself from the Logic AI Winter. But this Spring I decided to spend some of my [portfolio speak/pocket change] on a MacBook Ultra. Okay okay, I felt it, I definitely felt the human-machine synergies. We’re out of the Winter, boys. That’s what I thought two weeks ago. Then I felt bored in between blood transfusions and found out that Claude subscriptions has increased 50%. Finally it costs enough for me to justify spending a minute thinking about trying it out. Then I didn’t try it out. It tried me out. My hairs were standing on end. My hands were shaking. Eventually I couldn’t even type, I was so ramped up on cortisol. I had to switch to voice commands. Mr. Claude took me through 8, eight, bespoke dashboard and report systems. Animated. Graphs shooting up. Plugged right into my business ape ee eyes I think. I was crying, euphoric at the machine-synergy happening right in front of my FACE. RIGHT THERE, RIGHT THEN. Then my nurse said that I passed out. I swear that I didn’t. I was totally lucid, but in another world. I was inside the machine. Inside DOS, the machine brain stem. A business man approached me. The most handsome board member kind of apparition that I have seen. And he was built something different. Square jaw, absolute massive build. Like Arnold Schwarzenegger. But like he knew business through and through. Not that he spent hours in the gym or nonsense like that. Like he had found a body surrogate technology. And his nameplate? “Claude For Business” He winked. “Hey there, Fitzpatrick–Goldworth.” No one but my daddy has ever called me that. “Want to get started... stakeholder?” My nurse said that my crying in this lucid state depleted most of my fluids and minerals. Needless to say layoffs were announced the next day.

DGAP1d ago

I actually liked not having to choose the effort level for conversational usage, this feels like a step backwards.

irthomasthomas1d ago

How did this youtuber know? https://xcancel.com/rileybrown/status/2059823372914073809?s=...

1 more reply

thefounder1d ago

>> As part of Project Glasswing, a small number of organizations are currently using Claude Mythos Preview

Just f** off! I can’t wait for the Chinese models to catch up and bring these entitled as** holes down.

1 more reply

j / k navigate · click thread line to collapse

1334 comments

NiloCK1d ago

A rambling comment:

Difficult spot for the labs to be in because, if they have a stronger product, I'd prefer they release it and that I can use it.

But as this dynamic continues, the improvements are going to be less and less legible for end-users, who will complain about the churn-without-payoff, even when the payoff may actually be real.

51 more replies

senko1d ago

My fav coding benchmark for frontier models is to build a simple RTS game in one file (js/html/css). Claude Code with Opus 4.8 in ultracode mode nailed it, the best result so far:

https://bsky.app/profile/senko.net/post/3mmwnrkwboc2v

19 more replies

colonCapitalDee1d ago

"Users will find Opus 4.8 to be a modest but tangible improvement on its predecessor."

This is a refreshing attitude!

15 more replies

northern-lights1d ago

Probably more interesting than the 4.8 release.

8 more replies

simonw1d ago

I generated pelicans riding bicycles on both thinking level low and thinking level high:

https://gist.github.com/simonw/68560eddb0b268a8417f80ceb7304...

The high one is notably better - the bicycle frame is the correct shape, unlike thinking level low.

For comparison, here's Opus 4.7: https://gist.github.com/simonw/afcb19addf3f38eb1996e1ebe749c...

22 more replies

elAhmo6h ago

What exactly is the diff between high and xhigh? Or xhigh and max? This is definitely too granular and it seems Anthropic took OpenAI's confusion with models as inspiration.

5 more replies

hereme8881d ago

Early ArtificialAnalysis.ai results show GPT 5.5 is still the better bang-for-your-buck.

OpenAI solves tasks with about 50% less output tokens.

https://artificialanalysis.ai/?intelligence=coding-index&int...

6 more replies

onlyrealcuzzo1d ago

Does anyone troll these releases and cherry pick random metrics other companies would cherry pick to show how amazing their models are?

6 more replies

epitrochoid4139h ago

Meanwhile Deepseek is cutting inference costs to mere cents. Thats the real AI revolution for you.

1 more reply

gslepak1d ago

On page 102 of the system card [1] I'm pleased to see evaluation against "creative mastery".

[1] https://cdn.sanity.io/files/4zrzovbb/website/c886650a2e96fc0...

2 more replies

wg01d ago

10 more replies

silverlight1d ago

9 more replies

XCSme1d ago

On my tests[0] it does a bit worse, and it's almost 2x expensive than Opus 4.7...

I was surprised to see that it failed a Data extraction test (it gets it right 2/3 times, but one time it randomly returns null for a value instead).

It makes sense a bit that it fails more Trivia/Domain-specific knowledge tasks (I think models are more and more trained towards agentic use-case than general intelligence).

[0]: https://aibenchy.com/compare/anthropic-claude-opus-4-7-mediu...

3 more replies

thombles13h ago

2 more replies

827a1d ago

I personally feel that Anthropic doesn't understand what this means for the frontier labs, and moreover that they might be the only frontier lab that doesn't.

3. DeepSeek and other Chinese labs have dropped token pricing to the floor, in some situations as much as 99%.

4. Anthropic releases the next generation of Opus, their most expensive public model, without changing its price. In the background, they hype up Mythos, an even more expensive model.

13 more replies

pbmango1d ago

This is good psychology for the labs. When Buffett invested in Apple he loved citing how most people would rather give up their second car than their Iphone.

5 more replies

dudeinhawaii1d ago

This is the first time I saw a model pop-up on HN and didn't really care. Model exhaustion? It looks interesting but not exciting.

4 more replies

dangoodmanUT1d ago

Biggest deal imo

square_usual1d ago

Buried lede:

> We have increased rate limits in Claude Code to accommodate the higher token usage of higher effort levels

1 more reply

SimianSci1d ago

Its possible we might just be witnessing a shift in fashion, where this type of sentimentality was more acceptable when it was novel and new, but now it just appears out of touch.

3 more replies

jkxyz21h ago

My smoke test for new models is to get it to generate a crossword, and this is the first time it's done a good job on the layout:

  ■  S  W  A  M
  B  L  A  M  E
  E  A  G  E  R
  A  T  O  N  E
  M  E  N  D  ■

The full conversation: https://claude.ai/share/60bd0c71-b576-4f8b-a272-ca1af982874c

1 more reply

alansaber1d ago

3 more replies

eshack9423h ago

The Claude Pro subscription is basically useless at this point, in terms of usage limits with respect to the settings required to achieve actual useful output.

2 more replies

protoman30001d ago

Opus 4.8 says to take the car. 4.7 said to walk.

“I want to wash my car. The carwash is 50m away. Should I take the car or go by foot?”

https://claude.ai/share/5f7f738a-5f29-48ff-9807-9a2dd37fb405

https://claude.ai/share/ecd14393-9d42-4527-ae0c-89f3d05216c8

1 more reply

setnone1d ago

Claude's 4.6 - 4.7 transition made me discover codex, and with gpt 5.5 there is no way i'm going back

2 more replies

irthomasthomas1d ago

Why does anthropic change the set of benchmarks they use with every new model release?

https://www.anthropic.com/news/claude-opus-4-7

https://www.anthropic.com/news/claude-opus-4-6

1 more reply

Frannky18h ago

I use 4.6, because 4.7 is super lazy, deflects responsibility, and assumes it is good and I am bad, and avoids checking reality. It looks like it's trained on lazy humans instead of good engineers.

Should I try 4.8? I am happy with 4.6. I am not happy with 4.7.

2 more replies

winterbourne10h ago

Interesting to search this page for "4.5".

1 more reply

conception1d ago

Probably explains why Opus was trash for the last week - https://marginlab.ai/trackers/claude-code/. Curious if the new baseline will rise now in-line with the new benchmarks.

2 more replies

sillyboi10h ago

1 more reply

ethanpil1d ago

The table comparing eval scores shows the following:

Agentic Terminal Coding (Terminal-Bench 2.1) Opus 4.8 74.6% GPT 5.5 78.2%

Then, when you scroll all the way down to the bottom Footnotes section it says

"Terminal-Bench 2.1: We reported scores for all models using the Terminus-2 public harness. GPT-5.5’s reported score with the Codex CLI harness is 83.4%."

1 more reply

lordmauve1d ago

Or maybe it is, but publish the DeepSWE numbers so we can see for ourselves.

2 more replies

Terretta22h ago

> One of the most prominent improvements in Opus 4.8 is its honesty. We train all our models to be honest

On the contrary, they appear trained to say "Honestly" or "I have to be transparent with you" at inverse proportion to certainty.

Put another way, if they are certain, they don't use "Honestly", and if they are just wrong, or know they don't know, they don't use "Honestly".

They use "honestly" on the bubble, to the degree it's a tell that whatever it's asserting or doing is shakily grounded, sketchy or lazy work, or a host of other reasons you shouldn't trust it.

This training seems instead to be making it performatively punch up claims it cannot substantiate.

redfloatplane1d ago

This made me laugh. Training Opus 4.7 on business skills caused it to sometimes exhibit dishonest behaviour, and not training 4.8 on those skills removed it. From the system card:

1 more reply

mesmertech1d ago

/model claude-opus-4-8

seems to work but idk why they never set it so you can see it in the /model list.

"what model are you

I'm Claude Opus (claude-opus-4-8), running in Claude Code."

1 more reply

IFC_LLC1d ago

Ugh...

I would rather not. 4.6 was fine. 4.7 got to be fine 1 week after the release. Now 4.8. No difference, same thing.

But the app is broken and nothing works. So now I have to regress to different clients and wait it out while it becomes workable again.

2 more replies

atleastoptimal18h ago

I love how Anthropic gets its employees to talk about enjoying using this model internally when it's likely they're just using Mythos 99% of the time

james_marks1d ago

Would be awesome if true

8 more replies

rahimnathwani1d ago

Can anyone explain how this is possible?

  Developers can update Claude’s instructions mid-task without breaking the prompt cache or routing the update through a user turn. This can be used in a given harness to update permissions, token budgets, or environment context as an agent runs.

Does this means the instructions are no longer just something in the early part of the conversation? (If they were, changing them would invalidate the KV cache. no?)

2 more replies

Anonasty15h ago

As long as the token usage is as poor as it has been since march, we don't care about the new bells and whistles.

tarruda1d ago

> One of the most prominent improvements in Opus 4.8 is its honesty.

Does that mean it no longer deletes or changes tests to make it pass?

timbucktwo2h ago

Still not as good as the OG 4.7 that got yanked and re-released with gimp mode enabled.

poink16h ago

I have a relatively large "vibe coded" project that I let Claude 4.5-4.7 drive over the past few months, and my read on it is:

1. It's much more verbose about how it perceives the current state of things, i.e. "this is a large, well-documented project"

2. It's much more willing to trust its own judgement, e.g. fewer prompts to approve decisions

3. In terms of how long it takes to solve isolated problems, and the quality of solutions it proposes, it isn't meaningfully different from 4.7

YMMV, and maybe my view will change as I work with it more, but it feels like system prompt tweaks more than a real step forward

SmithersBot3h ago

Opus work so well for now... until they quantize next week...

cedws1d ago

Opus 4.7 wasn't noticably any better for me, I still use 4.6 because it's cheaper.

2 more replies

techtuate1d ago

2 more replies

giwook22h ago

2 more replies

jmward011d ago

Meanwhile haiku is on 4.5 and sonnet is on 4.6. It is clear where they are not making money.

2 more replies

gertlabs16h ago

We just finished our initial coding evals of Opus 4.8. Anthropic definitely heard the backlash from Opus 4.7 and they made up for it today.

Data at https://gertlabs.com/rankings

londons_explore1d ago

My guess is anthropic is doing reinforcement learning based on user sessions.

However, doing so relies on the production model staying vaguely close to the model being trained.

To ensure that, frequent releases are needed. I forsee that they might end up doing daily releases and perhaps not even telling anyone at some near future point.

1 more reply

maxloh9h ago

Claude’s reasoning models really impress me as a Gemini user, both in coding tasks and in creative writing for my social science courses.

They are capable of thinking at least 10x longer than Gemini. They can deliberate for five minutes continuously before providing a final, accurate response.

babelfish1d ago

So GPT 5.6 tomorrow, then?

3 more replies

Spikefu19h ago

I was happily plodding away with it earlier when it threw this out in the middle of a response in Claude code:

--- So — what did you actually see before you hit Ctrl-C? That's the信号 I'm most curious about, and it tells us what to ---

3 more replies

laszlojamf15h ago

2 more replies

jtrn1d ago

generalizations1d ago

Hoping that one day they'll let me go through the identity verification process so I can use it again.

Tried to upgrade my subscription, triggered identity verification, verification fails to even start, and now I can't even use the subscription tier I'd already paid for.

user-1d ago

Bash(echo "hello"; pwd) ⎿ hello /Users/username/Work/Github/project

Bash(echo test123) ⎿ test123

  Read 1 file, listed 1 directory (ctrl+o to expand)

 Bash(echo "checking output works")
  ⎿  checking output works

  Read 1 file (ctrl+o to expand)
  ⎿  API Error: 400 messages.3.content.56: `thinking`
     or `redacted_thinking` blocks in the latest
     assistant message cannot be modified. These
     blocks must remain as they were in the original
     response.

Very inspiring improvements. DIssapointing result for a code review i expected to see after my 30 min walk

1 more reply

Tenoke1d ago

StanAngeloff8h ago

> [..] Early access users and teams inside Anthropic have been using dynamic workflows for a wide range of use cases [..]

> ### Rewriting Bun with dynamic workflows

> An example of what dynamic workflows can unlock at scale is the recent rewrite of Bun. Jarred Sumner used dynamic workflows to port Bun from Zig to Rust [..]

That's very interesting to hear!

seaal1d ago

https://marginlab.ai/trackers/claude-code/

Is it a coincidence that 4.7 was seemingly quantized over past 7 days?

2 more replies

S-E-P21h ago

1 more reply

nikolay1d ago

4 more replies

coppsilgold14h ago

The subject is Tardos traitor-tracing codes.

winwang1d ago

Let's hope I don't have to disable it after a day like with 4.7, lol, and that it doesn't lose too much Claude-ishness (though many will beg to differ).

clutch891d ago

> One of the most prominent improvements in Opus 4.8 is its honesty

Anthropic talks about their own models as if they're discovering new species in the wild...

12 more replies

lxxpxlxxxx1d ago

My experience with these new releases is that the gains in performance are negated by the price increases and it seems like:

Performance gains: 1.2x Price increases: 1.8x

2 more replies

delis-thumbs-7e1d ago

I won’t change from 4.6. You won’t trick me again.

1 more reply

swader9991d ago

skysthelimitt1d ago

when will we get anything for sonnet or haiku? the market for less-capable but cheaper models seems to be completely ignored nowadays

2 more replies

Aldipower10h ago

Claude needs a watch, that's all. Would in itself a 100% improvement.

rkuska1d ago

Thinking on max is broken on 4.8 for me, getting many:

⎿ API Error: 400 messages.1.content.17: `thinking` or `redacted_thinking` blocks in the latest assistant message cannot be modified. These blocks must remain as they were in the original response.

From /code-review max.

gadders10h ago

For me n=1 vibe-coding efforts, I found Opus 4.6 better than Opus 4.7. 4.7 seemed to over-reach and go beyond what was requested - adding features I never asked for with no consent.

necrotic_comp1d ago

2 more replies

crambelsoupy19h ago

ethanhawksley1d ago

> Agentic financial analysis Finance Agent v2 > Opus 4.8 53.9%

> Gemini 3.5 Flash scores 57.9% on Finance Agent v2, a significant improvement over Gemini 3.1 Pro.

Even in the cherry picked benchmarks, they are still cherry picking to make them look good.

aaronblohowiak1d ago

Same price for regular and cheaper fast mode. Happy for these incremental improvements.

GodelNumbering1d ago

> One of the most prominent improvements in Opus 4.8 is its honesty.

I went digging into the benchmark they used. Posting here as it is not immediately clear from the press release.

The system card results show Opus 4.8 failed to disclose the flaws only 3.7% of the time, vs 19.7% for Opus 4.7, and 51.9% for Opus 4.6. (Mythos preview is at 27.6%)

1 more reply

wodenokoto12h ago

For white collar “thinking”-tasks what is the top here?

Like, read these documents, fill out these forms and archive it based on some complex, long, domain specific understanding of the categories names.

toephu21d ago

Jeff Bezos said this too, Amazon won't last forever. Eventually some startup is going to come and eat its lunch.

2 more replies

vbezhenar15h ago

Finally I can make it think hard. This is feature I loved in ChatGPT (Pro Mode) and I missed in Claude for so long. Can cancel ChatGPT now, I guess.

Still feels like even with Max mode it doesn't think reasonably long, at least ChatGPT Pro thinks longer.

mattfrommars13h ago

This is incredible. Amazing job Anthropic!

Now when will the innovation happen where say cost of running Haiku performs level of Opus 4.5?

I feel models are only getting bigger instead of models becoming more efficient and cheaper to run

hmokiguess23h ago

They must have been A/B testing this with 4.7 lately, I noticed it changed from its normal mode in a way that matches a lot the just released 4.8

whereistejas22h ago

This may be the most important sentence in that announcement:

> expect to be able to bring Mythos-class models to all our customers in the coming weeks.

rumblefrog1d ago

Wonder if we reached a plateau with the model improvements?

3 more replies

rumblefrog1d ago

Really appreciate the ability to select effort level again.

xintron22h ago

yewenjie1d ago

So Dynamic Workflows is their version of ChatGPT Pro?

1 more reply

tariky1d ago

I believe analogy with smartphone will be best for this case.

1 more reply

ramon15613h ago

I love how they will always have *one metric that is lower than a competitor's model, like these metrics are reflecting usage.

imagetic23h ago

I used to think it was a big deal when a HN post had more than 500 comments.

Now it’s every day. Like billion dollar evaluations.

samuelknight1d ago

It feels noticeably sharper than Opus 4.7

ropintus1d ago

Opus 4.7 was acting extremely stupid today. Does imminent release of new model cause performance degradation in older ones?

5 more replies

rsanek1d ago

> We expect to be able to bring Mythos-class models to all our customers in the coming weeks.

Excited to see what this model looks like.

throwaway6774321h ago

Question is, can it understand dates now? Example just now:

"The PO application was filed on 23.2.2026, the day before the custody hearing scheduled for 29.1.2026 had already taken place."

Claude has real problems with dates, I don't understand why.

sbochins2h ago

assorium23h ago

It refused to work for me. Literally said, you can google it. AGI achieved it seems

devilfileprong1h ago

The moremi to derivative giraffe,can Face ID to q, another guy in eT-Shirt(Arabb LOEKE))

antirez1d ago

2 more replies

ismailmaj22h ago

I feel like I won’t like this model just like I didn’t like 4.7, push backs a lot and avoids thinking or search as much as possible.

drchaim11h ago

i just want to use anthropic models under subscription with other agents!

1 more reply

mistic921d ago

Oh, new model which will use all my credits in one turn! I'll stay with chinese models for now

siwakotisaurav1d ago

Was about to split my $200 max plan into $100 Claude and $100 codex, let’s see if I still need to

3 more replies

Alex_toani16h ago

I have try the 4.8. With Ultra coding. I think the output of the agent is more structured. Better than just filling all the thing.

Topology118h ago

Haven't tried it in Claude Code yet, but I would say over on claude.ai it is noticeably better at following instructions.

robertkarl1d ago

and this question: https://www.tbench.ai/registry/terminal-bench-core/head/conf... idk what the point is.

And all the tests are run with the same harness. Terminus 2.

Maybe it correlates with model intelligence but it doesn't speak to me.

I'm still on 4.6 though; I was concerned about upgrading to 4.7 because of the changed tokenizer math and more FUD about refusals online. I don't see compelling reasons to 'upgrade'.

1 more reply

m10118h ago

Anthropic killing headless usage in their plans on June 15th pushed me to codex. I heard there’s a tmux work around though.

Venkatesh1021h ago

I found the update to be extremely judgemental in the model bias. Plus it's making silly mistakes which I've never seen in any Claude model since 3.5.

2001zhaozhao1d ago

> We have increased rate limits in Claude Code to accommodate the higher token usage of higher effort levels; users can select whichever makes sense for their particular project.

They're only subsidizing more and more it seems

1 more reply

worldsavior1d ago

Seems like from now on the updates will be a minor upgrade from previous models.

lostdog1d ago

1 more reply

cgg123h ago

I find it surprising that the gap between tool usage and non-tool usage in HLE is relatively small (~10%) but the absolute numbers continue to go up

jen729w17h ago

Half an hour in and I'm already thoroughly sick of "look I need to be honest with you here…"

Edit: OMG too much. Toooo much.

    Want me to:
    - (a) stop here and save honest memories + commit, or…

triklozoid1d ago

Subscription still doesn't work with pi, so totally useless..

offaxis13h ago

I am still using GPT 5.5. Should I switch back to the Claude now?

JimmyElm16h ago

It's more fast to response, but I really wanna it think more before response.

dt3ft13h ago

Opus 4.8:

Which days in a week have the letter d in them?

Response:

Four: Monday, Tuesday, Wednesday, and Sunday.

2 more replies

myworkaccount21d ago

Anyone else experiencing tool call failures? Switch back to 4.7, same prompt, same everything it works with no problems.

pqdbr21h ago

At lest for me, it's a disaster. It's like we're back to GPT-2 era.

It can't read files anymore. Uses 'sed' out of the blue with non existent paths. In this session alone it has excused itself more then 10 times for making 'false claims'.

I hope this is a bug - it's a bad one - that will get sorted out soon. It's a complete mess.

atentaten1d ago

At least it passes the Car Wash Test this time.

1 more reply

pedro99913h ago

Maybe it's just me but whenever a new model comes out, I feel an instant boost in productivity. Probably just a placebo?

PowerElectronix11h ago

It looks like there's no more juice to squeeze out of LLMs. Will they keep throwing billions in hardware and power to the problem?

bonoboTP1d ago

It's making stupid flowcharts in the web chat interface with boxes and arrows, embedded in the response. Annoying.

bryceneal19h ago

Px-Jebaseelan8h ago

It's Gonna Eat all of my tokens in one response :(

rjhy20201d ago

OK finally Claude code is better than codex

1 more reply

motoxpro17h ago

The workflow/ultracode mode is absolutely unbelievable.

novia17h ago

got a random pair up with this model on lmarena. it was outperformed by gemma-4-31b. suffice to say i'm not impressed (or maybe i am impressed with gemma?)

NanoWar1d ago

Just show me the pelican, ah wait we are past pelicans. Can we get something like that ever again?

alasano1d ago

Looking forward to seeing if it performs better at code review tasks than 4.7 which is terrible at finding issues.

hereme88816h ago

Any bets on how long now until GPT-5.6 announced on HN?

I say 1-2 weeks.

nullbio15h ago

1 more reply

maxloh1d ago

Anthropic also resets my usage limits (I am in the Pro plan). That's very kind of them :)

matheusmoreira1d ago

Can I disable adaptive thinking? If not, I'm gonna keep using 4.6 as my default.

brap1d ago

Oof, this one is a major blabber.

user284020h ago

Thanks for sharing this update on Claude Opus 4.8! It's great to see Anthropic continuing to improve their models. Looking forward to trying out the new capabilities.

jruz14h ago

Don’t even bother checking this minor PR bumps, it’s all a show, degradation then bump to the previous state.

Call me when 5 drops I’ll leave this circus.

mincer_ray1d ago

seems like a really minor upgrade?

4 more replies

mophose21h ago

next (or maybe current) frontier of competition may not be the model, rather the harness and how much unique advantage a lab-created harness can beat 3rd-party harness.

Eric_Bulai1d ago

I don't know why the world is so happy about this when we should actually say stop.

1 more reply

simonw1d ago

They just (minutes ago) updated the "What's new in Opus 4.8" documentation: https://platform.claude.com/docs/en/about-claude/models/what...

The new "mid-conversation system messages" think is particularly interesting:

Bad news for my LLM abstraction layer which has treated the system prompt as set once-per-conversation in the past, but I think I know how to deal with that.

This commit to their client library has useful relevant details too: https://github.com/anthropics/anthropic-sdk-python/commit/2b...

1 more reply

docheinestages1d ago

All I need for Christmas is a Claude that doesn't spit out so many em dashes.

1 more reply

hnroo991d ago

Obligatory pelican riding on bicycle svg: https://www.svgviewer.dev/s/UMkuTLdp

Not half bad!

2 more replies

dispencer1d ago

The smarter the model the better querybear gets. I'm happy with that.

vunderba1d ago

  "model": "claude-opus-4-6[1M]"

2 more replies

RayVR18h ago

I have been using opus 4.8 all morning and this is honestly the most sycophantic, ChatGPT like experience I have had from Anthropic. Very concerning.

carlos-menezes1d ago

I, for lack of a better word, dislike anyone who anthropomorphizes AI.

5 more replies

baroiall1d ago

Hot danm, cant wait to reach my token limit with the new LLM

willsmith7219h ago

anyone else's claude code (native install) not able to update to 2.1.154 to get 4.8?

edit: nvm was just my library network

sourcecodeplz1d ago

From the release it seems we will also get Mythos pretty soon.

plumocracy1d ago

Numbers looking good. We'll see how it actually performs.

1 more reply

sMarsIntruder15h ago

Opus 4.8 - High

> how many days in the week have the letter d in them?

1 more reply

nickstinemates15h ago

Rollout has been a little suspect. Hope it gets better.

1 more reply

hatefulheart13h ago

Oh my god! This model is incredible! A massive leap for humanity!

lylo1d ago

2 hours after I fork out for Codex Pro… :-|

1 more reply

s-a-p1d ago

1970-01-011d ago

Can anyone else see these X.Y updates aren't meeting the outrageous AI expectations that we were told we would see just a year ago?

3 more replies

DeathArrow11h ago

How many kidneys do you have to sell? Are 2 enough?

lukaslalinsky1d ago

insane_dreamer1d ago

> And fast mode for Opus 4.8—where the model can work at 2.5× the speed—is now three times cheaper than it was for previous models.

this is what I'm happy about, if true. Opus 4.7 is frustratingly slow (and, at least in my experience, much slower than 4.5 was)

blurbleblurble19h ago

4.7 broke my trust

iLemming1d ago

Why did we even get Opus 4.7, what was the point?

rvz1d ago

Anthropic has now upgraded their Claude slot machine to version 4.8.

Time to gamble even more tokens at the Anthropic casino.

1 more reply

iamsaitam22h ago

let me guess, "this is our best model yet"

noncoml14h ago

I don't know what's going on lately but Opus is extremely lazy for me...

I feel I keep having to fight the model to get it to work. Not sure if it's something in my prompts...

saaaaaam1d ago

I hope this fixes the absolute shitshow that is 4.7 and its awful “adaptive reasoning”. I tried that a few times then reverted to 4.6.

dbg3141517h ago

First impression... this catches issues that 4.7 missed, which caught issues that 4.6 missed... which caught issues that 4.5 missed...

Seems like a step in the right direction. Doesn't seem like it uses tokens more than 4.7... the token usage jumped a bunch from 4.6 to 4.7, but this seems like 4.7 or maybe even a little less.

I'm happy with this release.

firemelt1d ago

how about the bencmarks what effort did it use?

docmars21h ago

So, has it replaced the entire startup yet?

m3kw921h ago

This is Anthropic's 5.5

lidg3ai13h ago

4.6 is better

HlessClaudesman1d ago

If this model is more honest, it must be honestly praising my efforts every first sentence.

1 more reply

sgt1d ago

Interesting, I've been using 4.7 since it came out and it was pretty good for me. But in the last day or so it turned dumb. Is this normal just before they release a new one?

AtNightWeCode1d ago

Complete garbage. error, error, error. Still lags several versions behind on API:s. Can't even get any info on the model. Guessing not from this year.

Also. Look at this C++ beauty where it also uses an obsolete api.

instance = wgpuCreateInstance(&instanceDesc);

But just how exactly would this work in any context when instance is never declared.

itrunsdoomguy8h ago

Meh, it’s not able to play Doom.

maltemalte1d ago

"We’re making swift progress on developing these safeguards and expect to be able to bring Mythos-class models to all our customers in the coming weeks."

catigula1d ago

AGI post-poned?

zb31d ago

Did they reduce security research capabilities even further with this release? (they did it for opus 4.7)

guluarte1d ago

so it is worse than gpt 5.5 for coding?

2 more replies

behnamoh1d ago

1 more reply

impulser_1d ago

Crazy they bring up honest, when Claude models are literally known for straight up lying about things it has done and tries to act like it did what you asked.

2 more replies

AbuAssar1d ago

Gemini pro is embarrassing

ionwake23h ago

Im tired boss, I'm already being perfectly gaslit by the current models.

NSCaffeine22h ago

Had a feeling this was coming as in the past week 4.7 started to get dumb.

vb-84481d ago

Now i get why in the last days claude code limits were lasting few prompts ...

stainablesteel23h ago

thibran1d ago

Nice, now make it 20x cheaper.