
stingraycharles 1mo ago

> Assuming it is almost as good as Opus 4.6 (which benchmarks seem to give evidence for)

That’s a big if. It’s my experience that models that perform very well on benchmarks do not necessarily perform well in real life.

I’ve mostly started ignoring the benchmarks and running my own evals.


ting0 1mo ago

> It’s my experience that models that perform very well on benchmarks do not necessarily perform well in real life

Well, yeah... Like Opus 4.5, 4.6, 4.7. Top of the benchmarks, and yet they're a pile of crap at the moment and have been for months.

jatora 1mo ago

If benchmarks are all to be believed, then Gemini 3.1 and Grok 4.2 are still in the lead pack. A laughable notion to anyone who has actually tried using them and compared.
