0 points · verdverm · 5mo ago

We're also in benchmark saturation territory. I heard it speculated that Anthropic emphasizes benchmarks less in their publications because internally they don't care about them nearly as much as making a model that works well day-to-day.

stego-tech 5mo ago

These models still consistently fail the only benchmark that matters: if I give you a task, can you complete it successfully without making shit up?

Thus far they all fail. Code outputs don’t run, or variables aren’t captured correctly, or hallucinations are stated as factual rather than suspect or “I don’t know.”

It’s 2000s PC gaming all over again (“gotta game the benchmark!”).

snet0 5mo ago

To say that a model won't solve a problem is unfair. Claude Code, with Opus 4.5, has solved plenty of problems for me.

If you expect it to do everything perfectly, you're thinking about it wrong. If you can't get it to do anything perfectly, you're using it wrong.

jacquesm 5mo ago

That means you're probably asking it to do very simple things.

verdverm (OP) 5mo ago

I'm not sure. Here's my anecdotal counterexample: I was able to get gemini-2.5-flash, in two turns, to understand and implement something I had already done separately, and it found another bug (one I had also fixed, but forgot was in this path).

That I was able to have a flash model replicate my own solution to two problems in two turns is just the opposite experience of your consistency argument. I'm using tasks I've already solved as the evals while developing my custom agentic setup (prompts/tools/envs). The models are able to do more of them today than they were even 6-12 months ago (pre-thinking models).

https://bsky.app/profile/verdverm.com/post/3m7p7gtwo5c2v
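
To illustrate the idea of reusing already-solved tasks as evals, here is a minimal sketch. It is not the setup described in the comment above: `EvalTask`, `run_evals`, `pass_rate`, the `run_agent` callable, and the stub agent are all hypothetical names, and the checks are deliberately crude string matches standing in for real verification such as running tests.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class EvalTask:
    """One previously solved task, reused as a regression-style eval."""
    name: str
    prompt: str
    check: Callable[[str], bool]  # True if the output matches the known fix
    max_turns: int = 2


def run_evals(tasks: List[EvalTask],
              run_agent: Callable[[str, int], str]) -> Dict[str, bool]:
    """Run every task through the agent and record pass/fail per task."""
    return {t.name: t.check(run_agent(t.prompt, t.max_turns)) for t in tasks}


def pass_rate(results: Dict[str, bool]) -> float:
    """Fraction of tasks that passed (0.0 when there are no tasks)."""
    return sum(results.values()) / len(results) if results else 0.0


def stub_agent(prompt: str, max_turns: int) -> str:
    # Hypothetical placeholder standing in for a real model call.
    return "use <= in the loop bound" if "off-by-one" in prompt else "not sure"


tasks = [
    EvalTask("loop-bound", "Fix the off-by-one in the pager",
             lambda out: "<=" in out),
    EvalTask("cache-key", "Fix the stale cache key bug",
             lambda out: "cache" in out),
]
results = run_evals(tasks, stub_agent)
print(results, pass_rate(results))
```

Tracking the pass rate over the same task set across model versions is what would make a claim like "more of them today than 6-12 months ago" checkable.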

stego-tech 5mo ago

And therein lies the rub for why I still approach this technology with caution, rather than charge in full steam ahead: variable outputs based on immensely variable inputs.

I read stories like yours all the time, and it encourages me to keep trying LLMs from almost all the major vendors (Google being a noteworthy exception while I try and get off their platform). I want to see the magic others see, but when my IT-brain starts digging in the guts of these things, I’m always disappointed at how unstructured and random they ultimately are.

Getting back to the benchmark angle though, we’re firmly in the era of benchmark gaming - hence my quip about these things failing “the only benchmark that matters.” I meant for that to be interpreted along the lines of, “trust your own results rather than a spreadsheet matrix of other published benchmarks”, but I clearly missed the mark in making that clear. That’s on me.

quantumHazer 5mo ago

Seems pretty false if you look at the model card and website of Opus 4.5, which is… (checks notes) their latest model.

verdverm (OP) 5mo ago

Building a good model generally means it will do well on benchmarks too. The point of the speculation is that Anthropic is not focused on benchmaxxing, which is why they have models people like to use for their day-to-day.

I use Gemini. Anthropic stole $50 from me (they expired and kept my prepaid credits) and I have not forgiven them for it yet, but people rave about Claude for coding, so I may try the model again through Vertex AI...

I believe the person who made the speculation was talking more about blog posts and media statements than model cards. Most AI announcements come with benchmark touting; Anthropic supposedly does less or little of this in their announcements. I haven't seen or gathered the data to know what is true.

elcritch 5mo ago

You could try Codex CLI. I prefer it over Claude Code now, but only slightly.

Mistletoe 5mo ago

How do you measure whether it works better day to day without benchmarks?

bulbar 5mo ago

Manually labeling answers, maybe? There exists a lot of infrastructure built around it, as it has been heavily used for two decades, and it's relatively cheap.

That's still benchmarking of course, but not utilizing any of the well known / public ones.

verdverm (OP) 5mo ago

Internal evals. Big AI certainly has good, proprietary training and eval data; it's one reason why their models are better.

aydyn 5mo ago

Then publish the results of those internal evals. Public benchmark saturation isn't an excuse to be un-quantitative.

standardUser 5mo ago

Subscriptions.

mrguyorama 5mo ago

Ah yes, humans are famously empirical in their behavior and we definitely do not have direct evidence of the "best" sports players being much more likely than the average to be superstitious or do things like wear "lucky underwear" or buy right into scam bracelets that "give you more balance" using a holographic sticker.

brokensegue 5mo ago

how do you quantitatively measure day-to-day quality? only thing i can think of is A/B tests, which take a while to evaluate
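
For a sense of why A/B tests "take a while", here is a rough sketch using a standard two-proportion z-test. The function name and the success counts are made up for illustration; "success" could be any binary day-to-day signal, such as a thumbs-up on a response.

```python
import math


def two_proportion_z_test(success_a: int, n_a: int,
                          success_b: int, n_b: int):
    """Return (z, two-sided p-value) for the difference in success rates."""
    p_a, p_b = success_a / n_a, success_b / n_b
    p_pool = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    if se == 0:
        # Both arms are all-success or all-failure: no detectable difference.
        return 0.0, 1.0
    z = (p_b - p_a) / se
    # Standard normal CDF via the error function.
    cdf = 0.5 * (1 + math.erf(abs(z) / math.sqrt(2)))
    return z, 2 * (1 - cdf)


# Same underlying rates (50% vs 56%), different amounts of traffic.
small = two_proportion_z_test(50, 100, 56, 100)
large = two_proportion_z_test(500, 1000, 560, 1000)
print("100 sessions per arm:  z=%.2f p=%.3f" % small)
print("1000 sessions per arm: z=%.2f p=%.3f" % large)
```

With these illustrative numbers, a six-point difference is not distinguishable from noise at 100 sessions per arm but is at 1000, which is the waiting cost the comment refers to.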

verdverm (OP) 5mo ago

more or less this, but also synthetic

if you think about GANs, it's all the same concept

1. train model (agent)

2. train another model (agent) to do something interesting with/to the main model

3. gain new capabilities

4. iterate

You can use a mix of both real and synthetic chat sessions or whatever you want your model to be good at. Mid/late training seems to be where you start crafting personality and expertise.

Getting into the guts of agentic systems has me believing we have quite a bit of runway for iteration here, especially as we move beyond single model / LLM training. I still need to get into what is du jour in RL / late training; that's where a lot of opportunity lies, from my understanding so far.

Nathan Lambert (https://bsky.app/profile/natolambert.bsky.social) from Ai2 (https://allenai.org/) & RLHF Book (https://rlhfbook.com/) put out a really great video yesterday about the experience of training Olmo 3 Think

https://www.youtube.com/watch?v=uaZ3yRdYg8A

HDThoreaun 5mo ago

Arc-AGI is just an IQ test. I don’t see the problem with training a model to be good at IQ tests, because that’s a skill that translates well.

fwip 5mo ago

It is very similar to an IQ test, with all the attendant problems that entails. Looking at the Arc-AGI problems, it seems like visual/spatial reasoning is just about the only thing they are testing.

CamperBob2 5mo ago

Exactly. In principle, at least, the only way to overfit to Arc-AGI is to actually be that smart.

Edit: if you disagree, try actually TAKING the Arc-AGI 2 test, then post.

npinsker 5mo ago

Completely false. This is like saying being good at chess is equivalent to being smart.

Look no farther than the hodgepodge of independent teams running cheaper models (and no doubt thousands of their own puzzles, many of which surely overlap with the private set) that somehow keep up with SotA, to see how impactful proper practice can be.

The benchmark isn’t particularly strong against gaming, especially with private data.

ACCount37 5mo ago

With this kind of thing, the tails ALWAYS come apart, in the end. They come apart later for more robust tests, but "later" isn't "never", far from it.

Having a high IQ helps a lot in chess. But there's a considerable "non-IQ" component in chess too.

Let's assume "all metrics are perfect" for now. Then, when you score people by "chess performance"? You wouldn't see the people with the highest intelligence ever at the top. You'd get people with pretty high intelligence, but extremely, hilariously strong chess-specific skills. The tails came apart.

Same goes for things like ARC-AGI and ARC-AGI-2. It's an interesting metric (isomorphic to the progressive matrix test? usable for measuring human IQ perhaps?), but no metric is perfect - and ARC-AGI is biased heavily towards spatial reasoning specifically.
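
The "tails come apart" effect can be shown with a toy simulation. This is only a sketch under loud assumptions: intelligence and chess-specific skill are modeled as independent standard normal draws, and chess performance is simply their sum. The function name and the parameters are hypothetical and chosen for illustration, not taken from any study.

```python
import random


def top_scorer_stats(n_people: int = 1000, n_trials: int = 200, seed: int = 0):
    """Return (how often the best chess player is also the highest-IQ person,
    average IQ of the best chess player, in standard deviations)."""
    rng = random.Random(seed)
    same_person = 0
    top_chess_iq_total = 0.0
    for _ in range(n_trials):
        # Assumed model: chess = general intelligence + chess-specific skill.
        iq = [rng.gauss(0, 1) for _ in range(n_people)]
        chess = [g + rng.gauss(0, 1) for g in iq]
        best_chess = max(range(n_people), key=chess.__getitem__)
        best_iq = max(range(n_people), key=iq.__getitem__)
        same_person += best_chess == best_iq
        top_chess_iq_total += iq[best_chess]
    return same_person / n_trials, top_chess_iq_total / n_trials


coincide, mean_iq = top_scorer_stats()
print(f"top chess player is also top IQ in {coincide:.0%} of trials")
print(f"average IQ of the top chess player: {mean_iq:+.2f} standard deviations")
```

Under this toy model, the best chess player is well above average in intelligence but is usually not the single most intelligent person in the group, which matches the pattern described above.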

jimbokun 5mo ago

Is it different every time? Otherwise the training could just memorize the answers.

FergusArgyll 5mo ago

It's very much a vision test. The only reason the models don't all pass it easily is the vision component. It doesn't have much to do with reasoning at all.

esafak 5mo ago

I would not be so sure. You can always prep for the test.
