
refulgentis · 1y ago · 0 points

This emphasizes persons and a self-conceived victory narrative over the ground truth.

Models have regularly made progress on it; this is not new with the o-series.

Doing astoundingly well on it, and having a mutually shared PR interest with OpenAI in this instance, doesn't mean a pile of visual puzzles is actually AGI or some well thought out and designed benchmark of True Intelligence(tm). It's one type of visual puzzle.

I don't mean to be negative, but to inject a memento mori. Real story is some guys get together and ride off Chollet's name with some visual puzzles from ye olde IQ test, and the deal was Chollet then gets to show up and say it proves program synthesis is required for True Intelligence.

Getting this score is extremely impressive, but I don't assign more signal to it than to any other benchmark that had some thought put into it.


modeless · 1y ago

Solving ARC doesn't mean we have AGI. Also o3 presumably isn't doing program synthesis, seemingly proving Francois wrong on that front. (Not sure I believe the speculation about o3's internals in the link.)

What I'm saying is that models scoring better on ARC as they get better at reasoning shows that it is measuring something related to reasoning. And nobody else has come up with a comparable benchmark that is so easy for humans and so hard for LLMs, even today, let alone five years ago when ARC was released. ARC was visionary.


stego-tech · 1y ago

I won't be as brutal in my wording, but I agree with the sentiment. This was something drilled into me as someone with a hobby in PC Gaming and Photography: benchmarks, while handy measures of potential capabilities, are not guarantees of real world performance. Very few PC gamers completely reinstall the OS before benchmarking to remove all potential cruft or performance impacts, just as very few photographers exclusively take photos of test materials.

While I appreciate the benchmark and its goals (not to mention the puzzles - I quite enjoy figuring them out), successfully passing this benchmark does not demonstrate or guarantee real world capabilities or performance. This is why I increasingly side-eye this field and its obsession with constantly passing benchmarks and then moving the goal posts to a newer, harder benchmark that claims to be a better simulation of human capabilities than the last one: it reeks of squandered capital and a lack of a viable/profitable product, at least to my sniff test. Rather than simply capitalize on their actual accomplishments (which LLMs are - natural language interaction is huge!), they're trying to prove to Capital that with a few (hundred) billion more in investments, they can make AGI out of this and replace all those expensive humans.

They've built the most advanced prediction engines ever conceived, and insist they're best used to replace labor. I'm not sure how they reached that conclusion, but considering even their own models refute this use case for LLMs, I doubt their execution ability on that lofty promise.

danielmarkbruce · 1y ago

100%. The hype is misguided. I doubt half the people excited about the result have even looked at what the benchmark is.
