undefined | Better HN

0 pointsmodeless1y ago0 comments

Solving ARC doesn't mean we have AGI. Also o3 presumably isn't doing program synthesis, seemingly proving Francois wrong on that front. (Not sure I believe the speculation about o3's internals in the link.)

What I'm saying is the fact that as models are getting better at reasoning they are also scoring better on ARC proves that it is measuring something relating to reasoning. And nobody else has come up with a comparable benchmark that is so easy for humans and so hard for LLMs. Even today, let alone five years ago when ARC was released. ARC was visionary.

0 comments

hdjjhhvvhga1y ago

Your argumentation seems convincing but I'd like to offer a competitive narrative: any benchmark that is public becomes completely useless because companies optimize for it - especially AI that depends on piles of money and they need some proof they are developing.

That's why I have some private benchmarks and I'm sorry to say that the transition from GTP4 to o1 wasn't unambiguously a step forward (in some tasks yes, in some not).

On the other hand, private benchmarks are even less useful to the general public than the public ones, so we have to deal with what we have - but many of us just treat it as noise and don't give it much significance. Ultimately, the models should defend themselves by performing the tasks individual users want them to do.

stonemetal121y ago

Rather any Logic puzzle you post on the internet as something AIs are bad at is in the next round of training data so AIs get better at that specific question. Not because AI companies are optimizing for a benchmark but because they suck up everything.

modelessOP1y ago

ARC has two test sets that are not posted on the Internet. One is kept completely private and never shared. It is used when testing open source models and the models are run locally with no internet access. The other test set is used when testing closed source models that are only available as APIs. So it could be leaked in theory, but it is still not posted on the internet and can't be in any web crawls.

You could argue that the models can get an advantage by looking at the training set which is on the internet. But all of the tasks are unique and generalizing from the training set to the test set is the whole point of the benchmark. So it's not a serious objection.

1 more reply

HarHarVeryFunny1y ago

> o3 presumably isn't doing program synthesis

I'd guess it's doing natural language procedural synthesis, the same way a human might (i.e. figuring the sequence of steps to effect the transformation), but it may well be doing (sub-)solution verification by using the procedural description to generate code whose output can then be compared to the provided examples.

While OpenAI haven't said exactly what the architecture of o1/o3 are, the gist of it is pretty clear - basically adding "tree" search and iteration on top of the underlying LLM, driven by some RL-based post-training that imparts generic problem solving biases to the model. Maybe there is a separate model orchestrating the search and solution evaluation.

I think there are many tasks that are easy enough for humans but hard/impossible for these models - the ultimate one in terms of commercial value would be to take an "off the shelf model" and treat it as an intern/apprentice and teach it to become competent in a entire job it was never trained on. Have it participate in team meetings and communications, and become a drop-in replacement for a human performing that job (any job that an be performed remotely without a physical presence).

refulgentis1y ago

> Solving ARC doesn't mean we have AGI. Also o3 presumably isn't doing program synthesis, seemingly proving Francois wrong on that front.

Agreed.

> And nobody else has come up with a comparable benchmark that is so easy for humans and so hard for LLMs.

? There's plenty.

modelessOP1y ago

I'd love to hear about more. Which ones are you thinking of?

refulgentis1y ago

- "Are You Human" https://arxiv.org/pdf/2410.09569 is designed to be directly on target, i.e. cross cutting set of questions that are easy for humans, but challenging for LLMs, Instead of one type of visual puzzle. Much better than ARC for the purpose you're looking for.

- SimpleBench https://simple-bench.com/ (similar to above; great landing page w/scores that show human / ai gap)

- PIQA (physical question answering, i.e. "how do i get a yolk out of a water bottle", common favorite of local llm enthusiasts in /r/localllama https://paperswithcode.com/dataset/piqa

- Berkeley Function-Calling (I prefer https://gorilla.cs.berkeley.edu/leaderboard.html)

AI search googled "llm benchmarks challenging for ai easy for humans", and "language model benchmarks that humans excel at but ai struggles with", and "tasks that are easy for humans but difficult for natural language ai".

It also mentioned Moravec's Paradox is a known framing of this concept, started going down that rabbit hole because the resources were fascinating, but, had to hold back and submit this reply first. :)

2 more replies

QuantumGood1y ago

Gaming the benchmarks usually needs to be considered first when evaluating new results.

bubblyworld1y ago

I think gaming the benchmarks is encouraged in the ARC AGI context. If you look at the public test cases you'll see they test a ton of pretty abstract concepts - space, colour, basic laws of physics like gravity/magnetism, movement, identity and lots of other stuff (highly recommend exploring them). Getting an AI to do well at all, regardless of whether it was gamed or not, is the whole challenge!

chaps1y ago

Honestly, is gaming benchmarks actually a problem in this space in that it still shows something useful? Just means we need more benchmarks, yeah? It really feels not unlike keggle competitions.

We do the same exact stuff with real people with programming challenges and such where people just study common interview questions rather than learning the material holistically. And since we know that people game these interview type questions, we can adjust the interview processes to minimize gamification.... which itself leads to gamification and back to step one. That's not ideal an ideal feedback loop of course, but people still get jobs and churn out "productive work" out of it.

ben_w1y ago

AI are very good at gaming benchmarks. Both as overfitting and as Goodhart's law, gaming benchmarks has been a core problem during training for as long as I've been interested in the field.

Sometimes this manifests as "outside the box thinking", like how a genetic algorithm got an "oscillator" which was really just an antenna.

It is a hard problem, and yes we still both need and can make more and better benchmarks; but it's still a problem because it means the benchmarks we do have are overstating competence.

2 more replies

j / k navigate · click thread line to collapse

0 comments

hdjjhhvvhga1y ago

That's why I have some private benchmarks and I'm sorry to say that the transition from GTP4 to o1 wasn't unambiguously a step forward (in some tasks yes, in some not).

stonemetal121y ago

modelessOP1y ago

1 more reply

HarHarVeryFunny1y ago

> o3 presumably isn't doing program synthesis

refulgentis1y ago

> Solving ARC doesn't mean we have AGI. Also o3 presumably isn't doing program synthesis, seemingly proving Francois wrong on that front.

Agreed.

> And nobody else has come up with a comparable benchmark that is so easy for humans and so hard for LLMs.

? There's plenty.

modelessOP1y ago

I'd love to hear about more. Which ones are you thinking of?

refulgentis1y ago

- SimpleBench https://simple-bench.com/ (similar to above; great landing page w/scores that show human / ai gap)

- PIQA (physical question answering, i.e. "how do i get a yolk out of a water bottle", common favorite of local llm enthusiasts in /r/localllama https://paperswithcode.com/dataset/piqa

- Berkeley Function-Calling (I prefer https://gorilla.cs.berkeley.edu/leaderboard.html)

2 more replies

QuantumGood1y ago

Gaming the benchmarks usually needs to be considered first when evaluating new results.

bubblyworld1y ago

chaps1y ago

Honestly, is gaming benchmarks actually a problem in this space in that it still shows something useful? Just means we need more benchmarks, yeah? It really feels not unlike keggle competitions.

ben_w1y ago

AI are very good at gaming benchmarks. Both as overfitting and as Goodhart's law, gaming benchmarks has been a core problem during training for as long as I've been interested in the field.

Sometimes this manifests as "outside the box thinking", like how a genetic algorithm got an "oscillator" which was really just an antenna.

It is a hard problem, and yes we still both need and can make more and better benchmarks; but it's still a problem because it means the benchmarks we do have are overstating competence.

2 more replies

j / k navigate · click thread line to collapse