What I'm saying is the fact that as models are getting better at reasoning they are also scoring better on ARC proves that it is measuring something relating to reasoning. And nobody else has come up with a comparable benchmark that is so easy for humans and so hard for LLMs. Even today, let alone five years ago when ARC was released. ARC was visionary.
That's why I have some private benchmarks and I'm sorry to say that the transition from GTP4 to o1 wasn't unambiguously a step forward (in some tasks yes, in some not).
On the other hand, private benchmarks are even less useful to the general public than the public ones, so we have to deal with what we have - but many of us just treat it as noise and don't give it much significance. Ultimately, the models should defend themselves by performing the tasks individual users want them to do.
You could argue that the models can get an advantage by looking at the training set which is on the internet. But all of the tasks are unique and generalizing from the training set to the test set is the whole point of the benchmark. So it's not a serious objection.
I'd guess it's doing natural language procedural synthesis, the same way a human might (i.e. figuring the sequence of steps to effect the transformation), but it may well be doing (sub-)solution verification by using the procedural description to generate code whose output can then be compared to the provided examples.
While OpenAI haven't said exactly what the architecture of o1/o3 are, the gist of it is pretty clear - basically adding "tree" search and iteration on top of the underlying LLM, driven by some RL-based post-training that imparts generic problem solving biases to the model. Maybe there is a separate model orchestrating the search and solution evaluation.
I think there are many tasks that are easy enough for humans but hard/impossible for these models - the ultimate one in terms of commercial value would be to take an "off the shelf model" and treat it as an intern/apprentice and teach it to become competent in a entire job it was never trained on. Have it participate in team meetings and communications, and become a drop-in replacement for a human performing that job (any job that an be performed remotely without a physical presence).
Agreed.
> And nobody else has come up with a comparable benchmark that is so easy for humans and so hard for LLMs.
? There's plenty.
- SimpleBench https://simple-bench.com/ (similar to above; great landing page w/scores that show human / ai gap)
- PIQA (physical question answering, i.e. "how do i get a yolk out of a water bottle", common favorite of local llm enthusiasts in /r/localllama https://paperswithcode.com/dataset/piqa
- Berkeley Function-Calling (I prefer https://gorilla.cs.berkeley.edu/leaderboard.html)
AI search googled "llm benchmarks challenging for ai easy for humans", and "language model benchmarks that humans excel at but ai struggles with", and "tasks that are easy for humans but difficult for natural language ai".
It also mentioned Moravec's Paradox is a known framing of this concept, started going down that rabbit hole because the resources were fascinating, but, had to hold back and submit this reply first. :)
We do the same exact stuff with real people with programming challenges and such where people just study common interview questions rather than learning the material holistically. And since we know that people game these interview type questions, we can adjust the interview processes to minimize gamification.... which itself leads to gamification and back to step one. That's not ideal an ideal feedback loop of course, but people still get jobs and churn out "productive work" out of it.
Sometimes this manifests as "outside the box thinking", like how a genetic algorithm got an "oscillator" which was really just an antenna.
It is a hard problem, and yes we still both need and can make more and better benchmarks; but it's still a problem because it means the benchmarks we do have are overstating competence.