He only seems to have started saying this recently, after OpenAI cracked the ARC-AGI benchmark. But in the original 2019 abstract he said this:
> We argue that ARC can be used to measure a human-like form of general fluid intelligence and that it enables fair general intelligence comparisons between AI systems and humans.
https://arxiv.org/abs/1911.01547
Now, with the release of harder ARC-like benchmarks, he seems to backtrack, implying that the first one didn't actually test for truly general, human-like intelligence.
This sounds a bit like saying that beating humans at chess would require general intelligence -- but then adding, after Deep Blue wins at chess, that chess doesn't actually count as a test for AGI, and that Go is the real AGI benchmark. And after a narrow system beats Go, moving the goalpost to beating Atari, then to StarCraft II, then to Minecraft, etc.
At some point, intuitively real "AGI" will be necessary to beat one of these increasingly difficult benchmarks -- but only because otherwise yet another benchmark would have been invented. That makes these benchmarks mostly post hoc rationalizations.
A better approach would be to ask what went wrong when the very first benchmark was designed, and why the same thing wouldn't happen with the second.
> where he introduced the "Abstraction and Reasoning Corpus for Artificial General Intelligence" (ARC-AGI) benchmark to measure intelligence
So, a high enough score is a threshold to claim AGI. And if you use an LLM to work through these types of problems, it becomes pretty clear that passing more tests indicates a level of "awareness" that goes beyond rational algorithms.
I thought I had seen everything until I started working on some of these problems with agents. I'm still sorta in awe of how the reasoning manifests. (And don't get me wrong, LLMs like Claude still go completely off the rails in situations where even a less intelligent human would know better.)
Can we formalize it as giving the AI a task expressible in, say, n^m bytes of information that encodes a task of n^(m+q) real algorithmic and verification complexity, and then solving that task within certain time, compute, and attempt bounds?
Something that captures "the AI was able to unwind the underlying unspoken complexity of the novel problem".
I feel like one could map a variety of easy human "brain teaser" type tasks to heuristics that fit within some mathematical framework and then grow the formalism from there.
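For concreteness, here's one way to write down roughly what I mean -- all made-up notation, with K only Kolmogorov-flavored, nothing standard:

```latex
% Hypothetical notation (mine, not standard):
% d = the task description, s = a solution the verifier accepts,
% K = a Kolmogorov-style algorithmic complexity measure.
\[
  |d| \le n^{m}
  \quad\text{and}\quad
  K(s \mid d) \ge n^{m+q}, \quad q > 0,
\]
% with the solver required to output s within fixed budgets
% (t_max, c_max, a_max) on wall-clock time, compute, and attempts.
```

The gap q is the "unspoken complexity" the solver has to unwind on its own.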
- OpenAI's o3 counts as "AGI", since it unexpectedly beat the ARC-AGI benchmark, or
- explicitly admit that he was wrong to assume that ARC-AGI would test for AGI.
The benchmark was literally called ARC-AGI. Only after OpenAI cracked it did they start backtracking and saying that it doesn't test for true AGI. Which undermines the whole premise of the benchmark.
But on a serious note, I don't think Chollet would disagree. ARC is a necessary but not sufficient condition, and he says that, despite the unfortunate attention-grabbing name choice of the benchmark. I like Chollet's view that we will know that AGI is here when we can't come up with new benchmarks that separate humans from AI.
A good base test would be to give a manager a mixed team of remote workers, half being human and half being AI, and seeing if the manager or any of the coworkers would be able to tell the difference. We wouldn't be able to say that AI that passed that test would necessarily be AGI, since we would have to test it in other situations. But we could say that AI that couldn't pass that test wouldn't qualify, since it wouldn't be able to successfully accomplish some tasks that humans are able to.
But of course, current AI is nowhere near that level yet. We're left with benchmarks, because we all know how far away we are from actual AGI.
These are all things my kids would do when they were pretty young.
I agree that current AI is nowhere near that level yet. If AI isn't even trying to extract meaning from the words it smiths or the pictures it diffuses then it's nothing more than a cute (albeit useful) parlor trick.
Give the AI tools and let it do real stuff in the world:
"FounderBench": Ask the AI to build a successful business, whatever that business may be - the AI decides. Maybe try to get funded by YC - hiring a human presenter for Demo Day is allowed. They will be graded on profit / loss, and valuation.
Testing a plain LLM on whiteboard-style questions is meaningless now. Going forward, it will all be multi-agent systems with computer use, long-term memory & goals, and delegation.
But conversely, not passing this test is proof of not being as general as human intelligence.
While understanding why a person or an AI is doing what it's doing can be important (particularly in safety contexts), at the end of the day all that's really going to matter to most people is the outcomes.
So if an AI can use what appears to be intelligence to solve general problems and can act in ways that are broadly good for society, whether or not it meets some philosophical definition of "intelligent" or "good" doesn't matter much – at least in most contexts.
That said, my own opinion on this is that the truth is likely in between. LLMs today seem extremely good at being glorified auto-completes, and I suspect most (95%+) of what they do is just recalling patterns in their weights. But unlike traditional auto-completes they do seem to have some ability to reason and solve truly novel problems. As it stands I'd argue that ability is fairly poor, but this might only represent 1-2% of what we use intelligence for.
If I were to guess why this is, I suspect it's not that LLM architecture today is completely wrong, but that the way LLMs are trained means that in general knowledge recall is rewarded more than reasoning. This is similar to the trade-off we humans have with education: do you prioritise the acquisition of knowledge or critical thinking? Many believe critical thinking is more important and should be prioritised more, but I suspect that for the vast majority of tasks we're interested in solving, knowledge storage and recall is actually more important.
The diagnosis is pattern matching (again, roughly). It kinda suggests that a lot of "intelligent" problems boil down to pattern matching, and (relatively straightforward) application of "previous experience". So pattern matching can take us a long way towards AGI.
"We argue that human cognition follows strictly the same pattern as human physical capabilities: both emerged as evolutionary solutions to specific problems in specific evironments" (from page 22 of On the Measure of Intelligence)
Francois explicitly says that's not how ARC is supposed to be interpreted.
Perhaps it's because the representations are fractured. The link above is to the transcript of an episode of Machine Learning Street Talk with Kenneth O. Stanley about the Fractured Entangled Representation hypothesis. [1]
For all we know, human intelligence is just an emergent property of really good pattern matching.
But then, I guess it wouldn't be "overfitting" after all, would it?
However, it does rub me the wrong way - as someone who's cynical of how branding can enable breathless AI hype by bad journalism. A hypothetical comparison would be labelling SHRDLU's (1968) performance on blocks-world planning tasks as "ARC-AGI-(-1)". [0]
A less loaded name like (bad strawman option) "ARC-VeryToughSymbolicReasoning" would capture how the ARC-AGI-n suite is genuinely and intrinsically very hard for current AIs, and what satisfactory performance on the suite would represent. To be fair, Chollet has articulated exactly that, and it has kept him grounded throughout! [1]
[0] https://en.wikipedia.org/wiki/SHRDLU
[1] https://arxiv.org/abs/1911.01547
In practice, when I have seen ARC brought up, it has been discussed with more nuance than any of the other benchmarks.
Unlike Humanity's Last Exam, which is the most egregious example I have seen, both in its naming and in how it gets referenced as a measure of an LLM's capability.
"That's not really AGI because xyz"
What then? The difficulty in coming up with a test for AGI is coming up with something whose passing grade people will actually accept as AGI.
In many respects I feel like all of the claims that models don't really understand or have internal representations or whatever tend to lean on nebulous or circular definitions of the properties in question. Trying to pin the arguments down usually ends up in dualism and/or religion.
Doing what Chollet has done is infinitely better: if a person can easily do something and a model cannot, then there is clearly something significant missing.
It doesn't matter what the property is or what it is called. Such tests might even help us see what those properties are.
Anyone who wants to claim a fundamental inability of these models should be able to provide a task where it is clearly possible to tell when it has been solved, and to show that humans can do it (if that's the bar we are claiming can't be met). If they are right, then no future model should be able to solve that class of problems.
My definition of AGI is the one I was brought up with, not an ever-moving goalpost (moving to the "easier" side).
And no, I also don't buy that we are just stochastic parrots.
But whatever. I've seen many hype cycles, and if I don't die and the world doesn't go to shit, I'll see a few more in the next couple of decades.
Wait, what? Approximately nobody is claiming that "getting a high score on the ARC eval test means we have AGI". It's a useful eval for measuring progress along the way, but I don't think anybody considers it the final word.
Looking at the human side, it takes a while to actually learn something. If you've recently read something, it remains in your "context window". You need to dream about it, to think about it, to revisit and repeat until you actually learn it and "update your internal model". We need a mechanism for continuous weight updating.
Goal-generation is pretty much covered by your body constantly drip-feeding your brain various hormones as "ongoing input prompts".
https://dmf-archive.github.io/docs/posts/beyond-snn-plausibl...
The second highlight from this video is the section from 29 minutes onward, where he talks about designing systems that can build up rich libraries of abstractions which can be applied to new problems. I wish he had lingered more on exploring and explaining this approach, but maybe they're trying to keep a bit of secret sauce because it's what his company is actively working on.
One of the major points which seems to be emerging from recent AI discourse is that the ability to integrate continuous learning will be a key element in building AGI. Context is fine for short tasks, but if lessons are never preserved, the system is severely capped in how far it can go.
There are dozens of ready-made, well-designed, and very creative games there. All are tile-based and solved with only arrow keys and a single action button. Maybe someone should make a PuzzleScript AGI benchmark?
https://nebu-soku.itch.io/golfshall-we-golf
Maybe someone can make an MCP connection for the AIs to practice. But I think the idea of the benchmark is to reserve some puzzles for private evaluation, so that they're not in the training data.
One thing he showed is that you can't have a universe with two omniscient intelligences (as it would be intractable for each to predict the other's behavior).
It's also very questionable whether "humanlike" intelligence is truly general in the first place. I think cognitive neurobiologists would agree that we have a specific "cognitive niche", and while this symbolic niche seems sufficiently general for a lot of problems, there are animals that make us look stupid in other respects. This whole idea that there is some secret sauce special algorithm for universal intelligence is extremely suspect. We flatter ourselves and have committed to a fundamental anthropomorphic fallacy that seems almost cartoonishly elementary for all the money behind it.
Note, however, that the theorem is quite weak. It requires, e.g., the assumption that the search space has no structure. They even give the example of quadratic problems. It strikes me as a mostly useless saying.
[1] https://en.m.wikipedia.org/wiki/No_free_lunch_in_search_and_...
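If I remember the Wolpert & Macready formulation right, the core claim is that, summed uniformly over all objective functions f, any two search algorithms a1 and a2 perform identically:

```latex
% No Free Lunch (Wolpert & Macready, 1997), as I recall it:
% d_m^y is the sequence of m cost values observed by the algorithm.
\[
  \sum_{f} P(d_m^y \mid f, m, a_1) \;=\; \sum_{f} P(d_m^y \mid f, m, a_2)
\]
% The uniform sum over all f is exactly the "no structure" assumption;
% any structured problem class (e.g. quadratics) escapes the theorem.
```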
You can't define AGI, any more than you can define ASA (artificial sports ability). Intelligence, like athleticism, changes both quantitatively and qualitatively. The Greek Olympic champions of 2,000 years ago wouldn't qualify for high school championships today; however, they were once regarded as great athletes.
Getting a perfect ARC-AGI-n score isn't a smoking gun indicator of general intelligence. Rather, it simply means we're now able to solve a class of problems previously beyond AI capabilities (which is exciting in itself!).
I view ARC-AGI primarily as a benchmark (similar in spirit to Raven's matrices) that makes memorization substantially harder. Compare this with vocabulary-focused IQ tests, where cognitive skills certainly matter, but results depend heavily on exposure to a particular language.
But if slime mold symbolic space is better suited for something like understanding biology or abstract math, that's a damn good reason to go for the slime mold route too.
Man, I don't know. A random 10-year-old has general intelligence and ain't gonna do too well on these tests. AGI is not consciousness (I feel like that also gets confused), and general intelligence is not superhuman intelligence.
My impression is that models are pretty bad at interpreting grids of characters. Yesterday, I was trying to get Claude to encode a message with a cipher that maps a 98-character string onto a 7x14 grid, where sequential letters move 2 right and 1 down (i.e., like a knight in chess). Claude seriously struggled.
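For the curious, a minimal sketch of the kind of cipher I mean; the exact wrap-around and collision rules here are my own guesses, not precisely what I asked Claude for:

```python
# Knight-move cipher sketch: place successive characters on a 7x14
# grid, stepping 2 right and 1 down each time (wrapping at the edges),
# and scanning forward to the next free cell on a collision.

def knight_cipher(msg: str, rows: int = 7, cols: int = 14) -> list[list[str]]:
    assert len(msg) == rows * cols, "message must exactly fill the grid"
    grid: list[list[str | None]] = [[None] * cols for _ in range(rows)]
    r = c = 0
    for ch in msg:
        while grid[r][c] is not None:       # collision: scan row-major
            c = (c + 1) % cols
            if c == 0:
                r = (r + 1) % rows
        grid[r][c] = ch
        r, c = (r + 1) % rows, (c + 2) % cols   # the knight-ish step
    return grid
```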
Yet Francois always pumps up the "fluid intelligence" component of this test and emphasizes how easy these tasks are for humans. Humans would presumably be terrible at them too if they had to look at them character by character.
This feels like a somewhat similar case (an intuition lie?) as the Apple paper showing that reasoning models can't do Tower of Hanoi past 10+ disks. Readers will intuitively think about how they themselves could tediously work through an arbitrarily long Tower of Hanoi, which is what the paper is trying to allude to. However, the more appropriate analogy would be writing out all >1000 moves on a piece of paper at once and being 100% correct, which is obviously much harder.
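To make the scale concrete, the canonical recursive solution emits 2^n - 1 moves, so 10 disks is already 1023 moves that all have to come out correct in one shot:

```python
# Standard recursive Tower of Hanoi: 2**n - 1 moves in total.

def hanoi(n: int, src: str = "A", aux: str = "B", dst: str = "C") -> list[tuple[str, str]]:
    if n == 0:
        return []
    return (hanoi(n - 1, src, dst, aux)     # park n-1 disks on the spare peg
            + [(src, dst)]                  # move the largest disk
            + hanoi(n - 1, aux, src, dst))  # stack the n-1 disks back on top

print(len(hanoi(10)))  # 1023 moves, every one of which must be right
```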
I've seen a simple ARC-AGI test that took the open set, and doubled every image in it. Every pixel became a 2x2 block of pixels.
If LLMs were bottlenecked solely by reasoning or logic capabilities, this wouldn't change their performance all that much, because the solution doesn't change all that much.
Instead, the performance dropped sharply - which hints that perception is the bottleneck.
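The transform itself is trivial, which is what makes the drop telling. A sketch, assuming the usual ARC list-of-lists-of-ints JSON grids:

```python
# Double an ARC grid: every cell becomes a 2x2 block. The underlying
# rule to be inferred is unchanged; only the "image" gets bigger.

def upscale_2x(grid: list[list[int]]) -> list[list[int]]:
    out: list[list[int]] = []
    for row in grid:
        doubled = [cell for cell in row for _ in (0, 1)]  # duplicate columns
        out.append(doubled)
        out.append(list(doubled))                         # duplicate the row
    return out
```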
Look at how we learned physics. Aristotelian physics said "an object in motion tends to come to a stop." That looked right most of the time: a bowling ball on sand, grass, or even dirt comes to a stop pretty fast. But once you have a nice smooth marble floor, the ball goes a lot further.
Newtonian physics solved that and several other issues and works fine most of the time, but it has corner cases when going very fast or near a high-gravity object. Then came relativity and the rest.
We need to build a system that we can teach like we do children: one that can reason that something is true under certain circumstances but may not hold generally, and so has to update what "true" is. And that looks like statistics.
So we might say, “General Intelligence is the ability to do the things we haven’t yet thought of.”
“Like what?”
“Well, as soon as I name something it stops counting.”
Gödelian - I like it. Does that mean a constructive definition of General Intelligence is uncomputable?
It churned for >5 minutes and didn't solve the problem.
grep, of course, solved the problem in under a second.
AGI is going to be a while, and your jobs are safe.
Gemini CLI isn't very good
Edit: Also, the more competent models (Opus, and Sonnet to a lesser degree) are good at delegating very complex subtasks that they can blow through, attempt, and then verify in seconds, so I'm not sure hand-crafted regex examples are the best counterexamples here.
The newer code-patching models, which I didn't even take seriously at first, are actually really impressive.
I would have thought of AGI as something that is constantly aware: a biological brain is always on, whereas an LLM is on only briefly while it's inferring.
A biological brain constantly updates itself and adds memories of things. Those memories generally stick around.
It's conceivable (though not likely) that given enough training in symbolic mathematics and some experimental data, an LLM-style AI could figure out a neat reconciliation of the two theories. I wouldn't say that makes it AGI, though. You could achieve that unification with an AI that was limited to mathematics rather than something that can function in many domains like a human can.
But consider: technically, AlphaTensor found new algorithms that no human had found before (https://en.wikipedia.org/wiki/Matrix_multiplication_algorith...). So isn't it AGI by your definition of answering a question no human could before: how to do 4x4 matrix multiplication in 47 multiplications?
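For flavor, this is the kind of decomposition AlphaTensor searches for. Strassen's classic trick does the 2x2 case in 7 multiplications instead of 8; AlphaTensor's 47-multiplication result is the 4x4 analogue (in mod-2 arithmetic):

```python
# Strassen's 2x2 matrix product: 7 scalar multiplications, not 8.

def strassen_2x2(A: list[list[int]], B: list[list[int]]) -> list[list[int]]:
    (a, b), (c, d) = A
    (e, f), (g, h) = B
    m1 = (a + d) * (e + h)
    m2 = (c + d) * e
    m3 = a * (f - h)
    m4 = d * (g - e)
    m5 = (a + b) * h
    m6 = (c - a) * (e + f)
    m7 = (b - d) * (g + h)
    return [[m1 + m4 - m5 + m7, m3 + m5],
            [m2 + m4, m1 - m2 + m3 + m6]]

print(strassen_2x2([[1, 2], [3, 4]], [[5, 6], [7, 8]]))  # [[19, 22], [43, 50]]
```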
https://arxiv.org/abs/1911.01547
GPT-3 didn't come out until 2020.
That said, I'd still listen to these two guys (+ Schmidhuber) more than any other AI guy.
https://news.ycombinator.com/item?id=44492241
My comment was basically instantly flagged. I see at least 3 other flagged comments that I can't imagine deserve to be flagged.
If you see a talk like: "How we will develop diplomacy with the rat-people of TRAPPIST-5." you don't have to make some argument about super-earths and gravity and the rocket equation. You can just point out it's absurd to pretend to know something like whether there are rat-people there.
Either way, it isn't flag-able!
If we assume that humans have "general intelligence", we would assume all humans could ace ARC -- but they can't. Try asking your average person (e.g., supermarket workers, gas station attendants) to do the ARC puzzles; they will do poorly, especially on the newer ones. But AI has to do perfectly to prove it has general intelligence? (Not trying to throw shade here, but the reality is this test is more like an IQ test than an AGI test.)
ARC is a great example of AI researchers moving the goalposts for what we consider intelligent.
Let's get real: Claude Opus is smarter than 99% of people right now, and I would trust its decision-making over that of 99% of people I know in most situations, except perhaps emotionally driven ones.
The ARC-AGI benchmark is just a gimmick. Also, since it's a visual test and the current models are text-based, it's actually rigged against the AI models anyway, since their training datasets were completely text-based.
Basically, it's a test of some kind, but it doesn't mean quite as much as Chollet thinks it means.
If we think humans have "GI" then I think we have AIs right now with "GI" too. Just like humans do, AIs spike in various directions. They are amazing at some things and weak at visual/IQ test type problems like ARC.
I think the debate has been caught flat-footed by the speed at which all this happened. We're not talking about AGI any more; we're talking about how to build superintelligences hitherto unseen in nature.
If AIs at least equal humans in all intellectual fields, then they are superintelligences, because there are already fields where they dominate humans so outrageously that there isn't a competition (nearly all fields, these days). Before they are superintelligences there is a phase where they are just AGIs, and we've been in that phase for a while now. Artificial superintelligence is very exciting, but artificial non-super intelligence, or AGI, is here with us in the present.
I enjoy seeing people repeatedly move the goalposts for "intelligence" as AIs simply get smarter and smarter every week. Soon AI will have to beat Einstein in Physics, Usain Bolt in running, and Steve Jobs in marketing to be considered AGI...