Mainframes -> desktop computers -> a computer in every hand
Obese LLMs you visit -> agents riding with you wherever you are, integrated into your life and things -> everything everywhere, max specialization and distribution into every crevice, dominance over most tasks whether you're there or active or not
They haven't even really started working together yet. They're still largely living in sandboxes. We're barely out of the first inning. Pick a field you can name, e.g. aircraft/flight, and it's likely hardly even at the first pitch.
In hindsight people will (jokingly?) wonder whether AI self-selected software development as one of its first conquests, as the ultimate foot in the door so it could pursue dominion over everything else (of course it had to happen in that progression; it'll prompt some chicken or the egg debates 30-50 years out).
Many, many people around the world die all the time from easily curable and preventable diseases; we just choose not to save them. This is largely not a technology problem. Just look at PEPFAR, which saved tens of millions of lives from HIV/AIDS. We just decided to stop funding it: https://en.wikipedia.org/wiki/President%27s_Emergency_Plan_f...
Is it possible that reason could emerge as the byproduct of being really good at predicting words? Maybe, but this depends on the antecedent claim that much if not all of reason is strictly representational and strictly linguistic. It's not obvious to me that this is the case. Many people think in images as direct sense data, and it's not clear that a digital representation of this is equivalent to the thing in itself.
To use an example another HN'er suggested, we don't claim that submarines are swimming. Why are we so quick to claim that LLMs are "reasoning"?
Imagine we had such marketing behind wheels — they move, so they must be like legs on the inside. Then we run around imagining what the blood vessels and bones must look like inside the wheel. Never mind that neither the structure nor the procedure has anything to do with legs whatsoever.
Sadly, whoever named it artificial intelligence and neural networks likely knew exactly what they were doing.
I'm with you on this. Software engineers talk about being in the flow when they are at their most productive. For me, the telltale sign of being in the flow is that I'm no longer thinking in English, but I'm somehow navigating the problem / solution space more intuitively. The same thing happens in many other domains. We learn to walk long before we have the language for all the cognitive processes required. I don't think we deeply understand what's going on in these situations, so how are we going to build something to emulate it? I certainly don't consciously predict the next token, especially when I'm in the flow.
And why would we try to emulate how we do it? I'd much rather have technology that complements. I want different failure modes and different abilities so that we can achieve more with these tools than we could by just adding subservient humans. The good news is that everything we've built so far is succeeding at this!
We'll know that society is finally starting to understand these technologies and how to apply them when we are able to get away from using science fiction tropes to talk about them. The people I know who develop LLMs for a living, and the others I know that are creating the most interesting applications of them, already talk about them as tools without any need to anthropomorphize. It's sad to watch their frustration as they are slowed down every time a person in power shows up with a vision based on assumptions of human-like qualities rather than a vision informed by the actual qualities of the technology.
Maybe I'm being too harsh or impatient? I suppose we had to slowly come to understand the unique qualities of a "car" before we could stop limiting our thinking by referring to it as a "horseless carriage".
On a more general level, I also never understood this urge to build machines that are "just like us". Like you I want machines that, arguably, are best characterized by the ways in which they are not like us—more reliable, more precise, serving a specific function. It's telling that critiques of the failures of LLMs are often met with "humans have the same problems"—why are humans the bar? We have plenty of humans. We don't need more humans. If we're investing so much time and energy, shouldn't the bar be better than humans? And if it isn't, why isn't it? Oh, right, it's because actually human error is good enough and the actual benefit of these tools is that they are humans that can work without break, don't have autonomy, and that you don't need to listen to or pay. The main beneficiaries of this path are capital owners who just want free labor. That's literally all this is. People who actually want to build stuff want precision machines that are tailored for the task at hand, not some grab bag of sort-of-works-sometimes stochastic doohickeys.
This is true of course in a pointlessly rhetorical sense.
Completely absurd though once we change "swimming" to the more precise "moving through water".
The solution is not to put arms and legs on the submarine so it can ACTUALLY swim.
It would be quite trivial to make a Gary Marcus style argument that humans still can't fly. We would need much longer and wider arms, much less core body mass, and feathers.
Most of these newer models are multi-modal, so tokens aren't necessarily linguistic.
The mechanism by which they work prohibits reasoning.
This is easy to see if you look at a transformer architecture and think through what each step is doing.
The amazing thing is that they produce coherent speech, but they literally can't reason.
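To make "what each step is doing" concrete, here's a rough NumPy sketch of a single transformer block (heavily simplified: one head, no layer norm, no masking; the names and dimensions are mine, purely for illustration):

    import numpy as np

    def softmax(x, axis=-1):
        e = np.exp(x - x.max(axis=axis, keepdims=True))
        return e / e.sum(axis=axis, keepdims=True)

    def transformer_block(x, Wq, Wk, Wv, Wo, W1, W2):
        # x: (seq_len, d_model) token representations
        q, k, v = x @ Wq, x @ Wk, x @ Wv          # project tokens to queries/keys/values
        scores = q @ k.T / np.sqrt(k.shape[-1])   # how strongly each token attends to the others
        attended = softmax(scores) @ v            # softmax-weighted average of value vectors
        x = x + attended @ Wo                     # residual connection around attention
        x = x + np.maximum(0, x @ W1) @ W2        # position-wise feed-forward (ReLU)
        return x

    # toy run: 4 tokens, model width 8
    rng = np.random.default_rng(0)
    d = 8
    x = rng.normal(size=(4, d))
    params = [0.1 * rng.normal(size=(d, d)) for _ in range(4)]
    params += [0.1 * rng.normal(size=(d, 4 * d)), 0.1 * rng.normal(size=(4 * d, d))]
    print(transformer_block(x, *params).shape)    # (4, 8)

Whether you read those matrix multiplies and softmax-weighted averages as "just statistics" or as the substrate of something reasoning-like is exactly where people disagree.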
What's more, his actual point is unclear. Even if you simply grant, "okay, even SOTA LLMs don't have world models", why do I as a user of these models care? Because the models could be wrong? Yes, I'm aware. Nevertheless, I'm still deriving substantial personal and professional value from the models as they stand today.
Both statistical data generators and actual reasoning are useful in many circumstances, but there are also circumstances in which thinking that you are doing the latter when you are only doing the former can have severe consequences (example: building a bridge).
If nothing else, his perspective is a counterbalance to what is clearly an extreme hype machine that is doing its utmost to force adoption through overpromising, false advertising, etc. These are bad things even if the tech does actually have some useful applications.
As for benchmarks, if you fundamentally don't believe that stochastic data generation leads to reason as an emergent property, developing a benchmark is pointless. Also, not everyone has to be on the same side. It's clear that Marcus is not a fan of the current wave. Asking him to produce a substantive contribution that would help them continue to achieve their goals is preposterous. This game is highly political too. If you think the people pushing this stuff are less than estimable or morally sound, you wouldn't really want to empower them or give them more ideas.
In other words, overhyped in the short term, underhyped in the long term. Where short and long term are extremely volatile.
Take programming as an example. 2.5 years ago, GPT-3.5 was seen as "cute" in the programming world. Oh, look, it does poems and e-mails, and the code looks like Python but it's wrong 9 times out of 10. But now a 24B model can handle end-to-end SWE tasks zero-shot a lot of the time.
To use chess as an example: humans sometimes play illegal moves. That does not mean humans cannot reason. It is an instance of failing to show proof of reasoning, not a proof of the inability to reason.
I find it astonishing that people pay any attention to Gary Marcus and doubly so here. Whether or not you are an “AI optimist”, he clearly is just a bloviator.
https://www.anthropic.com/news/tracing-thoughts-language-mod...
I don't see much reason why future AI couldn't do that rather than just focusing on language though.
Understanding may not be a static symbolic representation. The contexts of the world are infinite and continuously redefined. We believed we could represent all contexts tied to information, but that's a tough call.
Yes, we can approximate. No, we can't completely say we can represent every essential context at all times.
Some things might not be representable at all by their very chaotic nature.
Hm.
Dead reckoning is a terrible way to navigate, and famously led to lots of ships being wrecked on the shore of France before good clocks allowed tracking longitude accurately.
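A toy illustration of why (my own sketch, arbitrary numbers): integrate slightly noisy heading and speed estimates and the position error just keeps growing, because there's no external fix to correct against.

    import math, random

    random.seed(1)
    est_x = est_y = true_x = true_y = 0.0
    heading, speed = 0.0, 10.0              # due east, arbitrary units per step

    for step in range(100):
        # true motion
        true_x += speed * math.cos(heading)
        true_y += speed * math.sin(heading)
        # dead-reckoned estimate: compass and log are each a little off every step
        h = heading + random.gauss(0, 0.02)
        s = speed * (1 + random.gauss(0, 0.02))
        est_x += s * math.cos(h)
        est_y += s * math.sin(h)

    print(f"drift after 100 steps: {math.hypot(est_x - true_x, est_y - true_y):.1f} units")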
Ants lay down pheromone trails and use smell to find their way home... There's likely some additional tracking going on, but I would be surprised if it looked anything like symbolic GOFAI.
Many animals detect and interpret smells as chemical gradients. We don't have the hardware for it, but plenty of others do.
But some words are redacted. So I've uploaded the picture to Gemini and asked it what the redacted words are, and it told me. Not sure if they are correct, and some are way too long to fit in the redacted black box, but it didn't refuse the request.
https://arxiv.org/abs/2506.01622
Are world models a necessary ingredient for flexible, goal-directed behaviour, or is model-free learning sufficient? We provide a formal answer to this question, showing that any agent capable of generalizing to multi-step goal-directed tasks must have learned a predictive model of its environment. We show that this model can be extracted from the agent's policy, and that increasing the agent's performance or the complexity of the goals it can achieve requires learning increasingly accurate world models. This has a number of consequences: from developing safe and general agents, to bounding agent capabilities in complex environments, and providing new algorithms for eliciting world models from agents.
On my reading, the philosophical claim is that these models do not develop an actual logical, internal representation of domains.
The functional import is whether or not they are able to realize specific behaviors within a domain. The paper argues that a Markov process can realize the functional equivalence of the initial goal-oriented picture of its domain—that is, it can solve goals within an error bound—but not that it develops an actual representation of the domain.
Lack of an actual representation prevents such a machine from doing other things. For example, iiuc, it would be unable to solve problems in domains that are homomorphic to the original, while an explicit representation does enable this.
The lack of a world model is a very real limitation in some problem spaces, starting with arithmetic. But this argument is unconvincing.
The question is not whether or not they have any model at all, the question is whether the model they indisputably have (which is a model of language in terms of linear algebra) maps onto a model of the external universe (a “world model”) that emerges during training.
This is pretty much an unfalsifiable question as far as I can see. There has been research that aims to show this one way or another and it doesn’t settle the question of what a “world model” even means if you permit a “world model” to mean anything other than “thinks like we do”.
For example, LLMs have been shown to produce code that can make graphics somewhat in the style of famous modern artists (eg Kandinsky and Mondrian) but fail at object-stacking problems (“take a book, four wine glasses, a tennis ball, a laptop and a bottle and stack them in a stable arrangement”). Depending on the objects you choose the LLM either succeeds or fails (generally in a baffling way). So what does this mean? Clearly the model doesn’t “know” the shape of various 3-D objects (unless the problem is in their training set which it sometimes seems to be) but on the other hand seems to have shown some ability to pastiche certain visual styles. How is any of this conclusive? A baby doesn’t understand the 3-D world either. A toddler will try and fail to stack things in various ways. Are they showing the presence or lack of a world model? How do you tell?
But that doesn't mean that we can't, in theory, give the LLM a battery of tests on which it should perform well (though not perfectly) if it has a world model, and poorly (though not fail outright) if it doesn't.
It's inherently a probabilistic system, so testing it in a probabilistic manner seems perfectly apt. Again: no, this will not produce a definitive result, due to that probabilistic nature—but it can produce an indicative one, and running the same test on multiple related LLMs, or similar tests on the same LLM, should help to smooth out noise in the results.
(...of course, this only works if the tests are designed well, and I don't have enough specific understanding of LLMs to know how one would go about doing that in a rigorous manner!)
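For what it's worth, the harness itself is the easy part. A rough sketch of the shape such a battery might take (query_model and the probes here are placeholders I made up, not a real API):

    from collections import defaultdict

    def query_model(model_name: str, prompt: str) -> str:
        """Placeholder: call whatever LLM API you actually use."""
        raise NotImplementedError

    # (prompt, checker) pairs; each checker encodes what a "world-model-ish" answer looks like
    PROBES = [
        ("Stack a book, four wine glasses, a tennis ball, a laptop and a bottle "
         "into a stable arrangement. List the order from bottom to top.",
         lambda answer: answer.strip().lower().startswith("book")),
        # ... more probes: object permanence, spatial layout, simple physics, etc.
    ]

    def pass_rates(model_name: str, trials: int = 20) -> dict:
        hits = defaultdict(int)
        for prompt, check in PROBES:
            for _ in range(trials):
                if check(query_model(model_name, prompt)):
                    hits[prompt] += 1
        return {prompt: hits[prompt] / trials for prompt, _ in PROBES}

    # Smooth out noise by comparing related models and similar probes:
    # for m in ("model-a", "model-a-large", "model-b"):
    #     print(m, pass_rates(m))

The hard part, as you say, is designing probes whose pass rates actually discriminate "has a world model" from "doesn't", rather than just measuring familiarity with the training set.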
Obviously false for any useful sense by which you might operationalize "world model". But agree re: being a black box and having a world model being orthogonal.