Most dynamics of the physical world are sparse, non-linear systems at every level of resolution. Most ways of constructing accurate models mathematically don’t actually work. LLMs, for better or worse, are pretty classic (in an algorithmic information theory sense) sequential induction problems. We’ve known for well over a decade that you cannot cram real-world spatial dynamics into those models. It is a clear impedance mismatch.
There are a bunch of fundamental computer science problems that stand in the way, which I was schooled on in 2006 from the brightest minds in the field. For example, how do you represent arbitrary spatial relationships on computers in a general and scalable way? There are no solutions in the public data structures and algorithms literature. We know that universal solutions can’t exist and that all practical solutions require exotic high-dimensionality computational constructs that human brains will struggle to reason about. This has been the status quo since the 1980s. This particular set of problems is hard for a reason.
I vigorously agree that the ability to reason about spatiotemporal dynamics is critical to general AI. But the computer science required is so different from classical AI research that I don’t expect any pure AI researcher to bridge that gap. The other aspect is that this area of research became highly developed over two decades but is not in the public literature.
One of the big questions I have had since they announced the company, is who on their team is an expert in the dark state-of-the-art computer science with respect to working around these particular problems? They risk running straight into the same deep, layered theory walls that almost everyone else has run into. I can’t identify anyone on the team that is an expert in a relevant area of computer science theory, which makes me skeptical to some extent. It is a nice idea but I don’t get the sense they understand the true nature of the problem.
Nonetheless, I agree that it is important!
"Spatial awareness" itself is kind of a simplification: the idea that you can be aware of space or 3d objects' behavior without the social context of what an "object" is or how it relates to your own physical existence. Like you could have two essentially identical objects but they are not interchangeable (original Declaration of Independence vs a copy, etc). And many many other borderline-philosophical questions about when an object becomes two, etc.
…yet.
15 years ago LLMs as they are today seemed like science fiction too.
Why wouldn't it be? If the world is ingressed via video and lidar sensors, what's the hangup in recording such input and then replaying it faster?
You need something that mostly works most of the time, and has guardrails so when it makes mistakes nothing bad happens.
Our brains acquire quite good heuristics for dealing with physical space without needing to experience all of physical reality.
A cat-level or child-level understanding of physical space is more immediately useful than a philosopher-level of understanding.
This made me a bit curious. Would you have any pointers to books/articles/search terms if one wanted to have a bit deeper look on this problem space and where we are?
At its root it is a cutting problem, like graph cutting but much more general because it includes things like non-trivial geometric types and relationships. Solving the cutting problem is necessary to efficiently shard/parallelize operations over the data models.
For classic scalar data models, representations that preserve the relationships have the same dimensionality as the underlying data model. A set of points in 2-dimensions can always be represented in 2-dimensions such that they satisfy the cutting problem (e.g. a quadtree-like representation).
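For concreteness, here is a toy point quadtree: each recursive split is exactly a "cut" that partitions the points without breaking any point-to-region relationship. Class and method names are illustrative, not from any particular library.

```python
class QuadTree:
    """Toy 2-D point quadtree whose recursive splits act as cuts."""

    def __init__(self, x0, y0, x1, y1, capacity=4):
        self.bounds = (x0, y0, x1, y1)   # half-open region [x0,x1) x [y0,y1)
        self.capacity = capacity
        self.points = []
        self.children = None             # four sub-quadrants once split

    def insert(self, x, y):
        x0, y0, x1, y1 = self.bounds
        if not (x0 <= x < x1 and y0 <= y < y1):
            return False                 # point lies outside this node
        if self.children is None:
            self.points.append((x, y))
            if len(self.points) > self.capacity:
                self._split()
            return True
        return any(c.insert(x, y) for c in self.children)

    def _split(self):
        # Cut the region into four quadrants and push points down.
        x0, y0, x1, y1 = self.bounds
        mx, my = (x0 + x1) / 2, (y0 + y1) / 2
        self.children = [
            QuadTree(x0, y0, mx, my, self.capacity),
            QuadTree(mx, y0, x1, my, self.capacity),
            QuadTree(x0, my, mx, y1, self.capacity),
            QuadTree(mx, my, x1, y1, self.capacity),
        ]
        for p in self.points:
            any(c.insert(*p) for c in self.children)
        self.points = []
```

The half-open bounds guarantee every point lands in exactly one child, which is the property the cutting argument above relies on.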
For non-scalar types like rectangles, operations like equality and intersection are distinct and there are an unbounded number of relationships that must be preserved that touch on concepts like size and aspect ratio to satisfy cutting requirements. The only way to expose these additional relationships to cutting algorithms is to encode and embed these other relationships in a (much) higher dimensionality space and then cut that space instead.
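A minimal sketch of that lifting idea, for the simplest possible case (axis-aligned rectangles only; function names are mine): each rectangle becomes a single point in 4-D, and equality, intersection, and containment become coordinate predicates on those 4-D points, which point-partitioning machinery can then cut. The genuinely hard constructs alluded to above go far beyond this.

```python
def lift(rect):
    """Embed an axis-aligned rectangle (x0, y0, x1, y1) as a 4-D point."""
    x0, y0, x1, y1 = rect
    return (x0, y0, x1, y1)

def equal(p, q):
    # rectangle equality becomes point identity in the lifted space
    return p == q

def intersects(p, q):
    # rectangles overlap iff these coordinate inequalities hold in 4-D
    return p[0] < q[2] and q[0] < p[2] and p[1] < q[3] and q[1] < p[3]

def contains(p, q):
    # p contains q iff q's lifted point lies in a corner region of p's
    return p[0] <= q[0] and p[1] <= q[1] and q[2] <= p[2] and q[3] <= p[3]
```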
The mathematically general case isn't computable but real-world data models don't need it to be. Several decades ago it was determined that if you constrain the properties of the data model tightly enough then it should be possible to systematically construct a finite high-dimensionality embedding for that data model such that it satisfies the cutting problem.
Unfortunately, the "should be possible" understates the difficulty. There is no computer science literature for how one might go about constructing these cuttable embeddings, not even for a narrow subset of practical cases. The activity is also primarily one of designing data structures and algorithms that can represent complex relationships among objects with shape and size in dimensions much greater than three, which is cognitively difficult. Many smart people have tried and failed over the years. It has a lot of subtlety and you need practical implementations to have good properties as software.
About 20 years ago, long before "big data", the iPhone, or any current software fashion, this and several related problems were the subject of an ambitious government research program. It was technically successful, demonstrably. That program was killed in the early 2010s for unrelated reasons and much of that research was semi-lost. It was so far ahead of its time that few people saw the utility of it. There are still people around that were either directly involved or learned the computer science second-hand from someone that was but there aren't that many left.
> We’ve known for well over a decade that you cannot cram real-world spatial dynamics into those models. It is a clear impedance mismatch

What's the source that this is a physically impossible problem? Not sure what you mean by impedance mismatch, but do you mean that it is unsolvable even with better techniques?
Your whole third paragraph could have been said about LLMs and isn't specific enough, so we'll skip that.
I don't really understand the other 2 paragraphs: what's this "dark state-of-the-art computer science" you speak of, what is this "area of research [that] became highly developed over two decades but is not in the public literature", and how is "the computer science required ... so different from classical AI research"?
Yes, classic LLMs (like GPT) operate as sequence predictors with no inductive bias for space, causality, or continuity. They're optimized for language fluency, not physical grounding. But multimodal models like ViT, Flamingo, and Perceiver IO are a completely different lineage, even if they use transformers under the hood. They tokenize images (or video, or point clouds) into spatially-aware embeddings and preserve positional structure in ways that make them far more suited to spatial reasoning than pure text LLMs.
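As a rough mechanical sketch of what "tokenize images into spatially-aware embeddings" means (shapes and names here are illustrative, not from any specific ViT implementation): the image is split into patches, each patch is linearly projected, and a positional embedding is added so the token sequence retains the 2-D layout.

```python
import numpy as np

def patchify(img, patch=4):
    """(H, W, C) image -> (num_patches, patch*patch*C) flat patches."""
    H, W, C = img.shape
    rows = [img[i:i+patch, j:j+patch].reshape(-1)
            for i in range(0, H, patch)
            for j in range(0, W, patch)]
    return np.stack(rows)

def embed(img, patch=4, dim=32, rng=np.random.default_rng(0)):
    patches = patchify(img, patch)
    n, p = patches.shape
    W_proj = rng.normal(size=(p, dim)) / np.sqrt(p)  # linear projection
    pos = rng.normal(size=(n, dim))  # stand-in for learned pos. embedding
    return patches @ W_proj + pos    # (n, dim) spatially-tagged tokens

tokens = embed(np.zeros((16, 16, 3)))  # 16 patches of a 16x16 RGB image
```

The positional term is what distinguishes this from a bag of patches: attention layers downstream can recover "where" as well as "what".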
The supposed “impedance mismatch” is real for language-only models, but that’s not the frontier anymore. The field has already moved into architectures that integrate vision, text, and action. Look at Flamingo's vision-language fusion, or GPT-4o’s real-time audio-visual grounding — these are not mere LLMs with pictures bolted on. These are spatiotemporal attention systems with architectural mechanisms for cross-modal alignment.
You're also asserting that "no general-purpose representations of space exist" — but this neglects decades of work in computational geometry, graphics, physics engines, and more recently, neural fields and geometric deep learning. Sure, no universal solution exists (nor should we expect one), but practical approximations exist: voxel grids, implicit neural representations, object-centric scene graphs, graph neural networks, etc. These aren't perfect, but dismissing them as non-existent isn’t accurate.
Finally, your concern about who on the team understands these deep theoretical issues is valid. But the fact is: theoretical CS isn’t the bottleneck here — it’s scalable implementation, multimodal pretraining, and architectural experimentation. If anything, what we need isn’t more Solomonoff-style induction or clever data structures — it’s models grounded in perception and action.
The real mistake isn’t that people are trying to cram physical reasoning into LLMs. The mistake is in acting like all transformer models are LLMs, and ignoring the very active (and promising) space of multimodal models that already tackle spatial, embodied, and dynamical reasoning problems — albeit imperfectly.
So this is the place where we must look. It starts with sensing and the integration of that sensing. I have been working on this problem for more than 10 years and have come to some results. I am not a real scientist but a true engineer, and I am looking at it from that perspective quite intensely. The question one must ask is: how do you define the outside physical world from the perspective of a biological sensing "device"? What exactly are we "seeing" or "hearing"? So yes, working on that brought me further in defining the physical world.
Once this layer of "natural eye automat" is programmed behind a camera, it will spit out crude geometry: the Spatial Data Bulk (SDB). This SDB is small data.
From then on, our programs reason not on raw data from camera(s) but only on this small SDB.
This is how I see it.
The traditional metaphor of movement — stepping from point A to point B — is spatially intuitive but semantically impoverished. It ignores the continuity of direction, the embodiment of motion, and the nontriviality of turning. Quaternion-based traversal reintroduces these elements. It is not just more precise; it is more faithful to the mechanisms by which physical and virtual entities evolve through space. In other words objects 'become' the model.
https://github.com/VoxleOne/SpinStep/blob/main/docs/index.md
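For readers unfamiliar with the idea, here is a minimal pure-Python sketch of quaternion rotation, the primitive behind this kind of traversal: the traverser carries an orientation quaternion and advances by rotating it rather than translating between points. The actual SpinStep API may differ; function names here are mine.

```python
import math

def qmul(a, b):
    """Hamilton product of quaternions (w, x, y, z)."""
    w1, x1, y1, z1 = a
    w2, x2, y2, z2 = b
    return (w1*w2 - x1*x2 - y1*y2 - z1*z2,
            w1*x2 + x1*w2 + y1*z2 - z1*y2,
            w1*y2 - x1*z2 + y1*w2 + z1*x2,
            w1*z2 + x1*y2 - y1*x2 + z1*w2)

def from_axis_angle(axis, angle):
    """Unit quaternion for a rotation of `angle` radians about `axis`."""
    ax, ay, az = axis
    s = math.sin(angle / 2)
    return (math.cos(angle / 2), ax*s, ay*s, az*s)

def rotate(q, v):
    """Rotate vector v by unit quaternion q (computes q * v * q^-1)."""
    w, x, y, z = q
    p = qmul(qmul(q, (0.0, *v)), (w, -x, -y, -z))
    return p[1:]
```

A "step" then becomes: compose the current orientation with a small rotation, and move along the rotated forward axis, so direction is carried continuously instead of being recomputed from point pairs.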
There is hope. Experimental observation is that, in most cases, the coupled high-dimensional dynamics almost collapses onto low-dimensional attractors.
The interesting thing about these is: If we apply a measurement function to their state and afterwards reconstruct a representation of their dynamics from the measurement by embedding, we get a faithful representation of the dynamics with respect to certain invariants.
Even better, suitable measurement functions are dense in function space so we can pick one at random and get a suitable one with probability one.
What can be gleaned about the dynamics in terms of these invariants can be learned for certain; experience shows that we can usually also predict quite well.
There is a chain of embedding theorems by Takens and Sauer gradually broadening the scope of applicability from deterministic chaos towards stochastically driven deterministic chaos.
Note embedding here is not what current computer science means by the word.
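It means delay-coordinate reconstruction in the Takens/Sauer sense: stack time-lagged copies of a single scalar measurement to rebuild a faithful picture of the state space. A toy sketch, using a sine wave as a stand-in for the measurement series:

```python
import numpy as np

def delay_embed(x, dim=3, lag=1):
    """Scalar series x -> delay vectors, shape (len(x)-(dim-1)*lag, dim).

    Row i is (x[i], x[i+lag], ..., x[i+(dim-1)*lag]): the delay-coordinate
    reconstruction of the underlying state at time i.
    """
    n = len(x) - (dim - 1) * lag
    return np.column_stack([x[i * lag: i * lag + n] for i in range(dim)])

t = np.linspace(0, 8 * np.pi, 400)
series = np.sin(t)              # the scalar "measurement"
vecs = delay_embed(series, dim=3, lag=5)
```

For a real system one would take the measurement from an experiment and choose `dim` and `lag` with care (false-nearest-neighbours, mutual information); the sine here only shows the mechanics.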
I spent most of my early adulthood doing these things; it would be cool to see them used once more.
Bill Peebles is right, naturalistic, physical laws can be learned in deep neural nets from videos.
OR
Fei-Fei Li is right, you need 3D point cloud videos.
Okay, if you think Bill Peebles is right, then all this stuff you are talking about doesn't matter anymore. Lots of great reasons Bill Peebles is probably right, biggest reason of all is that Veo, Sora etc. have really good physics understanding.
If you think Fei-Fei Li is right, you are going to be augmenting real world data sets with game engine content. You can exactly create whatever data you need, for whatever constraints, to train performantly. I don't think this data scalability concern is real.
A compelling reason you are wrong and Fei-Fei Li's specific bet on scalability is right is the existence of Waymo and Zoox. There are also NEW autonomous vehicle companies achieving things faster than Zoox and Waymo did, because a lot of spatial intelligence problems are actually regulatory/political, not scientific.
Where can I read more about this space? (particularly on the "we know that universal solutions can't exist" front)
Then again, not much that we "knew" a decade ago is still relevant today. Of course transformer networks have proven capable of representing spatial intelligence. How could they work with 2D images, if not?
Also "impedance mismatch" doesn't mean no go, but rather less efficient.
Developed by who? And for what purpose? Are we talking about overlap with stuff like missile guidance systems or targeting control systems or something, and kept confidential by the military-industrial complex? I'm having a hard time seeing many other scenarios that would explain a large body of people doing research in this area and then not publishing anything.
> I can’t identify anyone on the team that is an expert in a relevant area of computer science theory
Who is an expert on this theory then?
Isn't this essentially what the convolutional layers do in LeNet?
>We know that universal solutions can’t exist
Why not?
To reason spatially (and dynamically), the dependence of one object's position in space on other objects (and their motions and behaviors) adds up fast, complicating the model in ways that 95% of 2D static image analysis does not.
To me it's totally obvious that we will have a plethora of very valuable startups who use RL techniques to solve real-world problems in practical areas of engineering .. and I just get blank stares when I talk about this :]
I've stopped saying AI when I mean ML or RL .. because people equate LLMs with AI.
We need better ML / RL algos for CV tasks :
- detecting lines from pixels
- detecting geometry in pointclouds
- constructing 3D from stereo images, photogrammetry, 360 panoramas
These might be used by LLMs but are likely built using RL or 'classical' ML techniques, tapping into the vast parallel matmul compute we now have in GPUs, multicore CPUs, and NPUs.

For me it is more something like: source = crude video or photo pixels ===> find many simple rectangle surfaces that are glued to one another. This, for me, is how you most easily get to detecting the rather complex geometry of any room.

Also, LLMs really suck at some basic tasks like counting the sides of a polygon.
Oh indeed, but that's not using tokens correctly. If you want to do that, then tokenise the number of polygons....
Not in the defense sector, or aviation, UAVs, automotive, etc. Any proper real-time vision task where you have to computationally interact with visual data is unsuited to LLMs.
Nobody controls a drone, missile, or vehicle by taking a screenshot, sending it to ChatGPT, and having it do math mid-flight; anything that requires, as the title of the thread says, spatial intelligence is unsuited to a language model.
What's the equivalent of destroying everything around you while chasing another high, but for reckless VC?
I'm hopeful that VLMs will "fan out" into a lot of positive outcomes for computer vision.
On the other hand I just chatted with Opus 4 for the first time a few minutes ago and I am completely blown away.
Most people have proprioception - you know where the parts of your body are without looking. Close your eyes and you intuitively know where your hands and fingers are.
When parking a car, it helps to sort of sit in the driver's seat and look around the car. Turn your neck and look past the back seat to where your rear tire would be. Sense the edges of the car.
I think if you sort of develop this a bit you might "feel" where your car is intuitively when pulling into a parking space or parallel parking. (car-prioception?)
(but use your mirrors and backup camera anyway)
It's made me realize that objects are much further from the boundaries of my car when backing into a spot or parallel parking. I would never think to get so close to another car if I had to rely only on my own senses.
With that said, I realize there's a significant number of people that are even poorer estimators of these distances than myself. I.e. those that won't pass through two cars even though to me it's obvious that they could easily pass.
I have to imagine a big part of this has to do with risk assessment and lack of risk-free practice opportunity IRL. Nobody is seeing how far they can push or train themselves in this regard when the consequences are to scratch up your car and others' cars. With the birdseye view I can actually do that now!
I have aphantasia but I would say that spatial reasoning is one of the things my brain is the best at
Either that or they're perfectly capable, they just don't care.
Will add condensed version here in half an hour.
I've made some progress on a PoC in 3D reconstruction, detecting planes, edges, and pipes in pointclouds from lidar scans, e.g.: https://youtu.be/-o58qe8egS4 .. and am bootstrapping with in-house gigs as I build out the product.
Essentially it breaks down to a ton of matmuls, and I use a lot of tricks from pre-LLM ML .. this is a domain that perfectly fits RL.
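To make that concrete, here is a toy RANSAC plane detector in a few dozen lines: repeatedly fit a plane to three random points and keep the fit with the most inliers. Real scan-to-CAD pipelines are far more involved; names here are illustrative.

```python
import numpy as np

def fit_plane(p0, p1, p2):
    """Plane n.x + d = 0 through three points, or None if degenerate."""
    n = np.cross(p1 - p0, p2 - p0)
    norm = np.linalg.norm(n)
    if norm < 1e-12:
        return None          # points are (near-)collinear
    n = n / norm
    return n, -n @ p0

def ransac_plane(points, iters=200, tol=0.02, rng=np.random.default_rng(0)):
    """Return ((normal, offset), inlier_count) for the best plane found."""
    best = (None, 0)
    for _ in range(iters):
        i, j, k = rng.choice(len(points), 3, replace=False)
        plane = fit_plane(points[i], points[j], points[k])
        if plane is None:
            continue
        n, d = plane
        inliers = np.abs(points @ n + d) < tol   # distance test, one matmul
        if inliers.sum() > best[1]:
            best = ((n, d), inliers.sum())
    return best
```

The inner loop is a single matrix-vector product over the whole cloud, which is why this style of geometry extraction maps so well onto GPU/NPU hardware.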
The investors I've talked to seem to understand that scan-to-CAD is a real problem with a viable market, automating 5Bn/yr of manual click-labor. But they want to see traction in the form of early sales of the MVP, which is understandable, especially in the current regime of high interest rates.
I've not been able to get across to potential investors the vast implications for robotics, AI, AR, VR, and VFX that having better / faster / realtime 3D reconstruction will bring. It's great that someone of the caliber of Fei-Fei Li is talking about it.
Robots that interact in the real world will need to make a 3D model in realtime and likely share it efficiently with comrades.
While a gaussian splat model is more efficient than a pointcloud, a model which recognizes a wall as a quad plane is much more efficient still, and is needed for realtime communication. There is the old idea that compression is equivalent to AI.
What is stopping us from having a Google Street View v3.0 in which I can zoom right in and walk around a shopping mall, train station, or public building? Our browsers can do this now, essentially rendering Quake-like 3D environments; the problem is turning a scan into a lightweight 3D model.
Photogrammetry, where you have hundreds of photos and reconstruct the 3D scene, uses a lot of compute, and the COLMAP / Structure-from-Motion algorithm predates newer ML approaches and is ripe for a better RL algorithm imo. I've done experiments where you can manually model a 3D scene from well-positioned 360 panorama photos of a building: picking corners, following the outline of walls to make a floorplan, etc. This should be amenable to an RL algorithm. Most 360 panorama photo tours have enough overlap to reconstruct the scene reasonably well.
I have no doubt that we are on the brink of a massive improvement in 3D processing. It's clearly solvable with the ML/RL approaches we currently have .. we don't need AGI. My problem is getting funding to work on it fulltime, equivalently talking an investor into taking that bet :)
Most of the stuff I have been working with has been aimed at low power consumption. One of the things that really helped is not bothering with dense reconstruction at all.
Things like SceneScript and SpaRP, where instead of trying to capture all the geometry (like photogrammetry), the essential dimensions are captured and either output as a text description (SceneScript) or a simple model with decent normals (SpaRP).
Humans don't really keep complex dense reconstructions in our heads. It's all about spatial relationships of landmarks.
Regarding what you say of planes and compression, you can look into metric-based surface remeshing. Essentially, you estimate surface curvature (second derivatives) and use that to distort length computations, remeshing your surface to length one in that distorted space, which then yields optimal DoFs relative to surface approximation error. A plane (or straight line) has 0 curvature, so lengths are infinite along it (hence final DoFs there are minimal). There's software to do that already, though I'm not sure it's robust to your use case, because it's been developed for scientific computing with meshes generated from CAD (presumably smoother than your point-cloud meshes).
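A 1-D analogue of the idea, as a toy sketch (function names are mine, not from any remeshing package): resample a curve so that node spacing shrinks where curvature |f''| is large and grows where the curve is flat, by equidistributing the curvature-weighted "metric length".

```python
import numpy as np

def adapt_nodes(f2, a, b, n_nodes=20, eps=1e-3, samples=2000):
    """Place n_nodes on [a, b] equidistributing sqrt(|f''|) dx.

    f2: second derivative of the curve being approximated.
    eps: floor so perfectly flat regions still get some nodes.
    """
    x = np.linspace(a, b, samples)
    density = np.sqrt(np.abs(f2(x)) + eps)   # metric: denser where curved
    # cumulative metric length via the trapezoid rule
    s = np.concatenate([[0.0], np.cumsum(0.5 * (density[1:] + density[:-1])
                                         * np.diff(x))])
    targets = np.linspace(0.0, s[-1], n_nodes)
    return np.interp(targets, s, x)          # invert s(x) at equal steps

# y = x^4 is flat near 0 and curved near +-1 (f'' = 12 x^2),
# so nodes should cluster at the ends and thin out in the middle
nodes = adapt_nodes(lambda x: 12 * x**2, -1.0, 1.0)
```

This is the same length-distortion trick in one dimension: flat stretches contribute almost no metric length, so they end up with almost no DoFs.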
I'd be really curious to know more about the type of workflow you're interested in, i.e. what does your input look like (do you use some open data sets as well?) and what you hope for in the end (mesh, CAD).
Efficient remeshings are important, and it's worth improving on the current algorithms to get crisper breaklines etc., but you really want to go a step further and do what humans now do manually when they make a CAD model from a pointcloud: convert it to its most efficient / compressed / simple useful format, where a wall face is recognized as a simple plane. Even remeshing and flat-triangle tessellation can be improved a lot by ML techniques.
As with pointclouds, likewise with 'photogrammetry', where you reconstruct a 3D scene from hundreds of photos, or from 360 panoramas or stereo photos. I think in the next 18 months ML will be able to reconstruct an efficient 3D model from a streetview scene or a 360 panorama tour of a building. An optimized mesh is good for visualization in a web browser, but it's even more useful to have a CAD-style model where walls are flat quads, edges are sharp, and a door is tagged as a door etc.
Perhaps the points I'm trying to make are:
- the normal techniques are useful but not quite enough [ heuristics, classical CV algorithms, colmap/SfM ]
- NeRFs and gaussian splats are amazing innovations, but don't quite get us there
- to solve 3D reconstruction, from pointclouds or photos, we need ML to go beyond our normal heuristics: 3D reality is complicated
- ML, particularly RL, will likely solve 3D reconstruction quite soon, for useful things like buildings
- this will unlock a lot of value across many domains - AEC / construction, robotics, VR / AR
- there is low-hanging fruit, such as my algo detecting planes and pipes in a pointcloud
- given the progress and the promise, we should be seeing more investment in this area [ 2Mn of investment could potentially unlock 10Bn/yr in value ]
- 15GB of pointcloud data ( 100Mn xyzRGB points from a lidar laser scanner )
- 3 GB of 360 panorama photos
- 50MB obj 3D textured model
- 2MB CAD model
I'm guessing a gaussian splat would be something like 20x to 40x more efficient than the pointcloud.
I achieved similar compression for building scans, using flat textured mini-planes.

I'm particularly hung up on the data problem she touched on (41 min). She rightly points out that unlike language, where we could bootstrap LLMs with the vast, pre-existing corpus of the internet, there's no equivalent "internet of 3D space." She mentions a "hybrid approach" for World Labs, and that's where the real engineering challenge seems to lie.
My mind immediately goes to the trade-offs. If you lean heavily on synthetic data, you're in a constant battle with the "sim-to-real" gap. It works for narrow domains, but for a general "world model," the physics, lighting, and material properties have to be perfect, which is a monumental task. If you lean on real-world capture (e.g., massive-scale photogrammetry, NeRFs, etc.), the MLOps and data pipeline challenges seem staggering. We're not just talking text files; we're talking about petabytes of structured, multi-sensor data that needs to be processed, aligned, and labeled. It feels like an entirely new class of data infrastructure problem.
Her hiring philosophy of "intellectual fearlessness" (31 min) makes a lot of sense in this context. You'd need a team that's not intimidated by the fact that the foundational dataset for their entire field doesn't even exist yet. They have to build the oil refinery while also figuring out where to drill for oil.
It's exciting to see a team with this much deep learning and computer vision firepower aimed at such a foundational problem. It pulls the conversation away from just optimizing existing architectures and towards creating entirely new categories. It leaves me wondering: what does the "AlexNet moment" for spatial intelligence even look like? Is it a novel model architecture, or is the true breakthrough a new form of data representation that makes this problem tractable at scale?
Also, there’s this weird thing in culture (is it US only?) that whenever an interviewer brings up (even implicitly) the guest’s age, the guest has to make some quip about it as if they’re offended or sensitive to it. So I wouldn’t interpret even a slightly defensive comment about age as an “obsession.”
Given her intellectual stature, Professor Li likely was one of the strongest minds in any room she found herself in and, for the first half of her life, also one of the youngest voices.
Now that she’s entering mid-life, she’s still one of the most powerful minds, but no longer one of the youngest.
It’s something middle-aged thinkers can’t help but notice.
For the rest of us, we can only be grateful to share space and time with such gifted thinkers.
Coincidentally, today is Professor Li’s birthday! [0] I hope I will be around to see many more 3rds of July.
[0] Maybe her coming birthday was on her mind, hence the frequency of her remarks about her relative age.
Fei-Fei Li is known for the creation of ImageNet, which is certainly transformative in the field of computer vision. But the crux of it is painstaking grunt work to create the vast labeled dataset. Fei-Fei Li is a leader who mobilized vast resources and people hours to create this vast dataset. Certainly worth a ton of acclaim. But to claim she's the most brilliant mind in an entire room is a stretch.
Typical? Probably not, but hardly relevant to the truthiness of the claim.
https://community.openai.com/t/time-awareness-in-ai-why-temp...
https://boraerbasoglu.medium.com/the-impact-of-ais-lack-of-t...
Happy to answer questions if you're curious. PS. still in early beta, so please be gentle!
Do you actually pass the images to the model, or just the metadata/stats?
Of course don't make the mistake that we need anything like a human body, or any singular object containing 'intelligence'. That's simply the way nature had to do it to connect a sensor platform to a brain. AI seems much more like it will be a hive mind and distributed system of data collection.
Baby chicks can do bipedal balance pretty much as soon as they dry off.
Wood ducks can visually imprint very soon after hatching and drying off, a couple hours after birth with very limited visual data up until then and no interspersed sleep cycles.
We as humans have natural reactions to snake-like shapes etc. even before encountering the danger of them or learning about it from social cues.
> "trilobite"
The trilobite ancestor had a nervous system before it had an eye. It was able to make decisions and interact with the environment before the ability to see or speak a language.
It feels to me like this basic step is still missing. We haven't even crossed the first AI frontier yet.
Enough said.
Here's an on the fly video I made (no retakes) of Claude generating a Godot scene file.
However, I’ve been trying to use LLMs, both as orchestrators and in other cases to write code, for 2D optimization problems with many spatial relationships, and they have done terribly.
I mean that they can generate 1000s of lines over many rounds of prompting/iteration that solve maybe 30% of the problem (and that 30% is the very easy cases) while completely messing up the rest. When I wrote that code myself, in less than 1000 lines, the “30% part” was maybe 3% of the total code. Even when basically providing pseudocode to solve specific parts of the problem, chances are these LLM solutions would still have many blind spots or issues.
The thing is, this is a 2D problem for which there are basically no resources online, and all the slightly similar problems have careful handcrafted specialized solutions. So I think it has no good frame of reference for how to solve the problem.
Before any questions could be asked, the presenter said “OK, I need to run to give this presentation at the World Economic Forum in Davos now.”, and quite literally ran out of the room.
Once that happens it’s all over.
Is it just me?
For e.g., the form of communication used by bees is very well known now; it involves not just spatial movements but also "buzzing", which is totally similar to the sounds we make, they just lack vocal cords.