2025-07-01

Fei-Fei Li: Spatial intelligence is the next frontier in AI [video]

LLMs vs Computer Vision and Spatial AI

Several commenters feel LLM hype has drained jobs, funding, and mindshare from computer vision, RL, and robotics, even as CVPR-style research continues.
Others note strong recent CV progress (e.g., segmentation, depth, NeRFs, Gaussian splats) and argue LLM advances indirectly accelerate vision via better tooling and compute.
Some sectors (defense, aviation, UAVs, automotive) still depend on classic, real‑time vision; LLMs are seen as unsuitable for tight spatial control loops.
A minority frame the LLM wave as an opportunity: less competition to innovate in under‑funded CV/3D areas.

Spatial Reasoning Limitations of Current Models

Multiple concrete failures are reported: LLMs mishandle basic spatial relationships in geolocation tasks, 2D optimization, and CAD/OpenSCAD code, and even miscount the sides of polygons.
In a detailed geolocation case, the model could identify the city/area from a low‑quality image but repeatedly failed to place crosswalks and buildings consistently in a bird’s‑eye schematic, despite step‑by‑step corrections.
Text-to-image pipelines are seen as especially weak: the text understanding may be fine, but translation into coherent spatial layouts often collapses.

Is Spatial AI Fundamentally Harder?

One line of argument: real‑world spatiotemporal dynamics are sparse, nonlinear and structurally different from sequence prediction; existing public CS literature lacks general, scalable representations of arbitrary spatial relationships.
The same commenter cites non‑public government research into high‑dimensional “cutting” data structures for complex geometry and claims that universal solutions cannot exist.
Others push back, citing practical successes (video models, NeRFs, 3D Gaussians, geometric methods) and questioning both the “impossible in principle” framing and the reliance on undocumented “dark” research.
Debate emerges over whether transformer‑based multimodal models already provide a viable path to spatial reasoning, or whether deeper theoretical breakthroughs in data structures are needed.

3D Reconstruction and Scan‑to‑CAD

Several practitioners describe work on detecting planes, edges, and pipes from point clouds, compressing large scans into efficient CAD‑like models.
There is optimism that RL/ML can soon outperform classical photogrammetry and SfM (e.g., COLMAP) for buildings and indoor scenes, unlocking value across construction, robotics, AR/VR, and mapping.
Funding remains challenging: investors want near‑term traction, while researchers emphasize broader, longer‑term implications.

Data, Embodiment, and Environments

Commenters pick up on Li’s “no internet of 3D space” point: spatial AI lacks an equivalent of massive text corpora.
Two main data strategies are discussed:
- Synthetic/game‑engine worlds: scalable but plagued by sim‑to‑real gaps.
- Real‑world capture (multi-sensor, multi-view): realistic but creates huge MLOps challenges around storage, alignment, labeling, and representation.
Some argue intelligence must be embodied and embedded in an environment; proposals include fleets of simple robots gathering experience in shared “playpens,” or highly realistic simulations.
Others note that humans function with coarse heuristics; a “child‑level” spatial understanding may be useful long before precise physical world models are achieved.

Human Spatial Intelligence Analogies

People discuss wide individual variation in spatial skills, people with aphantasia who report no spatial deficits, and the “car‑proprioception” drivers feel when parking.
There is debate on how much spatial ability is innate vs learned; examples from animals (chicks, horses, ducks) are cited as evidence of hard‑wired spatial/visual competencies, with some skepticism and counter‑links.

Reactions to the Talk and Li’s Role

Many praise the talk as a rare, de‑hyped framing of what comes after language‑centric AI, especially her focus on spatial intelligence and data problems.
Her hiring emphasis on “intellectual fearlessness” is seen as appropriate for building entirely new datasets and infrastructures.
A side thread discusses her remarks about age; some view them as natural context, others see mild over‑emphasis, and there is minor debate over the extent of her “genius” status.
