AGI is not multimodal

Scope and meaning of AGI and “intelligence”

  • Many argue “AGI” is poorly defined or even meaningless; the same goes for “intelligence” itself, since there is no agreed‑upon test that can only be passed by something genuinely intelligent.
  • Some suggest humans aren’t truly “general” either, since any individual covers only a subset of domains, yet the underlying brain architecture is general and can be specialized to almost any of them.
  • Others propose a practical definition: an architecture that can be cheaply copied and fine‑tuned across many tasks is “functionally AGI,” in which case current large models may already cover a big fraction of what’s needed.

Embodiment, world models, and the physical environment

  • A major thread supports the article’s claim that general intelligence requires being situated in, and acting on, an environment (physical, or a sufficiently rich digital one), learning through interaction and its consequences.
  • Examples include self‑driving cars failing in situations where human drivers “just know” how vehicles behave, and developmental‑psychology notions of enriched environments, episodic memory, and executive control.
  • Some broaden embodiment to any agent loop with perception and action, including purely digital environments such as office software or simulations, while others insist physical reality is uniquely constraining and essential (a minimal sketch of such a loop follows this list).
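
A minimal sketch of the “agent loop” notion of embodiment discussed above. The `Environment` and `Agent` interfaces here are hypothetical, not from the article; the point is only that perception, action, and consequences form a closed loop, whether the environment is physical or purely digital.

```python
# Hypothetical interfaces illustrating the perception-action-learning loop;
# not from the article or any specific library.
from typing import Any, Protocol


class Environment(Protocol):
    def observe(self) -> Any: ...               # current percept (pixels, text, sensor readings)
    def step(self, action: Any) -> float: ...   # apply an action, return its consequence/reward


class Agent(Protocol):
    def act(self, percept: Any) -> Any: ...                                # choose an action
    def learn(self, percept: Any, action: Any, outcome: float) -> None: ...  # update from feedback


def run_agent_loop(agent: Agent, env: Environment, steps: int = 1000) -> None:
    """Closed loop: the agent only ever learns through the consequences
    of its own actions in the environment it is embedded in."""
    for _ in range(steps):
        percept = env.observe()
        action = agent.act(percept)
        outcome = env.step(action)
        agent.learn(percept, action, outcome)
```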

Multimodality and senses

  • Many commenters say AGI “must” be multimodal (at least vision and audio), but disagree on whether it needs every human sense (smell, taste) or human‑like sensory ranges.
  • Disability analogies (blind or deaf humans) are used on both sides: to argue senses are not essential to intelligence, or to argue they shape the scope and kind of understanding.
  • There is interest in non‑human modalities (infrared, electromagnetic fields) and in the idea that more modalities increase the ability to cross‑correlate and generalize.

LLMs, next‑token prediction, and world models

  • One side criticizes the article’s framing of LLMs as “just next‑token predictors,” noting that this interface does not constrain internal computation (see the sketch after this list); long, coherent outputs imply substantial internal planning and semantic structure.
  • Others stress that current models operate over symbol statistics, not grounded experience; coherence and “reasoning” may be sophisticated mimicry of human text rather than evidence of learned causal world models.
  • Debate continues over whether semantics can ultimately “reduce” to patterns over symbols, or whether sensorimotor grounding and continuous experience of time are indispensable.
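
A minimal sketch of the next‑token interface at the center of this debate. The external contract is tiny (given a prefix, emit a distribution over the next token), yet nothing in it constrains what the model computes internally to produce those probabilities. `Model` is a hypothetical interface, not any specific library’s API.

```python
# Generic autoregressive decoding loop; Model.next_token_probs is a
# hypothetical interface standing in for any next-token predictor.
import random
from typing import Protocol, Sequence


class Model(Protocol):
    def next_token_probs(self, prefix: Sequence[int]) -> dict[int, float]:
        """Probability for each candidate next token given the prefix.
        Internally this may involve arbitrarily deep planning or representation."""
        ...


def generate(model: Model, prompt: Sequence[int], max_new_tokens: int, eos: int) -> list[int]:
    """Repeatedly sample one token and append it: long, coherent outputs
    emerge from iterating this narrow interface."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        probs = model.next_token_probs(tokens)
        candidates, weights = zip(*probs.items())
        token = random.choices(candidates, weights=weights, k=1)[0]
        tokens.append(token)
        if token == eos:
            break
    return tokens
```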

Research directions, capital, and practicality

  • Some recount attempts at embodied AI (e.g., robotics startups) that burned large sums without clear success, contrasting this with the immediate commercial value of LLMs, which pulls funding away from the harder embodiment work.
  • A few see the paper as mostly restating that “we need to understand how to build AGI to build AGI,” offering little concrete technical guidance; others find it a valuable synthesis that pushes beyond naive multimodal “model‑gluing” and scale‑only strategies.