Voyager – An interactive video generation model with realtime 3D reconstruction

World modeling: 2D vs 3D and human perception

  • Strong pushback against the idea that “human perception is 2D.”
  • Commenters stress multi-sensory, multi-dimensional perception: stereo vision, monocular depth cues, proprioception, vestibular system, touch and even distributed muscle sensors contributing to a 3D (or higher‑D) internal model.
  • Debate over whether individual receptors are 0D/1D/2D, but broad agreement that the perceived world is 3D+time, not flat images.
  • For AI, some argue that models can stick to 2D views and learn depth implicitly; others advocate richer inputs (stereo, multi-view) to make learning 3D structure easier. The “Bitter Lesson” is invoked on both sides: either as an argument against hand‑encoding 3D, or as irrelevant to the question of data richness.

Capabilities, limitations, and use cases

  • Many see this as a notable step beyond older “2D background + sprite” tricks and prior image‑to‑3D attempts that quickly break.
  • Enthusiasm for VR/AR and “holodeck”-style experiences, but skepticism about current feasibility: high-res, 120fps, stereo, low latency, and consistent geometry are still far off.
  • Some propose precomputing 3D scenes from photos for VR, games, or Flight Simulator–like worlds, or reconstructing navigable scenes from street‑level imagery.
  • Others discuss niche uses (e.g., reconstructing riverbeds from partial data), with caveats that generative hallucinations may be unacceptable for scientific or engineering tasks.
  • There is confusion over whether this can “replace LiDAR”; the consensus is no, because the output is generated rather than directly measured.
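
To put the VR feasibility concern in numbers, here is a minimal Python sketch of the per-frame time budget implied by a target frame rate. The function name is hypothetical, and the assumption that stereo views are generated one after another on a single device is ours, not something stated in the thread.

```python
def frame_budget_ms(fps: float, views: int = 1) -> float:
    """Milliseconds available per rendered view at a given frame rate.

    Assumes (hypothetically) that all views of a frame are produced
    sequentially on one device, so the budget is split evenly.
    """
    if fps <= 0 or views <= 0:
        raise ValueError("fps and views must be positive")
    return 1000.0 / fps / views


if __name__ == "__main__":
    # Mono at 120 fps versus stereo (two views per frame) at 120 fps.
    print(f"mono:   {frame_budget_ms(120):.2f} ms per view")
    print(f"stereo: {frame_budget_ms(120, views=2):.2f} ms per view")
```

At 120 fps the whole pipeline has roughly 8 ms per frame, and about half that per eye under the sequential-stereo assumption, which is why commenters consider holodeck-style use far off.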

Quality, consistency, and “world model” skepticism

  • Multiple commenters note that the demo clips are short, use a narrow FOV, and never complete a full 360° spin; they see this as a red flag regarding true object persistence.
  • Depth maps and 3D point fusion could, in theory, enable full rotations, but inconsistencies across frames would cause blur and artifacts.
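
As a rough illustration of the depth-map fusion idea above, the following Python sketch back-projects per-frame depth maps into 3D points and merges them into one cloud. It assumes a simple pinhole camera model; the function names, the intrinsics (fx, fy, cx, cy), and the 4x4 camera-to-world pose format are illustrative assumptions and are not taken from Voyager's code.

```python
import numpy as np


def unproject_depth(depth, fx, fy, cx, cy):
    """Back-project an (H, W) depth map to (H*W, 3) camera-space points.

    Assumes a pinhole camera with focal lengths fx, fy and principal
    point (cx, cy), all in pixels.
    """
    h, w = depth.shape
    # u holds column indices, v holds row indices; both have shape (H, W).
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return np.stack([x, y, depth], axis=-1).reshape(-1, 3)


def to_world(points_cam, cam_to_world):
    """Apply a 4x4 camera-to-world transform to (N, 3) points."""
    ones = np.ones((len(points_cam), 1))
    homogeneous = np.hstack([points_cam, ones])
    return (homogeneous @ cam_to_world.T)[:, :3]


def fuse(frames):
    """Naively concatenate the world-space points of all frames.

    frames: iterable of (depth, (fx, fy, cx, cy), cam_to_world) tuples.
    """
    return np.vstack([
        to_world(unproject_depth(depth, *intrinsics), pose)
        for depth, intrinsics, pose in frames
    ])


if __name__ == "__main__":
    intrinsics = (2.0, 2.0, 1.5, 1.5)
    flat_wall = np.full((4, 4), 2.0)  # every pixel 2 units away
    # Second frame sees the same wall, but its depth estimate is 5% off.
    cloud = fuse([
        (flat_wall, intrinsics, np.eye(4)),
        (flat_wall * 1.05, intrinsics, np.eye(4)),
    ])
    print(cloud.shape, np.unique(cloud[:, 2]))
```

In the usage example, the same surface ends up as two slightly offset layers in the fused cloud. That is the kind of cross-frame inconsistency commenters expect to show up as blur and artifacts during a full rotation.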

Hardware demands and practicality

  • The reported requirement of 60GB of GPU RAM for 540p output is viewed as extremely heavy; some see the model as research‑only for now, while others point to cloud GPUs and multi‑GPU setups as workarounds.

License, “open source,” and regional bans

  • Many stress that this is not open source in the usual sense: it uses a custom license, ships no training data, restricts use for improving other models, and sets MAU (monthly active user) thresholds requiring Tencent’s approval.
  • Debate on what the “preferred form of modification” is: weights vs training data.
  • The exclusion of the EU, UK, and South Korea is widely attributed to AI/data regulation risk (especially the EU AI Act); some see it as justified caution, others as “malicious compliance” or anti‑competitive behavior.
  • The acceptable use policy (no misinformation, election influence, military use, etc.) is seen by some as reasonable guardrails and by others as unenforceable or self‑contradictory.