2025-09-03

Voyager – An interactive video generation model with realtime 3D reconstruction

World modeling: 2D vs 3D and human perception

Strong pushback against the idea that “human perception is 2D.”
Commenters stress multi-sensory, multi-dimensional perception: stereo vision, monocular depth cues, proprioception, vestibular system, touch and even distributed muscle sensors contributing to a 3D (or higher‑D) internal model.
Debate over whether individual receptors are 0D/1D/2D, but broad agreement that the perceived world is 3D+time, not flat images.
For AI, some argue you can stick to 2D views and let models implicitly learn depth; others advocate richer inputs (stereo, multi-view) to make learning 3D structure easier. The “Bitter Lesson” is invoked on both sides: as an argument against hand‑encoding 3D structure, or as irrelevant to the question of how rich the input data should be.
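The claim that stereo input makes 3D structure easier to learn can be illustrated with the standard rectified-stereo relation, depth = focal length × baseline / disparity. The sketch below is only an illustration: the focal length (700 px) and baseline (0.065 m, roughly a human eye separation) are assumed values, not figures from the thread or from Voyager.

```python
def depth_from_disparity(focal_px, baseline_m, disparity_px):
    """Depth in metres of a point seen by a rectified stereo pair.

    focal_px     -- focal length in pixels (assumed value)
    baseline_m   -- distance between the two cameras in metres (assumed value)
    disparity_px -- horizontal pixel shift of the point between the two views
    """
    if disparity_px <= 0:
        raise ValueError("disparity must be positive for a point in front of the cameras")
    return focal_px * baseline_m / disparity_px


# Hypothetical camera: 700 px focal length, 6.5 cm baseline.
FOCAL_PX = 700.0
BASELINE_M = 0.065

near = depth_from_disparity(FOCAL_PX, BASELINE_M, 10.0)  # large shift, close point
far = depth_from_disparity(FOCAL_PX, BASELINE_M, 1.0)    # small shift, distant point
print(f"10 px disparity -> {near:.2f} m, 1 px disparity -> {far:.2f} m")
```

With two views, depth falls out of a single division per pixel. A model given only one 2D view has to infer the same quantity from learned cues, which is the trade-off the two camps are arguing about.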

Capabilities, limitations, and use cases

Many see this as a notable step beyond older “2D background + sprite” tricks and prior image‑to‑3D attempts that quickly break.
Enthusiasm for VR/AR and “holodeck”-style experiences, but skepticism about current feasibility: delivering high resolution, 120 fps, stereo rendering, low latency, and consistent geometry all at once is still far off.
Some propose precomputing 3D scenes from photos for VR, games, or Flight Simulator–like worlds, or reconstructing navigable scenes from street‑level imagery.
Others discuss niche uses (e.g., reconstructing riverbeds from partial data), with caveats that generative hallucinations may be unacceptable for scientific or engineering tasks.
There is confusion over whether this can “replace LiDAR”; the consensus is no, because the output is generated rather than directly measured.

Quality, consistency, and “world model” skepticism

Multiple commenters note that the demo clips are short, use a narrow field of view, and never complete a full 360° spin; they see this as a red flag for true object persistence.
Depth maps and 3D point fusion could, in theory, enable full rotations, but inconsistencies across frames would cause blur and artifacts.
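The fusion concern can be made concrete with a small sketch. It assumes a pinhole camera with hypothetical intrinsics, cameras that differ only by translation, and a per-frame relative depth error standing in for generative inconsistency. None of these details come from the Voyager release; the function names are illustrative.

```python
import math


def project(p_cam, fx, fy, cx, cy):
    """Pinhole projection of a camera-space point to pixel coordinates."""
    x, y, z = p_cam
    return (fx * x / z + cx, fy * y / z + cy)


def backproject(u, v, z, fx, fy, cx, cy):
    """Lift a pixel with depth z back to a camera-space 3D point."""
    return ((u - cx) * z / fx, (v - cy) * z / fy, z)


def fusion_spread(world_point, cam_positions, depth_errors, intrinsics):
    """Reconstruct one world point from several frames and measure disagreement.

    depth_errors holds a relative depth error per frame (0.05 = 5% too deep),
    a stand-in for frame-to-frame inconsistency in generated depth maps.
    Returns the reconstructed points and their maximum pairwise distance.
    """
    fx, fy, cx, cy = intrinsics
    recon = []
    for pos, err in zip(cam_positions, depth_errors):
        # Translation-only camera: subtract the camera position.
        p_cam = tuple(w - c for w, c in zip(world_point, pos))
        u, v = project(p_cam, fx, fy, cx, cy)
        z_pred = p_cam[2] * (1.0 + err)  # inconsistent predicted depth
        p_back = backproject(u, v, z_pred, fx, fy, cx, cy)
        recon.append(tuple(a + b for a, b in zip(p_back, pos)))
    spread = max(math.dist(a, b) for a in recon for b in recon)
    return recon, spread


# Hypothetical setup: a point 4 m ahead, two cameras 0.5 m apart.
INTRINSICS = (700.0, 700.0, 320.0, 240.0)
POINT = (0.0, 0.0, 4.0)
CAMERAS = [(0.0, 0.0, 0.0), (0.5, 0.0, 0.0)]

_, consistent = fusion_spread(POINT, CAMERAS, [0.0, 0.0], INTRINSICS)
_, inconsistent = fusion_spread(POINT, CAMERAS, [0.0, 0.05], INTRINSICS)
print(f"consistent depth: {consistent:.4f} m, 5% error in one frame: {inconsistent:.4f} m")
```

With consistent depth, both frames reconstruct the same point. A 5% depth error in one frame places its copy of the point roughly 0.2 m away from the other, so a fused point cloud would show a smeared or doubled surface, which is the blur commenters expect.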

Hardware demands and practicality

The 60 GB of GPU memory required for 540p output is viewed as extremely heavy; some see the model as research‑only for now, while others note cloud GPUs and multi‑GPU setups as workarounds.

License, “open source,” and regional bans

Many stress this is not open source in the usual sense: custom license, no training data, restrictions on improving other models, MAU thresholds requiring Tencent’s approval.
Debate on what the “preferred form of modification” is: weights vs training data.
Exclusion of the EU, UK, and South Korea is widely attributed to AI/data regulation risk (especially the EU AI Act), seen by some as justified caution and by others as “malicious compliance” or anti‑competitive.
The acceptable use policy (no misinformation, election influence, military use, etc.) is seen by some as reasonable guardrails, by others as unenforceable or self‑contradictory.
