Voyager – An interactive video generation model with realtime 3D reconstruction
World modeling: 2D vs 3D and human perception
- Strong pushback against the idea that “human perception is 2D.”
- Commenters stress multi-sensory, multi-dimensional perception: stereo vision, monocular depth cues, proprioception, the vestibular system, touch, and even distributed muscle sensors all contribute to a 3D (or higher-D) internal model.
- Debate over whether individual receptors are 0D/1D/2D, but broad agreement that the perceived world is 3D+time, not flat images.
- For AI, some argue you can stick to 2D views and let models implicitly learn depth; others advocate richer inputs (stereo, multi-view) to make learning 3D structure easier (the standard stereo depth relation is sketched after this list). The “Bitter Lesson” is invoked by both sides, either as an argument against hand-encoding 3D or as being irrelevant to how rich the input data is.
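As a point of reference for the stereo-input argument, here is a minimal sketch of the standard rectified-stereo relation (depth = focal length × baseline / disparity); the function name and numbers are illustrative, not drawn from Voyager.

```python
import numpy as np

def disparity_to_depth(disparity_px, focal_px, baseline_m):
    """Rectified-stereo relation: depth = focal_px * baseline_m / disparity_px."""
    d = np.atleast_1d(np.asarray(disparity_px, dtype=np.float64))
    depth = np.full_like(d, np.inf)   # zero disparity means the point is at infinity
    valid = d > 0
    depth[valid] = focal_px * baseline_m / d[valid]
    return depth

# Illustrative numbers: 800 px focal length, 6.5 cm baseline (roughly human eye spacing).
print(disparity_to_depth([16.0, 64.0, 0.0], focal_px=800.0, baseline_m=0.065))
# -> [3.25    0.8125      inf]
```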
Capabilities, limitations, and use cases
- Many see this as a notable step beyond older “2D background + sprite” tricks and prior image‑to‑3D attempts that quickly break.
- Enthusiasm for VR/AR and “holodeck”-style experiences, but skepticism about current feasibility: high resolution, 120 fps, stereo, low latency, and consistent geometry are still far off (a rough frame-budget calculation follows this list).
- Some propose precomputing 3D scenes from photos for VR, games, or Flight Simulator–like worlds, or reconstructing navigable scenes from street‑level imagery.
- Others discuss niche uses (e.g., reconstructing riverbeds from partial data), with caveats that generative hallucinations may be unacceptable for scientific or engineering tasks.
- There is confusion over whether this can “replace LiDAR”; the consensus is no—this is generative, not direct measurement.
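To put the VR feasibility concern in concrete terms, here is a back-of-the-envelope frame-budget calculation under the assumptions commenters cite (stereo at 120 fps); the numbers are illustrative, not measurements of Voyager.

```python
# Rough frame budget for the VR scenario described above (illustrative numbers).
target_fps = 120            # commonly cited refresh rate for comfortable VR
eyes = 2                    # stereo roughly doubles the per-frame work
views_per_second = target_fps * eyes
budget_ms = 1000.0 / views_per_second
print(f"{views_per_second} views/s -> {budget_ms:.2f} ms per view if generated sequentially")
# 240 views/s -> ~4.17 ms per view, before any network or display latency.
# Generative video models currently tend to need on the order of seconds per
# frame, which is where the "still far off" skepticism comes from.
```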
Quality, consistency, and “world model” skepticism
- Multiple commenters note that the demo clips are short, have a narrow FOV, and never do a full 360° spin; they see this as a red flag for true object persistence.
- Depth maps and 3D point fusion could, in theory, enable full rotations (a back-projection sketch follows below), but inconsistencies across frames would cause blur and artifacts.
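For context on what “depth maps and 3D point fusion” would involve, a minimal sketch of pinhole back-projection into a world-space point cloud; the intrinsics, pose convention, and names are assumptions for illustration, not Voyager’s actual pipeline.

```python
import numpy as np

def backproject_depth(depth, fx, fy, cx, cy, cam_to_world):
    """Lift an (H, W) metric depth map into world-space 3D points (pinhole model)."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))            # pixel coordinates
    x = (u - cx) / fx * depth
    y = (v - cy) / fy * depth
    pts_cam = np.stack([x, y, depth, np.ones_like(depth)], axis=-1).reshape(-1, 4)
    return (cam_to_world @ pts_cam.T).T[:, :3]                # drop homogeneous coordinate

# Tiny synthetic example: a flat 2x2 depth map one meter in front of an identity pose.
flat = np.full((2, 2), 1.0)
print(backproject_depth(flat, fx=1.0, fy=1.0, cx=0.5, cy=0.5, cam_to_world=np.eye(4)))

# Fusing per-frame clouds is conceptually just concatenation; the problem the
# comment raises is that generated depths and poses drift between frames, so
# the union of clouds smears into blur and artifacts.
```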
Hardware demands and practicality
- 60 GB of GPU RAM for 540p output is viewed as extremely heavy; some see this as research-only for now, while others point to cloud GPUs and multi-GPU setups as workarounds (a rough aggregate-memory check is sketched below).
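A rough sanity check one might run before attempting the multi-GPU workaround; only the ~60 GB figure comes from the reported requirement, and aggregate memory is a necessary rather than sufficient condition once sharding overhead and communication are counted.

```python
import torch

# Does this node's aggregate GPU memory plausibly cover a ~60 GB workload?
# (Necessary, not sufficient: model sharding adds overhead and communication.)
required_gb = 60
total_gb = sum(
    torch.cuda.get_device_properties(i).total_memory
    for i in range(torch.cuda.device_count())
) / 1e9
print(f"{torch.cuda.device_count()} GPU(s), {total_gb:.0f} GB total: "
      f"{'plausibly enough' if total_gb >= required_gb else 'not enough'} for ~{required_gb} GB")
```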
License, “open source,” and regional bans
- Many stress this is not open source in the usual sense: a custom license, no training data, restrictions on using it to improve other models, and MAU thresholds above which Tencent’s approval is required.
- Debate on what the “preferred form of modification” is: weights vs training data.
- Exclusion of EU, UK, and South Korea is widely attributed to AI/data regulation risk (esp. the EU AI Act), seen by some as justified caution and by others as “malicious compliance” or anti‑competitive.
- The acceptable use policy (no misinformation, no election influence, no military use, etc.) is seen by some as reasonable guardrails and by others as unenforceable or self-contradictory.