Diffusion models are real-time game engines
What the system actually does
- Trains a Stable Diffusion 1.4–based model on ~900M–1B Doom frames plus recorded actions.
- At inference, predicts the next frame conditioned on a short history of previous frames (with noise added) and player actions.
- Adding noise to the context frames during training is said to be critical for reducing “auto-regressive drift” and keeping generated frames temporally consistent.
- Several commenters stress: the diffusion model itself is stateless; any apparent memory comes from conditioning on recent frames and inputs.
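The loop described above can be sketched in a few lines of Python. This is a toy illustration, not the paper's implementation: `toy_next_frame` is a hypothetical stand-in for the diffusion model, frames are short lists of floats rather than images, and `CONTEXT_LEN` and `noise_std` are assumed values chosen for illustration. It shows the structure commenters describe: each step is a pure function of a bounded window of recent frames plus the current action, and each output is fed back in as input.

```python
import random
from collections import deque

CONTEXT_LEN = 4  # assumed window size, for illustration only


def toy_next_frame(context, action):
    """Hypothetical stand-in for the learned model.

    A pure function: the output depends only on the frames in `context`
    and on `action`, never on anything stored between calls.
    """
    width = len(context[0])
    # Blend the context frames pixel by pixel.
    blended = [sum(frame[i] for frame in context) / len(context) for i in range(width)]
    # Treat the action as a horizontal shift (a crude "turn").
    shift = action % width
    return blended[-shift:] + blended[:-shift] if shift else blended


def add_noise(frames, noise_std, rng):
    """Corrupt context frames with Gaussian noise; noise_std is an assumed knob."""
    return [[pixel + rng.gauss(0.0, noise_std) for pixel in frame] for frame in frames]


def rollout(initial_frames, actions, noise_std=0.0, seed=0):
    """Generate one frame per action, feeding each output back as context."""
    rng = random.Random(seed)
    history = deque(initial_frames, maxlen=CONTEXT_LEN)
    generated = []
    for action in actions:
        context = add_noise(list(history), noise_std, rng)
        frame = toy_next_frame(context, action)
        history.append(frame)  # the oldest frame drops out of the window
        generated.append(frame)
    return generated
```

With `noise_std=0.0` the rollout is deterministic; raising it perturbs the context frames without changing the structure of the loop. The only thing carried from one step to the next is the `history` window, which is the sense in which the model itself is stateless.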
Interactivity and evaluation (contested)
- The project page and some parts of the paper say humans can play at ~20 FPS and that videos are “real-time recordings of people playing”.
- Others read the paper as training primarily on gameplay from RL agents and evaluating only via short, non-interactive clips shown to raters.
- Overall, the thread’s consensus is that interactive play is intended and probably exists, but commenters find the paper’s description of human gameplay unclear or underspecified.
Limitations and artifacts
- No explicit global world state: enemies respawn or shift, objects appear/disappear, ammo/health counters fluctuate, geometry “swims” or changes when revisited.
- Backtracking often reveals major inconsistencies (walls move, pickups change), likened to dreams or hallucinations.
- The model appears overfit to particular maps and bot-like movement; unusual player behavior may cause rapid breakdown.
- Counting and rule-consistency (e.g., damage ticks in slime, shots-to-kill) are unreliable.
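Why backtracking breaks can be illustrated with a toy comparison. The sketch below is hypothetical and not taken from the paper: `engine_view` stands for a conventional engine that keeps an explicit flag, `windowed_view` for a predictor that sees only the last `CONTEXT_LEN` events, and the event names and window size are invented for illustration.

```python
CONTEXT_LEN = 3  # assumed window size, for illustration only


def engine_view(events):
    """Conventional engine: explicit state persists for the whole session."""
    pickup_taken = False
    for event in events:
        if event == "take_pickup":
            pickup_taken = True
    return "room_empty" if pickup_taken else "room_with_pickup"


def windowed_view(events):
    """Window-only predictor: may look at the last CONTEXT_LEN events only."""
    recent = events[-CONTEXT_LEN:]
    if "take_pickup" in recent:
        return "room_empty"
    # Nothing in the window says the pickup is gone, so fall back to the
    # room's default appearance (an assumption about what a model would do).
    return "room_with_pickup"


short_detour = ["take_pickup", "walk", "walk"]
long_detour = ["take_pickup"] + ["walk"] * 5

print(engine_view(short_detour), windowed_view(short_detour))
print(engine_view(long_detour), windowed_view(long_detour))
```

After the short detour both views agree that the room is empty. After the long one the pickup event has left the window, so the windowed view shows the pickup again while the engine does not, which is the kind of inconsistency listed above.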
Is this really a “game engine”?
- Many argue it’s more like:
  - “The world’s least efficient video codec.”
  - A dreamlike Doom emulator / interactive video, not a reusable engine.
- Key critique: a game engine should:
  - Work on new content, not only mimic a specific game.
  - Expose and control rules, state, and assets; this model only maps [recent pixels + inputs] → pixels.
- Others counter that if it can drive interactive play, it qualifies functionally as an engine, just learned rather than coded.
Compute, compression, and prior art
- Commenters draw a strong contrast between Doom’s tiny original hardware requirements and a multi‑GB diffusion model running on a TPU.
- Commenters note huge redundancy: the model could trivially “store” the game many times over, so it’s more an inefficient learned replica than true compression.
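The scale of that redundancy can be estimated with back-of-envelope arithmetic. The figures below are rough assumptions supplied for illustration, not numbers from the thread: about 860 million parameters for a Stable Diffusion 1.x denoiser, and 4 MB (Doom’s commonly cited minimum RAM) as a reference footprint for the original game. The helper names are hypothetical.

```python
ASSUMED_PARAM_COUNT = 860_000_000        # rough size of an SD 1.x denoiser; assumption
ASSUMED_REFERENCE_BYTES = 4 * 1024 ** 2  # 4 MB, Doom's commonly cited minimum RAM; assumption


def model_size_bytes(param_count, bytes_per_param):
    """Raw weight storage, ignoring activations and any auxiliary networks."""
    return param_count * bytes_per_param


def size_ratio(model_bytes, reference_bytes):
    """How many times the reference footprint fits into the weights."""
    return model_bytes / reference_bytes


fp16_ratio = size_ratio(model_size_bytes(ASSUMED_PARAM_COUNT, 2), ASSUMED_REFERENCE_BYTES)
fp32_ratio = size_ratio(model_size_bytes(ASSUMED_PARAM_COUNT, 4), ASSUMED_REFERENCE_BYTES)

print(f"fp16 weights: about {fp16_ratio:.0f}x the reference footprint")
print(f"fp32 weights: about {fp32_ratio:.0f}x the reference footprint")
```

Under these assumed figures the weights alone are hundreds of times larger than the reference footprint at either precision, which is the point commenters make about storing the game many times over.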
- Related earlier work like GameGAN is mentioned; this diffusion-based approach is seen as a more powerful but still narrow evolution.
Speculation and future directions
- Ideas floated:
  - Use diffusion only as a renderer over a simple low‑poly or physics engine.
  - Train on many games or real‑world video to generate new game styles.
  - Apply similar predictive models to robotics, UI/OS rendering, or cloud gaming client-side prediction.
- Several people connect this to predictive-coding theories of the brain and human dreaming.