2024-08-28

Diffusion models are real-time game engines

What the system actually does

Trains a Stable Diffusion 1.4–based model on ~900M–1B Doom frames plus recorded actions.
At inference, predicts the next frame conditioned on a short history of previous frames (with noise added) and player actions.
Adding noise to the context frames during training is said to be critical for reducing “auto-regressive drift” (errors compounding as the model feeds on its own outputs) and for keeping frames temporally consistent.
Several commenters stress that the diffusion model itself is stateless; any apparent memory comes from conditioning on recent frames and inputs.
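
The loop described above can be sketched roughly as follows. This is a toy illustration, not the paper's code: `toy_predict_next` stands in for the diffusion model, the names `CONTEXT_LEN`, `add_noise`, and `rollout` are hypothetical, the window length is an assumption, and frames are short lists of floats rather than images.

```python
import random
from collections import deque

CONTEXT_LEN = 4  # assumed window length, for illustration only


def add_noise(frame, sigma, rng):
    """Corrupt a context frame with Gaussian noise of scale sigma."""
    return [px + rng.gauss(0.0, sigma) for px in frame]


def toy_predict_next(frames, actions):
    """Stand-in for the learned model: a pure function of its inputs.

    It averages the context frames and shifts by the latest action.
    It keeps no state between calls.
    """
    n = len(frames)
    mean = [sum(f[i] for f in frames) / n for i in range(len(frames[0]))]
    return [px + actions[-1] for px in mean]


def rollout(initial_frames, player_actions, sigma=0.0, seed=0):
    """Generate frames autoregressively from a rolling context window."""
    rng = random.Random(seed)
    # deque(maxlen=...) silently drops anything older than the window.
    frames = deque(initial_frames, maxlen=CONTEXT_LEN)
    actions = deque([0] * len(initial_frames), maxlen=CONTEXT_LEN)
    out = []
    for a in player_actions:
        actions.append(a)
        context = [add_noise(f, sigma, rng) for f in frames]
        nxt = toy_predict_next(context, list(actions))
        frames.append(nxt)  # the model's own output becomes future context
        out.append(nxt)
    return out
```

Because the only carry-over between steps is the window itself, two histories that differ solely in frames older than `CONTEXT_LEN` produce identical rollouts in this sketch, which is the sense in which commenters call the model stateless.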

Interactivity and evaluation (contested)

The project page and some parts of the paper say humans can play at ~20 FPS and that videos are “real-time recordings of people playing”.
Others read the paper as training primarily on gameplay from RL agents and evaluating only via short, non-interactive clips shown to human raters.
Overall: community consensus is that interactive play is intended and probably exists, but the paper’s description of human gameplay is seen as unclear or underspecified.

Limitations and artifacts

No explicit global world state: enemies respawn or shift, objects appear/disappear, ammo/health counters fluctuate, geometry “swims” or changes when revisited.
Backtracking often reveals major inconsistencies (walls move, pickups change), likened to dreams or hallucinations.
Model appears overfit to particular maps and bot-like movement; unusual player behavior may cause rapid breakdown.
Counting and rule-consistency (e.g., damage ticks in slime, shots-to-kill) are unreliable.
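
A toy sketch of why backtracking exposes these inconsistencies, under loud assumptions: the three-step window, the event names, and both functions are hypothetical, and real frames are reduced to a single "is the pickup visible" observation (`None` when the room is out of view).

```python
from collections import deque

CONTEXT_LEN = 3  # assumed window length, for illustration only


def explicit_engine_state(events):
    """A conventional engine keeps a persistent flag that survives any detour."""
    pickup_present = True
    for event in events:
        if event == "collect":
            pickup_present = False
    return pickup_present


def windowed_belief(observations):
    """What a context-limited predictor can still know about the pickup.

    Only the last CONTEXT_LEN observations are available. Returns the most
    recent sighting inside the window, or None if the pickup was never in
    view there (so the model has to guess).
    """
    window = deque(observations, maxlen=CONTEXT_LEN)
    seen = [obs for obs in window if obs is not None]
    return seen[-1] if seen else None


events = ["see", "collect", "away", "away", "away"]
observations = [True, False, None, None, None]

print(explicit_engine_state(events))   # engine still knows the pickup is gone
print(windowed_belief(observations))   # window no longer contains that fact
```

After three steps away, the explicit engine reports `False` while the windowed view reports `None`: the fact that the pickup was collected has simply left the context.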

Is this really a “game engine”?

Many argue it’s more like:
- “The world’s least efficient video codec.”
- A dreamlike Doom emulator / interactive video, not a reusable engine.
Key critique: a game engine should:
- Work on new content, not only mimic a specific game.
- Expose and control rules, state, and assets; this model only maps [recent pixels + inputs] → pixels.
Others counter that if it can drive interactive play, it qualifies functionally as an engine, just learned rather than coded.

Compute, compression, and prior art

Strong contrast drawn between Doom’s tiny original requirements and multi‑GB diffusion models on a TPU.
Commenters note huge redundancy: the model could trivially “store” the game many times over, so it’s more an inefficient learned replica than true compression.
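
The redundancy argument is simple arithmetic, sketched below. Both byte counts are illustrative placeholders, not measurements from the paper or the thread: roughly 4 GB for a "multi-GB" checkpoint and roughly 12 MB for a Doom install are assumptions chosen only to show the order of magnitude.

```python
def copies_that_fit(model_bytes: int, game_bytes: int) -> int:
    """How many whole copies of the game fit in the model's storage footprint."""
    return model_bytes // game_bytes


# Assumed, illustrative sizes (not taken from the source).
ASSUMED_MODEL_BYTES = 4 * 10**9   # about 4 GB of weights
ASSUMED_GAME_BYTES = 12 * 10**6   # about 12 MB of game data

print(copies_that_fit(ASSUMED_MODEL_BYTES, ASSUMED_GAME_BYTES))
```

With these placeholder figures the ratio is in the hundreds, which is the scale behind the "store the game many times over" remark.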
Related earlier work like GameGAN is mentioned; this diffusion-based approach is seen as a more powerful but still narrow evolution.

Speculation and future directions

Ideas floated:
- Use diffusion only as a renderer over a simple low‑poly or physics engine.
- Train on many games or real‑world video to generate new game styles.
- Apply similar predictive models to robotics, UI/OS rendering, or cloud gaming client-side prediction.
Several people connect this to predictive-coding theories of the brain and human dreaming.
