Diffusion models are real-time game engines

What the system actually does

  • Trains a Stable Diffusion 1.4–based model on ~900M–1B Doom frames plus recorded actions.
  • At inference, predicts the next frame conditioned on a short history of previous frames and player actions.
  • Adding noise to those context frames during training is said to be critical for reducing “auto-regressive drift” (errors compounding as the model feeds on its own output) and for keeping frames temporally consistent.
  • Several commenters stress: the diffusion model itself is stateless; any apparent memory comes from conditioning on recent frames and inputs.
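
The loop described above can be sketched roughly as follows. This is a toy under stated assumptions, not the paper's implementation: `toy_next_frame` is a hypothetical stand-in for the diffusion denoiser, frames are short lists of numbers rather than images, and `CONTEXT_LEN`, `corrupt_context`, and `rollout` are illustrative names. What it tries to show is that the only "memory" is the sliding window of recent frames and actions passed in on every call.

```python
import random
from collections import deque

CONTEXT_LEN = 4  # hypothetical window size; the real value is a model hyperparameter


def corrupt_context(frames, sigma, rng):
    """Training-time augmentation: add Gaussian noise to the context frames.

    The idea is that the model learns to tolerate imperfect inputs, since at
    inference its context is made of its own (imperfect) predictions.
    """
    return [[p + rng.gauss(0.0, sigma) for p in frame] for frame in frames]


def toy_next_frame(frames, actions):
    """Hypothetical stand-in for the denoiser: a pure function of its inputs.

    It shifts the most recent frame by the most recent action and keeps no
    internal state between calls.
    """
    last, shift = frames[-1], actions[-1]
    n = len(last)
    return [last[(i - shift) % n] for i in range(n)]


def rollout(initial_frames, action_stream, model=toy_next_frame):
    """Auto-regressive generation: each prediction is fed back as context."""
    frames = deque(initial_frames, maxlen=CONTEXT_LEN)
    actions = deque([0] * CONTEXT_LEN, maxlen=CONTEXT_LEN)
    generated = []
    for action in action_stream:
        actions.append(action)
        frame = model(list(frames), list(actions))
        frames.append(frame)  # the oldest frame falls out of the window
        generated.append(frame)
    return generated


if __name__ == "__main__":
    start = [[1, 0, 0, 0]] * CONTEXT_LEN
    print(rollout(start, [1, 1, 0]))
    noisy = corrupt_context(start, sigma=0.1, rng=random.Random(0))
    print(len(noisy), len(noisy[0]))
```

Because `toy_next_frame` is a pure function, calling it twice with the same window gives the same frame; anything older than `CONTEXT_LEN` steps cannot influence the output.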

Interactivity and evaluation (contested)

  • The project page and some parts of the paper say humans can play at ~20 FPS and that videos are “real-time recordings of people playing”.
  • Other commenters read the paper as training primarily on RL-agent gameplay and evaluating only via short, non-interactive clips shown to human raters.
  • Overall: the thread's rough consensus is that interactive play is intended and probably works, but the paper's description of human gameplay is unclear or underspecified.

Limitations and artifacts

  • No explicit global world state: enemies respawn or shift, objects appear/disappear, ammo/health counters fluctuate, geometry “swims” or changes when revisited.
  • Backtracking often reveals major inconsistencies (walls move, pickups change), likened to dreams or hallucinations.
  • The model appears overfit to particular maps and bot-like movement; unusual player behavior may cause rapid breakdown.
  • Counting and rule consistency (e.g., damage ticks in slime, shots-to-kill) are unreliable.
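
One way to picture why backtracking breaks: with no stored world state, any fact that has scrolled out of the context window is simply gone. The toy below is a sketch under assumptions, not a model of the actual system; event strings stand in for frames, and `visible_history` and `remembers` are hypothetical helpers.

```python
from collections import deque


def visible_history(events, context_len):
    """Return the events still inside a sliding window of size context_len."""
    window = deque(maxlen=context_len)
    for event in events:
        window.append(event)
    return list(window)


def remembers(events, fact, context_len):
    """True if `fact` is still visible to a model limited to the window."""
    return fact in visible_history(events, context_len)


if __name__ == "__main__":
    short_detour = ["pickup_taken"] + ["walk"] * 3
    long_detour = ["pickup_taken"] + ["walk"] * 10
    print(remembers(short_detour, "pickup_taken", context_len=8))  # True
    print(remembers(long_detour, "pickup_taken", context_len=8))   # False
```

After the long detour nothing in the window says the pickup was taken, so a model limited to that window has to guess when the player returns.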

Is this really a “game engine”?

  • Many argue it’s more like:
    • “The world’s least efficient video codec.”
    • A dreamlike Doom emulator / interactive video, not a reusable engine.
  • Key critique: a game engine should:
    • Work on new content, not only mimic a specific game.
    • Expose and control rules, state, and assets; this model only maps [recent pixels + inputs] → pixels.
  • Others counter that if it can drive interactive play, it qualifies functionally as an engine, just learned rather than coded.
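
The interface difference behind this critique can be sketched as below. Everything here is hypothetical and only illustrates the shape of the argument: `GameState`, `ToyEngine`, and `learned_step` are made-up names, and the "frames" are plain strings.

```python
from dataclasses import dataclass


@dataclass
class GameState:
    """Explicit, inspectable world state (hypothetical fields)."""
    ammo: int = 50
    health: int = 100


class ToyEngine:
    """Conventional-engine shape: rules are code, state is exposed."""

    def __init__(self):
        self.state = GameState()

    def step(self, action):
        if action == "fire" and self.state.ammo > 0:
            self.state.ammo -= 1  # the rule is explicit and exact
        return self.render()

    def render(self):
        return f"HUD ammo={self.state.ammo} health={self.state.health}"


def learned_step(recent_frames, recent_actions):
    """Learned-model shape: recent pixels and inputs in, pixels out.

    Hypothetical stub that just repeats the last frame. There is no `state`
    attribute to read or edit; whatever the HUD shows would have to be
    inferred from `recent_frames`.
    """
    return recent_frames[-1]


if __name__ == "__main__":
    engine = ToyEngine()
    print(engine.step("fire"))   # HUD ammo=49 health=100
    engine.state.ammo = 0        # a designer can set state directly
    print(engine.step("fire"))   # HUD ammo=0 health=100
```

With the engine, a designer can set `engine.state.ammo` or change the firing rule; with the pixel-to-pixel function there is no comparable handle, which is the gap the critics point at.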

Compute, compression, and prior art

  • Commenters draw a strong contrast between Doom's tiny original hardware requirements and a multi-GB diffusion model running on a TPU.
  • Commenters note huge redundancy: the model's weights are large enough to “store” the original game many times over, so it's more an inefficient learned replica than true compression.
  • Related earlier work like GameGAN is mentioned; this diffusion-based approach is seen as a more powerful but still narrow evolution.
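
The redundancy point is simple arithmetic. The sizes below are assumed round numbers chosen for illustration, not figures from the discussion: "multi-GB" is taken as 4 GiB and the original game data as about 12 MiB.

```python
def copies_that_fit(model_bytes, game_bytes):
    """How many whole copies of the game data fit in the model's size."""
    return model_bytes // game_bytes


# Assumed, illustrative sizes (not taken from the discussion):
MODEL_BYTES = 4 * 1024**3   # a "multi-GB" checkpoint, taken here as 4 GiB
GAME_BYTES = 12 * 1024**2   # original game data, taken here as about 12 MiB

if __name__ == "__main__":
    print(copies_that_fit(MODEL_BYTES, GAME_BYTES))  # 341
```

Under those assumptions the weights could hold the game data a few hundred times over, which is the sense in which commenters call it a replica and not a compression scheme.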

Speculation and future directions

  • Ideas floated:
    • Use diffusion only as a renderer over a simple low‑poly or physics engine.
    • Train on many games or real‑world video to generate new game styles.
    • Apply similar predictive models to robotics, UI/OS rendering, or cloud gaming client-side prediction.
  • Several people connect this to predictive-coding theories of the brain and human dreaming.