The Curse of Recursion: Training on generated data makes models forget (2023)

Nature of Synthetic vs Real Data

  • Many argue the core issue isn’t “synthetic” per se but low‑quality, lossy, self‑generated data.
  • Fiction (e.g., novels) is defended as real data about language and culture, not “simulated” worlds.
  • Others insist that correctly generated synthetic data can be useful, e.g., game self‑play, simulations, or CGI images, but only if grounded in real distributions.

Information Loss, Entropy, and Feedback Loops

  • Several comments frame recursive training as repeated application of a lossy, non‑invertible function, inevitably degrading information.
  • References to the data processing inequality and entropy: you can’t “cheat” physics; repeatedly applying lossy, compression‑like transforms causes drift toward noise or blandness.
  • Counterpoint: lossy transforms can sometimes help (denoising, structure extraction), so “loss = worse” isn’t universally true.
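The feedback-loop argument can be illustrated with a standard toy model (a sketch, not the paper's actual experiment): fit a Gaussian to a dataset, replace the dataset with samples from the fit, and repeat. Because each fit keeps only two numbers, information is lost every round, and the estimated variance drifts toward zero over generations. All names and parameter values below are illustrative.

```python
import numpy as np

def generational_fit(n=100, generations=300, seed=0):
    """Repeatedly fit a Gaussian to data, then replace the data with
    samples from the fit -- a toy version of training on model output."""
    rng = np.random.default_rng(seed)
    data = rng.normal(0.0, 1.0, n)           # generation 0: "real" data, variance 1
    for _ in range(generations):
        mu, sigma = data.mean(), data.std()  # lossy fit: keeps only 2 numbers
        data = rng.normal(mu, sigma, n)      # next generation sees only model samples
    return data.var()

print(f"variance after 300 generations: {generational_fit():.3f}")
```

Each round multiplies the expected variance by roughly (n − 1)/n, so the distribution slowly narrows and tail behavior disappears first, which matches the “drift toward blandness” framing above.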

Model Collapse and Mitigations

  • Broad agreement: purely replacing real data with model outputs leads to “model collapse” and degradation.
  • Follow‑up research is cited: if synthetic generations are accumulated alongside the original real data rather than replacing it, collapse is avoided and error stays bounded.
  • Some suggest quality filters, human feedback, and metadata (scores, links, timelines) can help exclude junk outputs from future training.
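The replace-versus-accumulate distinction can be sketched in the same toy Gaussian loop (an illustration of the cited finding, not a reproduction of it; names and parameters are mine): replacing the data each generation collapses the variance, while keeping the original real data in a growing pool keeps it roughly stable.

```python
import numpy as np

def run_chain(accumulate, n=100, generations=300, seed=0):
    """Toy fit-and-resample loop. With accumulate=False each generation
    replaces the data with model samples; with accumulate=True new samples
    are appended to a pool that still contains the original real data."""
    rng = np.random.default_rng(seed)
    pool = rng.normal(0.0, 1.0, n)  # generation 0: real data, variance 1
    for _ in range(generations):
        mu, sigma = pool.mean(), pool.std()
        synthetic = rng.normal(mu, sigma, n)
        pool = np.concatenate([pool, synthetic]) if accumulate else synthetic
    return pool.var()

print(f"replace:    {run_chain(accumulate=False):.3f}")   # collapses toward 0
print(f"accumulate: {run_chain(accumulate=True):.3f}")    # stays near 1
```

Intuitively, accumulation works because each new synthetic batch is an ever-smaller fraction of the pool, so the fit stays anchored to the real generation-0 data.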

Human Learning Analogies and Limits

  • Debate over whether humans are “immune”: most say no, since science progresses by adding new experiments (new data) and discarding errors.
  • Comparison: repeated human teaching works because it’s grounded in a stable external reality; current LLMs lack continuous real‑world interaction.

Use Cases for Synthetic Data

  • Synthetic data can work well in narrow, supervised tasks (e.g., balancing labels in classification, distillation from larger to smaller models).
  • Concern that over‑reliance on upscaled or hallucinated data (e.g., “enhanced” license plates) introduces false information and serious downstream risk.
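The label-balancing use case mentioned above can be sketched as a naive jitter-based oversampler (a simplified cousin of techniques like SMOTE; the function name and parameters here are illustrative, not from any library): duplicate minority-class rows with small Gaussian noise until the class reaches a target size.

```python
import numpy as np

def jitter_oversample(X, y, minority_label, target_count, noise=0.05, seed=0):
    """Naive synthetic augmentation: resample minority-class rows with
    small Gaussian jitter until that class has target_count examples."""
    rng = np.random.default_rng(seed)
    minority = X[y == minority_label]
    needed = target_count - len(minority)
    idx = rng.integers(0, len(minority), needed)            # rows to duplicate
    synthetic = minority[idx] + rng.normal(0.0, noise, (needed, X.shape[1]))
    X_aug = np.vstack([X, synthetic])
    y_aug = np.concatenate([y, np.full(needed, minority_label)])
    return X_aug, y_aug
```

This kind of synthetic data stays close to the real distribution by construction, which is why it tends to be safe in narrow supervised settings, unlike “enhanced” upscaling that invents detail the source never contained.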

Data Monopolies and Detection

  • Consensus that fresh, genuine human interaction data becomes more valuable as the web fills with AI text.
  • Large platforms with deep tracking and engagement signals are seen as having a major advantage.
  • Some call for dedicated detectors and provenance systems, but others expect an ongoing arms race with no perfect solution.