The Curse of Recursion: Training on generated data makes models forget (2023)
Nature of Synthetic vs Real Data
- Many argue the core issue isn’t “synthetic” per se but low‑quality, lossy, self‑generated data.
- Fiction (e.g., novels) is defended as real data about language and culture, not “simulated” worlds.
- Others insist that correctly generated synthetic data can be useful, e.g., game self‑play, simulations, or CGI images, but only if grounded in real distributions.
Information Loss, Entropy, and Feedback Loops
- Several comments frame recursive training as repeated application of a lossy, non‑invertible function, inevitably degrading information.
- Commenters invoke the data processing inequality and entropy: you can’t “cheat” physics, and repeated compression or compression‑like transforms cause drift toward noise or blandness.
- Counterpoint: lossy transforms can sometimes help (denoising, structure extraction), so “loss = worse” isn’t universally true.
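
The data processing inequality mentioned in these comments can be stated compactly. The statement below assumes an idealized setting that the thread does not spell out: each model generation M_k sees the real data X only through the previous generation, so the sequence forms a Markov chain. The symbols X and M_k are introduced here for illustration.

```latex
X \to M_1 \to M_2 \to \dots \to M_n
\quad\Longrightarrow\quad
I(X; M_n) \le I(X; M_{n-1}) \le \dots \le I(X; M_1)
```

Here I(·;·) is mutual information. The inequality bounds only what later generations can retain about X. It does not say that every lossy step hurts a downstream task, which is consistent with the counterpoint about denoising, and it no longer applies in this form once fresh real data re-enters the chain.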
Model Collapse and Mitigations
- Broad agreement: fully replacing real data with model outputs at each generation leads to “model collapse” and degradation.
- Follow‑up research is cited: if synthetic generations are accumulated alongside the original real data rather than replacing it, collapse is avoided and error stays bounded.
- Some suggest quality filters, human feedback, and metadata (scores, links, timelines) can help exclude junk outputs from future training.
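
The difference between replacing and accumulating data can be illustrated with a toy simulation. This is a minimal sketch, not the setup used in the paper or the follow-up research: the "model" is assumed to be a one-dimensional Gaussian fitted by maximum likelihood, and the names `fit_and_sample` and `final_spread` are hypothetical.

```python
import random
import statistics


def fit_and_sample(data, n, rng):
    """Fit a Gaussian to `data` (maximum likelihood) and draw n samples from it."""
    mu = statistics.fmean(data)
    sigma = statistics.pstdev(data)
    return [rng.gauss(mu, sigma) for _ in range(n)]


def final_spread(generations, n, accumulate, seed=0):
    """Return the standard deviation of the training pool after repeated retraining.

    accumulate=False: each generation trains only on the previous model's output.
    accumulate=True:  each generation's output is added to a growing pool that
                      still contains the original real data.
    """
    rng = random.Random(seed)
    # Stand-in for "real" data: samples from a standard normal (toy assumption).
    pool = [rng.gauss(0.0, 1.0) for _ in range(n)]
    for _ in range(generations):
        synthetic = fit_and_sample(pool, n, rng)
        pool = pool + synthetic if accumulate else synthetic
    return statistics.pstdev(pool)


if __name__ == "__main__":
    print("replace:   ", final_spread(generations=300, n=20, accumulate=False))
    print("accumulate:", final_spread(generations=300, n=20, accumulate=True))
```

In the replace setting the fitted spread shrinks generation after generation, so the tails of the original distribution disappear. In the accumulate setting the real samples stay in the pool and the spread remains close to its starting value. Real language models are far more complex, so this only shows the qualitative effect.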
Human Learning Analogies and Limits
- Debate over whether humans are “immune” to this kind of degradation: most say no, arguing that science progresses only by adding new experiments (new data) and discarding errors.
- Comparison: repeated human teaching works because it’s grounded in a stable external reality; current LLMs lack continuous real‑world interaction.
Use Cases for Synthetic Data
- Synthetic data can work well in narrow, supervised tasks (e.g., balancing labels in classification, distillation from larger to smaller models).
- Concern that over‑reliance on upscaled or hallucinated data (e.g., “enhanced” license plates) introduces false information and serious downstream risk.
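
Label balancing, one of the narrow use cases mentioned above, can be sketched with plain random oversampling. This is the simplest possible variant and an assumption of this example: it duplicates existing minority-class rows rather than generating new ones, whereas interpolation-based methods create genuinely synthetic points. The function names `oversample` and `class_counts` are hypothetical.

```python
import random
from collections import Counter


def oversample(rows, label_of, seed=0):
    """Duplicate minority-class rows at random until every class matches the largest."""
    rng = random.Random(seed)
    by_label = {}
    for row in rows:
        by_label.setdefault(label_of(row), []).append(row)
    target = max(len(group) for group in by_label.values())
    balanced = []
    for group in by_label.values():
        balanced.extend(group)
        # Draw the missing rows with replacement from the same class.
        balanced.extend(rng.choices(group, k=target - len(group)))
    return balanced


def class_counts(rows, label_of):
    """Count how many rows carry each label."""
    return Counter(label_of(row) for row in rows)


if __name__ == "__main__":
    data = [("text a", 0)] * 8 + [("text b", 1)] * 2
    balanced = oversample(data, label_of=lambda row: row[1])
    print(class_counts(balanced, label_of=lambda row: row[1]))
```

Because every added row is a copy of a real, labeled example, this kind of augmentation stays grounded in the original distribution, unlike the "enhanced" license plate case, where invented detail is presented as fact.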
Data Monopolies and Detection
- Consensus that fresh, genuine human interaction data becomes more valuable as the web fills with AI text.
- Large platforms with deep tracking and engagement signals are seen as having a major advantage.
- Some call for dedicated detectors and provenance systems, but others expect an ongoing arms race with no perfect solution.