The Curse of Recursion: Training on generated data makes models forget (2023)

Nature of Synthetic vs Real Data

  • Many argue the core issue isn’t “synthetic” per se but low‑quality, lossy, self‑generated data.
  • Fiction (e.g., novels) is defended as real data about language and culture, not “simulated” worlds.
  • Others insist that correctly generated synthetic data can be useful, e.g., game self‑play, simulations, or CGI images, but only if grounded in real distributions.

Information Loss, Entropy, and Feedback Loops

  • Several comments frame recursive training as repeated application of a lossy, non‑invertible function, inevitably degrading information.
  • References to the data processing inequality and entropy: you can’t “cheat” physics; repeatedly applying lossy, compression‑like transforms causes drift toward noise or blandness.
  • Counterpoint: lossy transforms can sometimes help (denoising, structure extraction), so “loss = worse” isn’t universally true.
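The feedback-loop argument can be illustrated with a standard toy model (a sketch, not the paper's actual experiment): fit a Gaussian to a dataset, replace the dataset with samples from the fit, and repeat. Because each fit keeps only two numbers, information is lost every round, and the estimated variance drifts toward zero over generations. All names and parameter values below are illustrative.

```python
import numpy as np

def generational_fit(n=100, generations=300, seed=0):
    """Repeatedly fit a Gaussian to data, then replace the data with
    samples from the fit -- a toy version of training on model output."""
    rng = np.random.default_rng(seed)
    data = rng.normal(0.0, 1.0, n)           # generation 0: "real" data, variance 1
    for _ in range(generations):
        mu, sigma = data.mean(), data.std()  # lossy fit: keeps only 2 numbers
        data = rng.normal(mu, sigma, n)      # next generation sees only model samples
    return data.var()

print(f"variance after 300 generations: {generational_fit():.3f}")
```

Each round multiplies the expected variance by roughly (n − 1)/n, so the distribution slowly narrows and tail behavior disappears first, which matches the “drift toward blandness” framing above.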

Model Collapse and Mitigations

  • Broad agreement: purely replacing real data with model outputs leads to “model collapse” and degradation.
  • Follow‑up research is cited: if synthetic generations are accumulated alongside the original real data rather than replacing it, collapse is avoided and error stays bounded.
  • Some suggest quality filters, human feedback, and metadata (scores, links, timelines) can help exclude junk outputs from future training.
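The replace-versus-accumulate distinction can be sketched in the same toy Gaussian loop (an illustration of the cited finding, not a reproduction of it; names and parameters are mine): replacing the data each generation collapses the variance, while keeping the original real data in a growing pool keeps it roughly stable.

```python
import numpy as np

def run_chain(accumulate, n=100, generations=300, seed=0):
    """Toy fit-and-resample loop. With accumulate=False each generation
    replaces the data with model samples; with accumulate=True new samples
    are appended to a pool that still contains the original real data."""
    rng = np.random.default_rng(seed)
    pool = rng.normal(0.0, 1.0, n)  # generation 0: real data, variance 1
    for _ in range(generations):
        mu, sigma = pool.mean(), pool.std()
        synthetic = rng.normal(mu, sigma, n)
        pool = np.concatenate([pool, synthetic]) if accumulate else synthetic
    return pool.var()

print(f"replace:    {run_chain(accumulate=False):.3f}")   # collapses toward 0
print(f"accumulate: {run_chain(accumulate=True):.3f}")    # stays near 1
```

Intuitively, accumulation works because each new synthetic batch is an ever-smaller fraction of the pool, so the fit stays anchored to the real generation-0 data.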

Human Learning Analogies and Limits

  • Debate over whether humans are “immune”: most say no, since science progresses by adding new experiments (new data) and discarding errors.
  • Comparison: repeated human teaching works because it’s grounded in a stable external reality; current LLMs lack continuous real‑world interaction.

Use Cases for Synthetic Data

  • Synthetic data can work well in narrow, supervised tasks (e.g., balancing labels in classification, distillation from larger to smaller models).
  • Concern that over‑reliance on upscaled or hallucinated data (e.g., “enhanced” license plates) introduces false information and serious downstream risk.
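The label-balancing use case mentioned above can be sketched as a naive jitter-based oversampler (a simplified cousin of techniques like SMOTE; the function name and parameters here are illustrative, not from any library): duplicate minority-class rows with small Gaussian noise until the class reaches a target size.

```python
import numpy as np

def jitter_oversample(X, y, minority_label, target_count, noise=0.05, seed=0):
    """Naive synthetic augmentation: resample minority-class rows with
    small Gaussian jitter until that class has target_count examples."""
    rng = np.random.default_rng(seed)
    minority = X[y == minority_label]
    needed = target_count - len(minority)
    idx = rng.integers(0, len(minority), needed)            # rows to duplicate
    synthetic = minority[idx] + rng.normal(0.0, noise, (needed, X.shape[1]))
    X_aug = np.vstack([X, synthetic])
    y_aug = np.concatenate([y, np.full(needed, minority_label)])
    return X_aug, y_aug
```

This kind of synthetic data stays close to the real distribution by construction, which is why it tends to be safe in narrow supervised settings, unlike “enhanced” upscaling that invents detail the source never contained.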

Data Monopolies and Detection

  • Consensus that fresh, genuine human interaction data becomes more valuable as the web fills with AI text.
  • Large platforms with deep tracking and engagement signals are seen as having a major advantage.
  • Some call for dedicated detectors and provenance systems, but others expect an ongoing arms race with no perfect solution.