AI models collapse when trained on recursively generated data
Overall intuition about “model collapse”
- Many see the result as intuitive: LLMs are lossy compressors of their training data; recursively training on their own outputs further erodes information, especially in the tails of the distribution.
- Analogies used: photocopies of photocopies, VHS/JPEG re‑saving, echo chambers/navel‑gazing, inbreeding/incest, “breathing your own exhaust.”
- From a control‑theory / Markov‑chain perspective, unconstrained feedback loops are expected to drift and lose diversity or stability.
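The tail-erosion intuition above can be illustrated with a toy sketch. It is not the paper's experiment: the Zipf-like vocabulary, the function name `resample_generations`, and all parameter values are assumptions chosen for illustration. Each generation draws a finite sample from the current distribution and refits by counting, so a token that draws zero samples can never come back.

```python
import random
from collections import Counter


def resample_generations(vocab_weights, sample_size, generations, seed=0):
    """Repeatedly refit a categorical distribution to its own samples.

    Returns the number of distinct tokens still present after each
    generation. All names and parameters here are illustrative.
    """
    rng = random.Random(seed)
    tokens = list(vocab_weights)
    weights = [vocab_weights[t] for t in tokens]
    support_sizes = []
    for _ in range(generations):
        # "Generate" a finite corpus from the current model.
        sample = rng.choices(tokens, weights=weights, k=sample_size)
        # "Train" the next model: the empirical counts become the new weights.
        counts = Counter(sample)
        tokens = list(counts)
        weights = [counts[t] for t in tokens]
        support_sizes.append(len(tokens))
    return support_sizes


# Toy Zipf-like vocabulary: a few common tokens and a long tail of rare ones.
vocab = {f"tok{i}": 1.0 / (i + 1) for i in range(200)}
sizes = resample_generations(vocab, sample_size=500, generations=30)
print("distinct tokens after first and last generation:", sizes[0], sizes[-1])
```

Because each generation samples only from tokens that survived the previous one, the support size in this toy can never grow. How quickly it shrinks depends on the sample size relative to the tail.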
Synthetic data: good vs bad use
- The thread distinguishes “indiscriminate” reuse of model output from deliberate synthetic data generation.
- Synthetic data is already used by major labs (self‑play, RLHF, prover–verifier setups, curated problem sets) and is reported to work when:
  - There is a clear fitness metric or verifier (e.g., math correctness, human raters).
  - Generated data is filtered, edited, or selected by humans or other models.
- Without such external feedback, synthetic data can only rearrange existing information and tends to smooth away rare but important events.
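The "generate, then verify" pattern described above can be sketched as follows. This is a minimal illustration under stated assumptions: `propose_examples` is a hypothetical noisy generator standing in for a model, and `verify` is an exact checker that is only available because the toy task (integer addition) has a computable ground truth. Real pipelines would use a proof checker, unit tests, or human raters instead.

```python
import random


def propose_examples(n, error_rate, rng):
    """Hypothetical noisy generator: emits (a, b, claimed_sum) triples.

    A fraction of roughly `error_rate` of the claimed sums is deliberately wrong.
    """
    examples = []
    for _ in range(n):
        a, b = rng.randint(0, 99), rng.randint(0, 99)
        claimed = a + b
        if rng.random() < error_rate:
            claimed += rng.choice([-2, -1, 1, 2])  # inject a nonzero error
        examples.append((a, b, claimed))
    return examples


def verify(example):
    """External verifier: checks the claim against ground truth."""
    a, b, claimed = example
    return a + b == claimed


def filter_verified(examples):
    """Keep only the examples that pass the verifier."""
    return [ex for ex in examples if verify(ex)]


rng = random.Random(0)
raw = propose_examples(1000, error_rate=0.3, rng=rng)
kept = filter_verified(raw)
print(f"kept {len(kept)} of {len(raw)} generated examples")
```

The verifier is what adds information here: it rejects wrong outputs using knowledge the generator does not reliably have. Without it, the training set would simply inherit the generator's error rate.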
Web scraping and AI contamination
- Commenters worry that future web corpora will be heavily mixed with LLM‑generated text, making “indiscriminate” scraping dangerous.
- Reliable detection of AI content is seen as unsolved; rough filters and “AI detectors” may help at the aggregate level but are imperfect.
- Some argue high‑quality, licensed, and educational data are becoming more important than raw web crawl; others worry AI‑assisted writing will still quietly pollute even “professional” sources.
Critiques of the Nature paper and theory
- Several commenters argue the experimental setup is unrealistic: repeatedly fine‑tuning on a fixed synthetic dataset from the same model looks more like catastrophic forgetting than like the way modern labs use synthetic data.
- Statistical objections: claims that collapse is mathematically inevitable are challenged with counter‑examples (e.g., normal distributions), though there is debate about finite‑sample effects and variance drift.
- Some criticize the publication venue’s ML track record, calling the work more of a warning about naive practices than a deep, general theorem.
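The finite-sample debate over the normal-distribution counter-example can be explored with a small simulation. This is a sketch under assumptions, not a reproduction of any commenter's argument or of the paper: the function name `refit_gaussian`, the sample sizes, and the generation counts are all illustrative. Each generation draws `n_samples` values from the current Gaussian and refits the mean and standard deviation from that sample alone.

```python
import random
import statistics


def refit_gaussian(n_samples, generations, seed=0):
    """Recursively fit a Gaussian to samples drawn from the previous fit.

    Returns a list of (mean, std) pairs, one per generation.
    """
    rng = random.Random(seed)
    mu, sigma = 0.0, 1.0  # the "true" starting distribution
    history = []
    for _ in range(generations):
        draws = [rng.gauss(mu, sigma) for _ in range(n_samples)]
        mu = statistics.fmean(draws)
        sigma = statistics.stdev(draws)  # sample (n - 1) estimator
        history.append((mu, sigma))
    return history


# Small samples: estimation noise compounds from one generation to the next.
small = refit_gaussian(n_samples=10, generations=50)
# Large samples: each refit stays close to the previous parameters.
large = refit_gaussian(n_samples=10_000, generations=20)
print("n=10     final (mean, std):", small[-1])
print("n=10000  final (mean, std):", large[-1])
```

With an infinite sample the fit would reproduce the distribution exactly, which is the counter-example's point. With a finite sample, each refit adds estimation error that the next generation inherits, so the parameters follow a random walk whose step size shrinks as the sample grows.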
Mitigations and open questions
- Proposed mitigations: human‑in‑the‑loop curation, external ground truth, discriminators/verifiers, better quality filters, and maintaining a base of fresh human data.
- Disagreement remains on how serious “model collapse” is in practice: some think frontier labs already control it; others see systemic risks, especially for uncontrolled web‑scale training.
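The "maintain a base of fresh human data" mitigation can be illustrated with a toy Gaussian refitting loop. This is a hedged sketch only: the function name `refit_with_fresh_data`, the `fresh_fraction` parameter, and the choice of a standard normal as the "human" distribution are assumptions made for illustration. It says nothing about how large the effect is for real language models.

```python
import random
import statistics


def refit_with_fresh_data(n_samples, fresh_fraction, generations, seed=0):
    """Refit a Gaussian each generation on a mix of synthetic and fresh data.

    `fresh_fraction` of every training set is drawn from the true N(0, 1)
    distribution; the rest comes from the previous generation's fit.
    Returns the fitted standard deviation after each generation.
    """
    rng = random.Random(seed)
    mu, sigma = 0.0, 1.0
    n_fresh = round(n_samples * fresh_fraction)
    sigmas = []
    for _ in range(generations):
        synthetic = [rng.gauss(mu, sigma) for _ in range(n_samples - n_fresh)]
        fresh = [rng.gauss(0.0, 1.0) for _ in range(n_fresh)]
        pooled = synthetic + fresh
        mu = statistics.fmean(pooled)
        sigma = statistics.stdev(pooled)
        sigmas.append(sigma)
    return sigmas


no_fresh = refit_with_fresh_data(n_samples=20, fresh_fraction=0.0, generations=100)
half_fresh = refit_with_fresh_data(n_samples=20, fresh_fraction=0.5, generations=100)
print("final std, no fresh data:  ", no_fresh[-1])
print("final std, 50% fresh data: ", half_fresh[-1])
```

In this toy, the fresh draws pull the fitted parameters back toward the true distribution every generation, whereas the purely synthetic loop has nothing to correct accumulated estimation error.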