2024-07-24

AI models collapse when trained on recursively generated data

Overall intuition about “model collapse”

Many see the result as intuitive: LLMs are lossy compressors of their training data; recursively training on their own outputs further erodes information, especially in the tails of the distribution.
Analogies used: photocopies of photocopies, VHS/JPEG re‑saving, echo chambers/navel‑gazing, inbreeding/incest, “breathing your own exhaust.”
From a control‑theory / Markov‑chain perspective, unconstrained feedback loops are expected to drift and lose diversity or stability.
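
The tail-erosion intuition can be illustrated with a toy simulation. This is a minimal sketch, not the paper's experiment: the "model" here is just an empirical token distribution that is re-estimated each generation from a finite sample of its own output. The function names, vocabulary size, and sample counts are all illustrative assumptions.

```python
import random
from collections import Counter

def recursive_resample(probs, n_samples, generations, seed=0):
    """Re-estimate a categorical distribution from its own samples, repeatedly.

    Returns one distribution per generation (index 0 is the input).
    """
    rng = random.Random(seed)
    tokens = list(range(len(probs)))
    current = list(probs)
    history = [current]
    for _ in range(generations):
        # "Generate" data from the current model, then "train" on it by counting.
        draws = rng.choices(tokens, weights=current, k=n_samples)
        counts = Counter(draws)
        current = [counts[t] / n_samples for t in tokens]
        history.append(current)
    return history

def support_size(dist):
    """Number of tokens that still have non-zero probability."""
    return sum(1 for p in dist if p > 0)

# Zipf-like starting distribution over a hypothetical 50-token vocabulary.
vocab = 50
weights = [1 / (rank + 1) for rank in range(vocab)]
total = sum(weights)
start = [w / total for w in weights]

history = recursive_resample(start, n_samples=200, generations=30)
sizes = [support_size(d) for d in history]
print(sizes)
```

In this toy setting a token that receives zero samples in one generation can never reappear, so the support size can only stay flat or shrink, and rare tokens are the first to go.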

Synthetic data: good vs bad use

The thread distinguishes “indiscriminate” reuse of model output from deliberate synthetic data generation.
Synthetic data is already used by major labs (self‑play, RLHF, prover–verifier setups, curated problem sets) and is reported to work when:
- There is a clear fitness metric or verifier (e.g., math correctness, human raters).
- Generated data is filtered, edited, or selected by humans or other models.
Without such external feedback, synthetic data can only rearrange existing information and tends to smooth away rare but important events.
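
The "generate, then verify" pattern described above can be sketched as follows. This is a hedged toy example: `noisy_model_answer` is a hypothetical stand-in for a model that sometimes gets addition wrong, and `verify` plays the role of the external checker. Neither corresponds to any lab's actual pipeline.

```python
import random

def noisy_model_answer(a, b, error_rate, rng):
    """Hypothetical 'model': returns a + b, but is off by one some of the time."""
    correct = a + b
    if rng.random() < error_rate:
        return correct + rng.choice([-1, 1])
    return correct

def generate_candidates(n, error_rate=0.3, seed=0):
    """Produce n synthetic (a, b, answer) examples, some of them wrong."""
    rng = random.Random(seed)
    candidates = []
    for _ in range(n):
        a, b = rng.randint(0, 99), rng.randint(0, 99)
        candidates.append((a, b, noisy_model_answer(a, b, error_rate, rng)))
    return candidates

def verify(example):
    """External verifier: checks the answer against ground truth."""
    a, b, answer = example
    return a + b == answer

candidates = generate_candidates(1000)
kept = [ex for ex in candidates if verify(ex)]
print(f"kept {len(kept)} of {len(candidates)} candidates")
```

The verifier injects information the generator does not have (which answers are actually correct), which is the distinction the thread draws between curated synthetic data and indiscriminate reuse.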

Web scraping and AI contamination

Commenters worry that future web corpora will be heavily mixed with LLM‑generated text, making “indiscriminate” scraping dangerous.
Reliable detection of AI content is seen as unsolved; rough filters and “AI detectors” may help at the aggregate level but are imperfect on individual documents.
Some argue high‑quality, licensed, and educational data are becoming more important than raw web crawl; others worry AI‑assisted writing will still quietly pollute even “professional” sources.

Critiques of the Nature paper and theory

Several commenters argue the experimental setup is unrealistic: repeatedly fine‑tuning on a fixed synthetic dataset from the same model resembles catastrophic forgetting rather than the way modern labs use synthetic data.
Statistical objections: claims that collapse is mathematically inevitable are challenged with counter‑examples (e.g., normal distributions), though there is debate about finite‑sample effects and variance drift.
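
The finite‑sample side of this debate can be explored with a small simulation. This is a sketch under stated assumptions, not a reproduction of anyone's argument from the thread: a Gaussian is refit (sample mean and sample standard deviation) to a finite sample drawn from the previous fit. With unlimited samples the refit would reproduce the original parameters, which is the counter‑example; with finite samples each refit adds estimation error, so the parameters follow a random walk. The function name and the sample sizes are illustrative.

```python
import random
import statistics

def recursive_gaussian_fit(n_samples, generations, seed=0):
    """Refit N(mu, sigma) to n_samples drawn from the previous generation's fit.

    Returns a list of (mu, sigma) pairs, starting from the true N(0, 1).
    """
    rng = random.Random(seed)
    mu, sigma = 0.0, 1.0
    history = [(mu, sigma)]
    for _ in range(generations):
        samples = [rng.gauss(mu, sigma) for _ in range(n_samples)]
        mu = statistics.fmean(samples)
        sigma = statistics.stdev(samples)  # sample std (n - 1 denominator)
        history.append((mu, sigma))
    return history

history = recursive_gaussian_fit(n_samples=20, generations=200)
for gen in (0, 50, 100, 200):
    mu, sigma = history[gen]
    print(f"generation {gen:3d}: mu={mu:+.4f} sigma={sigma:.4f}")
```

Varying `n_samples` and `seed` shows how strongly the drift depends on sample size, which is the point under dispute: how quickly sigma moves in any single run is not something this sketch settles.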
Some criticize the publication venue’s ML track record, calling the work more of a warning about naive practices than a deep, general theorem.

Mitigations and open questions

Proposed mitigations: human‑in‑the‑loop curation, external ground truth, discriminators/verifiers, better quality filters, and maintaining a base of fresh human data.
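
The last mitigation, keeping a base of fresh human data, can be sketched in a toy setting. This is an illustrative assumption‑laden example: the "model" is an empirical token distribution, `true_probs` stands in for human data, and `real_fraction` is a hypothetical knob for how much of each generation's training set comes from that source rather than from the previous model.

```python
import random
from collections import Counter

def train_generation(true_probs, model_probs, n_samples, real_fraction, rng):
    """Build one training set mixing real and synthetic draws, then re-estimate."""
    tokens = list(range(len(true_probs)))
    n_real = round(n_samples * real_fraction)
    n_synth = n_samples - n_real
    draws = []
    if n_real:
        draws += rng.choices(tokens, weights=true_probs, k=n_real)
    if n_synth:
        draws += rng.choices(tokens, weights=model_probs, k=n_synth)
    counts = Counter(draws)
    return [counts[t] / n_samples for t in tokens]

def run(true_probs, n_samples, generations, real_fraction, seed=0):
    """Return the model distribution after the given number of generations."""
    rng = random.Random(seed)
    model = list(true_probs)
    for _ in range(generations):
        model = train_generation(true_probs, model, n_samples, real_fraction, rng)
    return model

# Zipf-like "human" distribution over a hypothetical 50-token vocabulary.
weights = [1 / (rank + 1) for rank in range(50)]
true_probs = [w / sum(weights) for w in weights]

final_synthetic = run(true_probs, n_samples=200, generations=30, real_fraction=0.0)
final_mixed = run(true_probs, n_samples=200, generations=30, real_fraction=0.2)
print("tokens surviving, synthetic only:", sum(p > 0 for p in final_synthetic))
print("tokens surviving, 20% real data: ", sum(p > 0 for p in final_mixed))
```

With `real_fraction=0.0` a lost token can never return, whereas any positive fraction of real data gives rare tokens a chance to be reintroduced each generation. How much real data is enough in practice is exactly the open question the thread leaves unresolved.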
Disagreement remains on how serious “model collapse” is in practice: some think frontier labs already control it; others see systemic risks, especially for uncontrolled web‑scale training.
