AI models collapse when trained on recursively generated data

Overall intuition about “model collapse”

  • Many see the result as intuitive: LLMs are lossy compressors of their training data; recursively training on their own outputs further erodes information, especially in the tails of the distribution.
  • Analogies used: photocopies of photocopies, VHS/JPEG re‑saving, echo chambers/navel‑gazing, inbreeding/incest, “breathing your own exhaust.”
  • From a control‑theory / Markov‑chain perspective, unconstrained feedback loops are expected to drift and lose diversity or stability.
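
The tail-erosion intuition above can be sketched with a toy resampling loop. This is only an illustration under stated assumptions, not the paper's setup: the "model" is just an empirical frequency table, and the function name `resample_generations` and the three-token vocabulary are hypothetical. Each generation draws a finite sample from the previous generation's estimate and re-estimates from it, so a token whose count hits zero can never come back.

```python
import random
from collections import Counter


def resample_generations(probs, n_samples, n_generations, seed=0):
    """Repeatedly re-estimate a categorical distribution from its own samples.

    Hypothetical toy model: `probs` maps token -> probability. Returns the
    list of estimated distributions, starting with the original one.
    """
    rng = random.Random(seed)
    current = dict(probs)
    history = [current]
    for _ in range(n_generations):
        tokens = list(current)
        weights = [current[t] for t in tokens]
        # Finite sample drawn from the previous generation's estimate.
        draws = rng.choices(tokens, weights=weights, k=n_samples)
        counts = Counter(draws)
        # Tokens that were never drawn vanish from the next generation.
        current = {t: c / n_samples for t, c in counts.items()}
        history.append(current)
    return history


if __name__ == "__main__":
    start = {"common": 0.90, "uncommon": 0.09, "rare": 0.01}
    for gen, dist in enumerate(resample_generations(start, 50, 30)):
        print(gen, sorted(dist))
```

The set of surviving tokens can only shrink from one generation to the next, which is the absorbing-state behaviour the Markov-chain framing points at. How quickly that happens depends on the sample size and the seed.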

Synthetic data: good vs bad use

  • Thread distinguishes “indiscriminate” reuse of model output from deliberate synthetic data generation.
  • Synthetic data is already used by major labs (self‑play, RLHF, prover–verifier setups, curated problem sets) and is reported to work when:
    • There is a clear fitness metric or verifier (e.g., math correctness, human raters).
    • Generated data is filtered, edited, or selected by humans or other models.
  • Without such external feedback, commenters argue, synthetic data can only rearrange information the model already has, and it tends to smooth away rare but important events.
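
The difference between indiscriminate and verifier-filtered synthetic data can be sketched with a toy example. Everything here is an assumption made for illustration: the "generator" produces addition problems with a configurable error rate, and the "verifier" simply recomputes the sum. Real verifiers (proof checkers, unit tests, human raters) are far more involved.

```python
import random


def noisy_generator(rng, error_rate):
    """Hypothetical generator: an addition problem with a sometimes-wrong answer."""
    a, b = rng.randint(0, 99), rng.randint(0, 99)
    answer = a + b
    if rng.random() < error_rate:
        answer += rng.choice([-2, -1, 1, 2])  # inject a plausible-looking error
    return a, b, answer


def verifier(sample):
    """External check that does not depend on the generator's own beliefs."""
    a, b, answer = sample
    return a + b == answer


def build_dataset(n, error_rate, use_verifier, seed=0):
    """Generate n samples, optionally keeping only those the verifier accepts."""
    rng = random.Random(seed)
    samples = [noisy_generator(rng, error_rate) for _ in range(n)]
    if use_verifier:
        samples = [s for s in samples if verifier(s)]
    return samples


if __name__ == "__main__":
    raw = build_dataset(1000, error_rate=0.2, use_verifier=False)
    kept = build_dataset(1000, error_rate=0.2, use_verifier=True)
    print(len(raw), len(kept))
```

The filtered set is smaller but contains no wrong answers, so errors are not fed back into the next round of training. The unfiltered set passes them along.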

Web scraping and AI contamination

  • Concern that future web corpora will be heavily mixed with LLM‑generated text, making “indiscriminate” scraping dangerous.
  • Reliably detecting AI content is seen as unsolved; rough filters and “AI detectors” may help at the aggregate level but are imperfect.
  • Some argue high‑quality, licensed, and educational data are becoming more important than raw web crawl; others worry AI‑assisted writing will still quietly pollute even “professional” sources.

Critiques of the Nature paper and theory

  • Several commenters argue the experimental setup is unrealistic: repeatedly fine‑tuning on a fixed synthetic dataset from the same model looks more like catastrophic forgetting than like the way modern labs actually use synthetic data.
  • Statistical objections: claims that collapse is mathematically inevitable are challenged with counter‑examples (e.g., normal distributions), though there is debate about finite‑sample effects and variance drift.
  • Some criticize the publication venue’s ML track record, calling the work more of a warning about naive practices than a deep, general theorem.
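
The finite-sample side of the normal-distribution debate can be explored with a small simulation. This is a sketch under stated assumptions, not the paper's experiment: each generation fits a Gaussian to n samples drawn from the previous fit, using the maximum-likelihood (divide-by-n) standard deviation. For that estimator the expected variance shrinks by a factor of (n − 1)/n per generation, which `expected_variance_ratio` computes. The function names are hypothetical.

```python
import random
import statistics


def recursive_gaussian_fit(n_samples, n_generations, seed=0):
    """Refit a Gaussian to its own samples; return the std after each generation."""
    rng = random.Random(seed)
    mu, sigma = 0.0, 1.0  # assumed starting distribution
    stds = [sigma]
    for _ in range(n_generations):
        data = [rng.gauss(mu, sigma) for _ in range(n_samples)]
        mu = statistics.fmean(data)
        sigma = statistics.pstdev(data)  # maximum-likelihood estimate (divides by n)
        stds.append(sigma)
    return stds


def expected_variance_ratio(n_samples, n_generations):
    """Expected variance relative to the start, for the divide-by-n estimator."""
    return ((n_samples - 1) / n_samples) ** n_generations


if __name__ == "__main__":
    for n in (10, 100, 10_000):
        stds = recursive_gaussian_fit(n, 200, seed=0)
        print(n, stds[-1], expected_variance_ratio(n, 200) ** 0.5)
```

With large n the per-generation shrink factor is close to 1, which is consistent with both positions in the thread: the drift exists for any finite sample, but its practical size depends heavily on the sample size and the number of generations.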

Mitigations and open questions

  • Proposed mitigations: human‑in‑the‑loop curation, external ground truth, discriminators/verifiers, better quality filters, and maintaining a base of fresh human data.
  • Disagreement remains on how serious “model collapse” is in practice: some think frontier labs already control it; others see systemic risks, especially for uncontrolled web‑scale training.
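
One of the proposed mitigations, keeping a base of fresh human data in every training round, can be sketched as follows. This is a toy model under assumptions: the "model" is an empirical frequency table over a hypothetical three-token vocabulary, and `fresh_fraction` is an illustrative parameter controlling how much of each round's data comes from the true distribution rather than from the previous model.

```python
import random
from collections import Counter


def train_with_fresh_data(true_probs, n_samples, n_generations,
                          fresh_fraction, seed=0):
    """Re-estimate a categorical model each round from mixed real/synthetic data."""
    rng = random.Random(seed)
    tokens = list(true_probs)
    true_weights = [true_probs[t] for t in tokens]
    model = dict(true_probs)
    n_fresh = round(n_samples * fresh_fraction)
    n_synth = n_samples - n_fresh
    for _ in range(n_generations):
        model_weights = [model[t] for t in tokens]
        # Fresh draws come from the true distribution, the rest from the model.
        draws = rng.choices(tokens, weights=true_weights, k=n_fresh)
        draws += rng.choices(tokens, weights=model_weights, k=n_synth)
        counts = Counter(draws)
        model = {t: counts[t] / n_samples for t in tokens}
    return model


if __name__ == "__main__":
    truth = {"common": 0.90, "uncommon": 0.09, "rare": 0.01}
    for fraction in (0.0, 0.2, 1.0):
        print(fraction, train_with_fresh_data(truth, 50, 30, fraction))
```

With `fresh_fraction` at 0.0 a token that drops to zero probability stays there. With any positive fraction it can reappear in a later round, because the fresh draws still come from the true distribution. How much fresh data is enough in practice is exactly the open question the thread disagrees on.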