The current state of the theory that GPL propagates to AI models
License vs. copyright, and the role of fair use
- Much of the debate is framed as copyright, not contract: if training is fair use, license terms (GPL, MIT, proprietary) may not bite at all.
- Some argue that in the US, training on legally obtained public material is already treated as fair use, making license type irrelevant.
- Others push back: fair use is US‑specific, limited or absent elsewhere, and not clearly settled for LLMs; litigation is ongoing and outcomes may diverge by domain and jurisdiction.
GPL enforceability and “virality”
- Commenters distinguish enforcing the GPL on GPL code itself (well‑tested) from enforcing its “propagation” to larger combined works (much less tested).
- Several note that the GPL doesn’t magically relicense other code; it simply withholds permission to copy and distribute GPL code unless its conditions are met.
- Enforcement history (BusyBox, FSF v. Cisco, French court judgments) is cited as evidence of the GPL’s robustness, but mostly for straightforward distribution violations, not exotic propagation theories.
Does GPL propagate to models or outputs?
- Many doubt that models trained on GPL code become GPL themselves, or that all outputs inherit GPL terms; that’s seen as an extreme, legally unsupported position.
- Others argue that if a model can reproduce GPL’d code (or large chunks of copyrighted text) verbatim on demand, that looks like copying, not mere “learning” (a detection sketch follows this list).
- Analogy disputes: some equate training to humans learning from code; others stress that LLMs are stored, redistributable artifacts, unlike human brains.
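The “reproduction on demand” claim is testable in principle: scan model output for long verbatim overlaps with a known corpus of GPL code. Below is a minimal sketch of such a check; the `find_verbatim_overlaps` helper, the window size, and the sample inputs are illustrative assumptions, not an established tool.

```python
def ngrams(tokens, n):
    """Yield every contiguous n-token window of a token list."""
    for i in range(len(tokens) - n + 1):
        yield tuple(tokens[i:i + n])

def find_verbatim_overlaps(model_output, gpl_corpus, n=20):
    """Return windows of model_output that appear verbatim in gpl_corpus.

    Long shared windows (tens of tokens of code) suggest copying
    rather than independently generated text.
    """
    corpus_grams = set()
    for doc in gpl_corpus:
        corpus_grams.update(ngrams(doc.split(), n))
    return [" ".join(g) for g in ngrams(model_output.split(), n)
            if g in corpus_grams]

# Hypothetical usage: in practice gpl_corpus would hold real licensed files.
gpl_corpus = ['int main ( void ) { puts ( "hello" ) ; return 0 ; }']
output = 'here is a program : int main ( void ) { puts ( "hello" ) ; return 0 ; }'
print(find_verbatim_overlaps(output, gpl_corpus, n=8))
```

Whitespace tokenization is crude for source code; a real analysis would normalize identifiers and formatting, since trivial renaming defeats exact matching.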
New license ideas and free‑software tensions
- Proposals include licenses that forbid AI training entirely, or allow it only if resulting models and weights are open.
- Critics say such clauses would violate “freedom 0” (the freedom to run the program for any purpose) and likely render the license non‑free; under GPLv3 they might also count as prohibited “further restrictions.”
- Others suspect courts would treat anti‑training clauses as void where training is fair use, or require contract‑style click‑through instead of pure copyright licenses.
Proof, training data, and “copyright laundering”
- A recurring concern: models act as “copyright‑laundering machines” – mining open and copyleft code into proprietary services with little traceability.
- People ask how to prove a model used GPL/AGPL data, and conversely how to prove that particular outputs are clean.
- Suggested mechanisms: discovery in litigation, training‑data disclosure mandates, model inversion / extraction research, or requiring published datasets (one possible shape of disclosure is sketched below).
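As a concrete illustration of the disclosure idea, a trainer could publish compact fingerprints of the training set instead of the raw data, letting a rights‑holder later test whether their file was ingested. A minimal sketch, assuming SHA‑256 over windows of normalized source lines; the shingle size and threshold are illustrative choices, not a proposed standard.

```python
import hashlib

def fingerprint(source_text, shingle=5):
    """Hash every `shingle`-line window of normalized source text.

    Publishing these digests (not the code itself) would let a
    rights-holder ask "was my file in the training set?" without
    the trainer revealing the corpus.
    """
    lines = [ln.strip() for ln in source_text.splitlines() if ln.strip()]
    if len(lines) < shingle:
        windows = ["\n".join(lines)] if lines else []
    else:
        windows = ["\n".join(lines[i:i + shingle])
                   for i in range(len(lines) - shingle + 1)]
    return {hashlib.sha256(w.encode()).hexdigest() for w in windows}

def likely_ingested(my_file_text, published_digests, threshold=0.5):
    """True if a large fraction of the file's shingles match the published set."""
    mine = fingerprint(my_file_text)
    return bool(mine) and len(mine & published_digests) / len(mine) >= threshold
```

Hash‑based fingerprints only catch near‑verbatim inclusion; they say nothing about paraphrased or transformed training data, which is exactly where the legal questions get hard.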
Policy, reform, and community reaction
- Some want legislative clarification or shorter copyright terms plus opt‑in public datasets with royalties.
- Others distrust new legislation, pointing to DMCA‑style regulatory capture by large firms, and prefer letting courts refine fair‑use boundaries.
- There is visible disillusionment: some have stopped contributing to OSS, feeling their licenses are simply ignored; others embrace LLMs as transformative productivity tools, deepening the values split inside the developer community.