Core copyright violation moves ahead in The Intercept's lawsuit against OpenAI
Legal issues in this case
- The only claim moving forward is a DMCA claim under 17 U.S.C. §1202, covering removal of copyright management information (CMI) such as titles and bylines; it is not a “core” infringement claim over the training itself.
- Commenters note §1202 can carry severe remedies: statutory damages per violation and even impoundment of equipment and destruction of infringing copies/models, though many doubt a court would go that far against a large company.
- Several comments clear up confusion about U.S. copyright basics:
- Copyright arises automatically at creation.
- Registration is required before filing an infringement suit, and timely registration (before the infringement began, or within three months of publication) is required for statutory damages and attorney’s fees.
- The DMCA CMI claim avoids the registration requirement and has its own statutory damages.
- Some see the CMI claim as a “proxy” to get courts to confront training on copyrighted material without having to win the broader fair‑use fight immediately.
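The severity commenters point to comes from §1203(c)(3)(B), which sets statutory damages of $2,500 to $25,000 per §1202 violation. A back-of-the-envelope sketch of how that scales; the article count below is purely hypothetical for illustration:

```python
# Statutory damages range under 17 U.S.C. §1203(c)(3)(B):
# $2,500 to $25,000 per CMI violation.
PER_VIOLATION_MIN = 2_500
PER_VIOLATION_MAX = 25_000

def damages_range(num_violations: int) -> tuple[int, int]:
    """Return the (minimum, maximum) statutory damages for a violation count."""
    return (num_violations * PER_VIOLATION_MIN,
            num_violations * PER_VIOLATION_MAX)

# Hypothetical: 1,000 articles with CMI stripped.
low, high = damages_range(1_000)
print(f"${low:,} to ${high:,}")  # → $2,500,000 to $25,000,000
```

Even a modest number of articles multiplies quickly, which is why some commenters see the per-violation structure as the real leverage in the case.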
Training on copyrighted data vs fair use
- One camp: training is akin to human learning; models transform huge corpora into non‑literal, probabilistic representations, so training should be legal, with infringement only when outputs closely reproduce protected works.
- Opposing camp: copying works into training sets is itself unauthorized reproduction; model weights may be compressed copies; verbatim or near‑verbatim regurgitation and derivative outputs should trigger liability.
- Analogies invoked: Google Books, Google Translate, Internet Archive, sampling in music, and clean‑room reverse engineering. Disagreement over whether these support or undermine OpenAI’s position.
- Some argue stripping metadata looks like intent to hide infringement; others suspect it was a neutral preprocessing choice.
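The “neutral preprocessing” argument is that a generic HTML-to-text extraction step can drop bylines and headlines as a side effect of boilerplate removal, without ever targeting CMI as such. A minimal sketch of that idea, using only the standard library; the markup, class names, and pipeline are hypothetical, not OpenAI’s actual preprocessing:

```python
from html.parser import HTMLParser

# Hypothetical article markup; the class names are illustrative only.
HTML = """
<article>
  <h1 class="headline">Example Headline</h1>
  <p class="byline">By Jane Reporter</p>
  <p>First paragraph of body text.</p>
  <p>Second paragraph of body text.</p>
</article>
"""

class BodyTextExtractor(HTMLParser):
    """Keeps body text while skipping elements whose class marks them as
    metadata (headline, byline). A pipeline like this removes CMI as a
    side effect of cleaning, without explicitly singling it out."""
    SKIP_CLASSES = {"headline", "byline"}

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skipping = False

    def handle_starttag(self, tag, attrs):
        classes = set((dict(attrs).get("class") or "").split())
        if classes & self.SKIP_CLASSES:
            self._skipping = True

    def handle_endtag(self, tag):
        self._skipping = False

    def handle_data(self, data):
        text = data.strip()
        if text and not self._skipping:
            self.chunks.append(text)

parser = BodyTextExtractor()
parser.feed(HTML)
print("\n".join(parser.chunks))  # body paragraphs only; title and byline gone
```

Whether a court treats this as innocent cleaning or as knowing CMI removal under §1202 is exactly the dispute the commenters are circling.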
Ethical and economic arguments
- Many see OpenAI’s scraping as exploitative: creators are uncompensated, attribution is stripped, and the resulting models are proprietary.
- Others stress user benefits (productivity, education, accessibility) and fear over‑broad copyright enforcement could “neuter” models.
- There’s a strong split between “information should be free” / hacker‑ethic views and those emphasizing creators’ moral rights, livelihoods, and incentives to produce high‑quality work.
- Concern that if AI makes journalism and other creative work less viable, overall content quality and investigative reporting may decline.
Competition, geopolitics, and reform ideas
- Some worry stricter U.S. rules will let China or other jurisdictions race ahead by training on everything; others reply that legality and ethics should not be sacrificed to geopolitical fears.
- Suggestions include: models trained only on public‑domain/licensed data (as with some stock‑image AIs), compulsory licensing schemes or revenue taxes to fund creators, much shorter copyright terms, or even abolishing copyright entirely.
- Several predict OpenAI will keep settling cases to avoid setting precedent; others argue a definitive test case is inevitable sooner or later.