Core copyright violation moves ahead in The Intercept's lawsuit against OpenAI
Legal issues in this case
- The only claim moving forward is a DMCA claim under 17 U.S.C. §1202, covering removal of copyright management information (CMI) such as titles and bylines; it is not a “core” infringement claim over the training itself.
- Commenters note §1202 can carry severe remedies: statutory damages per violation and even impoundment of equipment and destruction of infringing copies/models, though many doubt a court would go that far against a large company.
- Several comments clear up confusion about U.S. copyright basics:
- Copyright arises automatically at creation.
- Registration is required before filing an infringement suit, and timely registration (before the infringement began, or within three months of publication) is required for statutory damages and attorney’s fees.
- The DMCA CMI claim avoids the registration requirement and has its own statutory damages.
- Some see the CMI claim as a “proxy” to get courts to confront training on copyrighted material without having to win the broader fair‑use fight immediately.
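The severity commenters point to comes from §1203(c)(3)(B), which sets statutory damages of $2,500 to $25,000 per §1202 violation. A back-of-the-envelope sketch of how that scales; the article count below is purely hypothetical for illustration:

```python
# Statutory damages range under 17 U.S.C. §1203(c)(3)(B):
# $2,500 to $25,000 per CMI violation.
PER_VIOLATION_MIN = 2_500
PER_VIOLATION_MAX = 25_000

def damages_range(num_violations: int) -> tuple[int, int]:
    """Return the (minimum, maximum) statutory damages for a violation count."""
    return (num_violations * PER_VIOLATION_MIN,
            num_violations * PER_VIOLATION_MAX)

# Hypothetical: 1,000 articles with CMI stripped.
low, high = damages_range(1_000)
print(f"${low:,} to ${high:,}")  # → $2,500,000 to $25,000,000
```

Even a modest number of articles multiplies quickly, which is why some commenters see the per-violation structure as the real leverage in the case.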
Training on copyrighted data vs fair use
- One camp: training is akin to human learning; models transform huge corpora into non‑literal, probabilistic representations, so training should be legal, with infringement only when outputs closely reproduce protected works.
- Opposing camp: copying works into training sets is itself unauthorized reproduction; model weights may be compressed copies; verbatim or near‑verbatim regurgitation and derivative outputs should trigger liability.
- Analogies invoked: Google Books, Google Translate, Internet Archive, sampling in music, and clean‑room reverse engineering. Disagreement over whether these support or undermine OpenAI’s position.
- Some argue stripping metadata looks like intent to hide infringement; others suspect it was a neutral preprocessing choice.
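The “neutral preprocessing” argument is that a generic HTML-to-text extraction step can drop bylines and headlines as a side effect of boilerplate removal, without ever targeting CMI as such. A minimal sketch of that idea, using only the standard library; the markup, class names, and pipeline are hypothetical, not OpenAI’s actual preprocessing:

```python
from html.parser import HTMLParser

# Hypothetical article markup; the class names are illustrative only.
HTML = """
<article>
  <h1 class="headline">Example Headline</h1>
  <p class="byline">By Jane Reporter</p>
  <p>First paragraph of body text.</p>
  <p>Second paragraph of body text.</p>
</article>
"""

class BodyTextExtractor(HTMLParser):
    """Keeps body text while skipping elements whose class marks them as
    metadata (headline, byline). A pipeline like this removes CMI as a
    side effect of cleaning, without explicitly singling it out."""
    SKIP_CLASSES = {"headline", "byline"}

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skipping = False

    def handle_starttag(self, tag, attrs):
        classes = set((dict(attrs).get("class") or "").split())
        if classes & self.SKIP_CLASSES:
            self._skipping = True

    def handle_endtag(self, tag):
        self._skipping = False

    def handle_data(self, data):
        text = data.strip()
        if text and not self._skipping:
            self.chunks.append(text)

parser = BodyTextExtractor()
parser.feed(HTML)
print("\n".join(parser.chunks))  # body paragraphs only; title and byline gone
```

Whether a court treats this as innocent cleaning or as knowing CMI removal under §1202 is exactly the dispute the commenters are circling.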
Ethical and economic arguments
- Many see OpenAI’s scraping as exploitative: creators are uncompensated, attribution is stripped, and the resulting models are proprietary.
- Others stress user benefits (productivity, education, accessibility) and fear over‑broad copyright enforcement could “neuter” models.
- There’s a strong split between “information should be free” / hacker‑ethic views and those emphasizing creators’ moral rights, livelihoods, and incentives to produce high‑quality work.
- Concern that if AI makes journalism and other creative work less viable, overall content quality and investigative reporting may decline.
Competition, geopolitics, and reform ideas
- Some worry stricter U.S. rules will let China or other jurisdictions race ahead by training on everything; others reply that legality and ethics should not be sacrificed to geopolitical fears.
- Suggestions include: models trained only on public‑domain/licensed data (as with some stock‑image AIs), compulsory licensing schemes or revenue taxes to fund creators, much shorter copyright terms, or even abolishing copyright entirely.
- Several predict OpenAI will keep settling cases to avoid setting precedent; others argue a definitive test case is inevitable sooner or later.