Core copyright violation moves ahead in The Intercept's lawsuit against OpenAI

Legal issues in this case

  • The only claim moving forward arises under 17 U.S.C. §1202 (part of the DMCA), which prohibits removing copyright management information (CMI) such as titles and bylines — not a “core” infringement claim over the training itself.
  • Commenters note §1202 can carry severe remedies: statutory damages (set by 17 U.S.C. §1203 at $2,500 to $25,000 per violation) and even impoundment of equipment and destruction of infringing copies or models, though many doubt a court would go that far against a large company.
  • Commenters clarify several points of confusion about U.S. copyright:
    • Copyright arises automatically at creation.
    • Registration is required before filing an infringement suit, and timely registration is a prerequisite for statutory damages and attorney’s fees.
    • The DMCA CMI claim avoids the registration requirement and has its own statutory damages.
  • Some see the CMI claim as a “proxy” to get courts to confront training on copyrighted material without having to win the broader fair‑use fight immediately.

Training on copyrighted data vs fair use

  • One camp: training is akin to human learning; models transform huge corpora into non‑literal, probabilistic representations, so training should be legal, with infringement only when outputs closely reproduce protected works.
  • Opposing camp: copying works into training sets is itself unauthorized reproduction; model weights may be compressed copies; verbatim or near‑verbatim regurgitation and derivative outputs should trigger liability.
  • Analogies invoked: Google Books, Google Translate, Internet Archive, sampling in music, and clean‑room reverse engineering. Disagreement over whether these support or undermine OpenAI’s position.
  • Some argue stripping metadata looks like intent to hide infringement; others suspect it was a neutral preprocessing choice.
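The “neutral preprocessing” theory can be illustrated with a minimal sketch (hypothetical — this is not OpenAI’s pipeline, and the class and function names here are invented): an ordinary HTML-to-text extraction step that keeps only paragraph text will discard titles and byline markup as a side effect, with no step that targets CMI specifically.

```python
from html.parser import HTMLParser

class BodyTextExtractor(HTMLParser):
    """Collects text inside <p> tags; everything else -- including the
    <title> element and byline markup that carry CMI -- is dropped."""
    def __init__(self):
        super().__init__()
        self.in_p = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.in_p = True

    def handle_endtag(self, tag):
        if tag == "p":
            self.in_p = False

    def handle_data(self, data):
        if self.in_p:
            self.chunks.append(data.strip())

def extract_body_text(html: str) -> str:
    parser = BodyTextExtractor()
    parser.feed(html)
    return " ".join(c for c in parser.chunks if c)

page = (
    "<html><head><title>Story Title - Author Name</title></head>"
    "<body><span class='byline'>By Author Name</span>"
    "<p>First paragraph of the article.</p>"
    "<p>Second paragraph.</p></body></html>"
)
print(extract_body_text(page))
# -> First paragraph of the article. Second paragraph.
```

The title and byline never reach the output text even though no code ever “strips” them — which is why intent is contested: the same observable result is consistent with both a deliberate removal step and a generic extraction choice.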

Ethical and economic arguments

  • Many see OpenAI’s scraping as exploitative: creators are uncompensated, attribution is stripped, and the resulting models are proprietary.
  • Others stress user benefits (productivity, education, accessibility) and fear over‑broad copyright enforcement could “neuter” models.
  • There’s a strong split between “information should be free” / hacker‑ethic views and those emphasizing creators’ moral rights, livelihoods, and incentives to produce high‑quality work.
  • Concern that if AI makes journalism and other creative work less viable, overall content quality and investigative reporting may decline.

Competition, geopolitics, and reform ideas

  • Some worry stricter U.S. rules will let China or other jurisdictions race ahead by training on everything; others reply that legality and ethics should not be sacrificed to geopolitical fears.
  • Suggestions include: models trained only on public‑domain/licensed data (as with some stock‑image AIs), compulsory licensing schemes or revenue taxes to fund creators, much shorter copyright terms, or even abolishing copyright entirely.
  • Several predict OpenAI will keep settling cases to avoid setting precedent; others argue a definitive test case is inevitable at some point.