Anthropic cut up millions of used books, and downloaded 7M pirated ones – judge

Legal Status of Training on Copyrighted Books

  • Many comments focus on the judge’s finding: scanning owned books and using them to train models was “exceedingly transformative” and fair use, while downloading pirated copies was not.
  • Several point out this follows existing precedents (e.g., Google Books, search indexing): making internal digital copies and using them for a transformative tool can be fair use even when entire works are scanned.
  • Others argue the law is unsettled: only early lower‑court rulings exist, some other cases are less friendly to AI, and a Supreme Court test on LLM training/output is widely expected.

Piracy vs. Fair Use and the Judge’s Ruling

  • Key distinction drawn:
    • (1) Pirating books to build a digital library = clear infringement, possibly criminal at this scale.
    • (2) Scanning legitimately purchased books = fair use.
    • (3) Training on that internal corpus = fair use (in this ruling).
  • Buying a book does not license redistribution; the model is treated as a new, transformative work unless it reproduces “meaningful chunks” verbatim.
  • Some challenge the analogy to human learning and say courts are improperly anthropomorphizing LLMs.

Corporate Power, Double Standards, and Enforcement

  • Strong resentment over asymmetry: individuals have been ruined or jailed for relatively small-scale infringement, whereas a heavily funded AI company may face only manageable civil penalties.
  • Comparisons to past cases (software piracy sentences, Aaron Swartz, RIAA lawsuits) used to illustrate “one law for the rich, another for everyone else.”
  • Others note that statutory damages exist (up to $150,000 per work for willful infringement) but are rarely awarded at their full theoretical scale; settlements and transaction costs dominate in practice.
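To see why full statutory damages are purely theoretical at this scale, a back-of-envelope calculation (the ~7M figure and the per-work cap come from the discussion above; treating every pirated book as a separately registered work is an assumption for illustration only):

```python
# Illustrative upper bound on statutory damages, NOT a prediction.
# Assumes ~7M pirated books, each counted as one registered work,
# at the $150,000 willful-infringement cap per work (17 U.S.C. § 504(c)).
PIRATED_WORKS = 7_000_000
MAX_STATUTORY_PER_WORK = 150_000

theoretical_max = PIRATED_WORKS * MAX_STATUTORY_PER_WORK
print(f"${theoretical_max:,}")  # $1,050,000,000,000
```

A roughly $1.05 trillion ceiling dwarfs any plausible judgment, which is why commenters expect settlement dynamics, not the statutory maximum, to determine the actual outcome.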

Impact on Authors and Future Creativity

  • One camp: training on unlicensed books and then selling access to models built on them economically undercuts authors (especially mid‑list and lower‑earning ones), discouraging future writing and teaching.
  • Counter‑camp: many authors already earn little; people write mainly from intrinsic motivation; a single author’s marginal contribution to a trillion‑parameter model is negligible.
  • Ongoing tension between “no copyright on knowledge” and protection of specific expressions and markets.

Analogies and Precedents from Other Tech Sectors

  • Commenters cite Spotify, YouTube, Crunchyroll, cloud music lockers, and social platforms as past examples of “pirate first, legalize later” growth strategies; others dispute some of these histories as myths.
  • Search engines are a frequent analogy: they copy everything, store it, and show snippets, which courts found transformative—LLMs are argued to be similar or, by critics, more directly competitive.

Ethical and Philosophical Fault Lines

  • Debate over whether copyright infringement is “stealing,” “theft of service,” or a distinct, lesser category; some argue infringement can be worse than tangible theft, others the opposite.
  • Some hope AI pressure will force radical IP reform and expansion of the public domain; others fear AI giants will secure carve‑outs for themselves while IP remains strict for everyone else.
  • Disagreement on whether it’s morally acceptable to train on anything one has lawfully obtained, vs. a need for explicit licensing and revenue-sharing.

Destruction of Books and Data Transparency

  • Emotional reactions to millions of books being cut up for destructive scanning; some see it as cultural loss, others as acceptable when the copies aren’t rare and the content is digitally preserved.
  • Several argue that the real systemic problem is non‑transparent datasets: without knowing exactly what went into training, claims about “fair use,” “zero‑shot,” and originality are impossible to evaluate.