Anthropic agrees to pay $1.5B to settle lawsuit with book authors

Nature of the case & what was actually punished

  • Many commenters stress this lawsuit was about piracy, not about whether training on copyrighted books is fair use.
  • Anthropic allegedly downloaded large “shadow library” datasets (LibGen, Books3, PiLiMi), then later bought physical books and destructively scanned them.
  • Settlement terms (as extracted from filings):
    • $1.5B fund, estimated ~$3,000 per copyrighted work (500k works; more money if more works are proven).
    • Destruction of pirated datasets from shadow libraries.
    • Release only for past infringement on listed works, not for future training or for model outputs.

Fair use and model training

  • A prior ruling by the judge found that training on legally acquired books was fair use and “transformative”; the illegal act was downloading pirated copies.
  • Several participants underline that the settlement itself creates no binding precedent, though the earlier district-court ruling now stands as persuasive authority that others will cite.
  • Others argue fair use was never meant to cover massive LLM training, and that the distinction between “reading” and “perfect recall & regurgitation” remains unresolved in other cases (e.g., Meta, OpenAI).

Economic & strategic takes

  • Many see $1.5B as a “cheap” price for having rushed ahead using pirated data, given Anthropic’s multi‑tens‑of‑billions funding and valuation.
  • Some think investors likely pushed to settle to remove existential downside and avoid an appellate precedent.
  • Debate over proportionality: $3,000 per $30 book seems high to some, but others note statutory damages can reach $150,000 per work, so this is a discount.
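
The per-work figures debated above follow from simple division. A back-of-the-envelope sketch (using the round numbers reported in the thread; the actual claims process will vary by work and rights holder):

```python
# Sanity check on the settlement arithmetic discussed above.
# Figures are the round numbers from the thread, not exact filing values.
settlement_fund = 1_500_000_000   # $1.5B settlement fund
listed_works = 500_000            # ~500k copyrighted works

per_work = settlement_fund / listed_works
print(f"~${per_work:,.0f} per work")  # ~$3,000 per work

# Statutory damages for willful infringement can reach $150,000 per work,
# so relative to that ceiling the settlement is a steep discount.
statutory_max = 150_000
discount = 1 - per_work / statutory_max
print(f"discount vs. statutory maximum: {discount:.0%}")  # 98%
```

This is why commenters can simultaneously call $3,000 “high” relative to a $30 retail book and “cheap” relative to statutory exposure: both comparisons are arithmetically correct, they just use different baselines.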

Impact on competitors & open source

  • Widespread speculation about pressure on OpenAI, Meta, Microsoft; some think this effectively “prices in” book piracy as a one‑off cost of doing business.
  • Concern that only giant, well‑funded players can now afford clean book corpora (buy + scan), further squeezing startups and open‑source efforts.
  • Some fear this accelerates consolidation; others argue data cost is still tiny compared to compute.

Books, libraries & data sourcing debates

  • Long subthread on whether buying/borrowing physical books then scanning them is ethically/legally different from torrents, and whether this is “scalable.”
  • Comparisons to Google Books and the Internet Archive:
    • Google’s scanning for search/preview was upheld as fair use; IA’s full book lending remains contested.
    • Commenters note the irony that destructive scanning for AI training passes legal muster while non‑AI archives face penalties.

Ethics, corruption & “move fast” culture

  • Strong resentment toward the “break the law at scale, pay later” startup playbook, with analogies to Uber and other tech firms that used illegality as a growth strategy.
  • Some argue this normalizes a regime where only rich entities can afford to violate the law, then settle—eroding the social contract and confidence in institutions.

Authors’ perspective & payouts

  • Authors in the thread actively look up whether their works are in LibGen and register with the settlement site; some note they may earn more from this than from sales.
  • Dispute over who really benefits: large publishers vs. individual authors; many expect much of the money to go to rights‑holding corporations rather than creators.

International & future legal landscape

  • Discussion of jurisdictions (EU text‑and‑data‑mining exceptions, Japan, Singapore, Switzerland) where training may be broadly allowed if data is lawfully accessed.
  • Some foresee countries explicitly carving out AI‑training exceptions to attract AI companies, while others warn that Chinese labs, less constrained by Western copyright, may gain a long‑term data advantage.
  • Ongoing uncertainty flagged: future rulings on outputs (regurgitation, style emulation), contract‑based restrictions (EULAs barring training), and new litigation (e.g., NYT‑style cases) are still “live.”