Anthropic agrees to pay $1.5B to settle lawsuit with book authors
Nature of the case & what was actually punished
- Many commenters stress this lawsuit was about piracy, not about whether training on copyrighted books is fair use.
- Anthropic allegedly downloaded large “shadow library” datasets (LibGen, Books3, PiLiMi), then later bought physical books and destructively scanned them.
- Settlement terms (as extracted from filings):
  - A $1.5B fund, estimated at ~$3,000 per copyrighted work (500k works; the fund grows if more works are proven).
  - Destruction of the pirated shadow-library datasets.
  - Release covers only past infringement on the listed works, not future training or model outputs.
Fair use and model training
- A prior ruling by the judge found that training on legally acquired books was fair use and “transformative”; the illegal act was downloading pirated copies.
- Several participants underline that a settlement creates no binding precedent, but the earlier district ruling is now persuasive authority that others will cite.
- Others argue fair use was never meant for massive LLM training, and that “reading” vs. “perfect recall & regurgitation” remains unresolved in other cases (e.g., Meta, OpenAI).
Economic & strategic takes
- Many see $1.5B as a “cheap” price for having rushed ahead on pirated data, given Anthropic’s tens of billions in funding and its valuation.
- Some think investors likely pushed to settle to remove existential downside and avoid an appellate precedent.
- Debate over proportionality: $3,000 per $30 book seems high to some, but others note statutory damages can reach $150,000 per work, so this is a discount.
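The proportionality numbers above can be checked directly; the figures (fund size, estimated class size, statutory ceiling) are taken from the thread's summary of the filings:

```python
# Settlement fund and estimated number of covered works, per the filings
# as summarized above.
fund_usd = 1_500_000_000
estimated_works = 500_000

per_work = fund_usd / estimated_works
print(f"~${per_work:,.0f} per work")  # ~$3,000 per work

# Statutory damages for willful infringement can reach $150,000 per work,
# so the settlement amount is a small fraction of that ceiling.
statutory_max_usd = 150_000
print(f"{per_work / statutory_max_usd:.1%} of the statutory maximum")  # 2.0%
```

So $3,000 is 100x the price of a $30 book, but only 2% of the maximum statutory exposure per work, which is the "discount" commenters point to.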
Impact on competitors & open source
- Widespread speculation about pressure on OpenAI, Meta, Microsoft; some think this effectively “prices in” book piracy as a one‑off cost of doing business.
- Concern that only giant, well‑funded players can now afford clean book corpora (buy + scan), further squeezing startups and open‑source efforts.
- Some fear this accelerates consolidation; others argue data cost is still tiny compared to compute.
Books, libraries & data sourcing debates
- Long subthread on whether buying/borrowing physical books then scanning them is ethically/legally different from torrents, and whether this is “scalable.”
- Comparisons to Google Books and the Internet Archive:
  - Google’s book scanning for search/preview was upheld as fair use; the Internet Archive’s full-book lending remains contested.
  - Commenters note the irony that destructive scanning for AI training is acceptable while non-AI archives are punished.
Ethics, corruption & “move fast” culture
- Strong resentment toward the “break the law at scale, pay later” startup playbook, with analogies to Uber and other tech firms that used illegality as a growth strategy.
- Some argue this normalizes a regime where only rich entities can afford to violate the law, then settle—eroding the social contract and confidence in institutions.
Authors’ perspective & payouts
- Authors in the thread actively look up whether their works are in LibGen and register with the settlement site; some note they may earn more from this than from sales.
- Dispute over who really benefits: large publishers vs individual authors; many expect much of the money to go to rights‑holding corporations, not creators.
International & future legal landscape
- Discussion of jurisdictions (EU text‑and‑data‑mining exceptions, Japan, Singapore, Switzerland) where training may be broadly allowed if data is lawfully accessed.
- Some foresee countries explicitly carving out AI‑training exceptions to attract AI companies, while others warn that Chinese labs, less constrained by Western copyright, may gain a long‑term data advantage.
- Ongoing uncertainty flagged: future rulings on outputs (regurgitation, style emulation), contract‑based restrictions (EULAs barring training), and new litigation (e.g., NYT‑style cases) are still “live.”