It sure looks like Meta stole a lot of books to build its AI

Legality, Fair Use, and How Data Was Obtained

  • Major thread on whether training on pirated books is more legally problematic than training on legitimately scanned books (e.g., library scans in prior cases).
  • One side: training is “reading,” highly transformative, and thus fair use regardless of whether it is done by a machine or for profit; courts have previously allowed automated uses like thumbnails and snippet search.
  • Other side: how the works were obtained (torrenting, known-infringing shadow libraries) should matter; courts may view willful acquisition of illegal copies as aggravating, possibly justifying punitive damages.
  • Dispute over whether a model that regurgitates its training text crosses into infringement rather than merely “learning” from that text.

Human vs AI Learning Analogy

  • Pro-AI camp argues AI should be allowed to learn from texts like humans do, without per-work licensing, and that “machines vs humans” shouldn’t change fair use analysis.
  • Critics stress scale, automation, and corporate control: a for-profit system ingesting millions of works is not ethically or legally equivalent to an individual reading.
  • Some note that humans can’t practically reproduce entire books verbatim but models sometimes can, which weakens the analogy.

Copyright, Morality, and Compensation

  • Many participants see the current copyright regime as overgrown yet still necessary to support writers and artists.
  • Some want this case to weaken copyright generally; others want it to curb “IP laundering” by large firms while preserving protection for individual creators.
  • Proposals floated: book-purchase requirements, library-like access rules, streaming-style collective compensation; none detailed.
  • Disagreement on damages: some think model developers should owe, at most, the retail price of each book; others argue that would incentivize systematic “stealing,” so infringement must be penalized more harshly.

Assessment of Meta and Article Tone

  • Several comments criticize Meta’s pattern of “copy first, ask later,” seeing this case as a continuation of past behavior.
  • Others call the article emotionally charged or biased, especially its early references to old scandals and its political framing.

Scraping, Site Defenses, and Data Poisoning

  • Discussion of JavaScript that blocks text selection, and of paywalls, as weak defenses against sophisticated scrapers.
  • Ideas raised about adversarial content or “poisoned” pages to corrupt AI training data, compared humorously to SEO.