2025-01-21

It sure looks like Meta stole a lot of books to build its AI

Legality, Fair Use, and How Data Was Obtained

Major thread on whether training on pirated books is more legally problematic than training on legitimately scanned books (e.g., library scans in prior cases).
One side: training is “reading,” highly transformative, and thus fair use regardless of being done by a machine or for profit; courts have previously allowed automated uses like thumbnails and snippet search.
Other side: how the works were obtained (torrenting, known-infringing shadow libraries) should matter; courts may view willful acquisition of illegal copies as aggravating, possibly justifying punitive damages.
Dispute over whether a model regurgitating its training text crosses into infringement, as opposed to merely “learning” from it.

Human vs AI Learning Analogy

Pro-AI camp argues AI should be allowed to learn from texts like humans do, without per-work licensing, and that “machines vs humans” shouldn’t change fair use analysis.
Critics stress scale, automation, and corporate control: a for-profit system ingesting millions of works is not ethically or legally equivalent to an individual reading.
Some note that humans can’t practically reproduce entire books verbatim but models sometimes can, which weakens the analogy.

Copyright, Morality, and Compensation

Many participants see the current copyright regime as overgrown yet still necessary to support writers and artists.
Some want this case to weaken copyright generally; others want it to curb “IP laundering” by large firms while preserving protection for individual creators.
Proposals floated: book-purchase requirements, library-like access rules, streaming-style collective compensation; none were spelled out in detail.
Disagreement on damages: some think model developers should pay, at most, the retail price per book; others argue that capping damages at retail would incentivize systematic “stealing,” so infringement must be penalized more harshly.

Assessment of Meta and Article Tone

Several comments criticize Meta’s pattern of “copy first, ask later,” seeing this case as a continuation of past behavior.
Others call the article emotionally charged or biased, especially early references to old scandals and political framing.

Scraping, Site Defenses, and Data Poisoning

Discussion of JavaScript that blocks text selection, and of paywalls, as weak defenses against sophisticated scrapers.
Ideas raised about adversarial content or “poisoned” pages to corrupt AI training data, compared humorously to SEO.
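
To illustrate why selection-blocking is considered a weak defense, here is a minimal sketch using only Python's standard library. The page content, the `TextExtractor` class, and the `extract_text` helper are hypothetical and written for illustration; nothing here is taken from the thread or from any real scraper. The point is only that such defenses act inside a browser, while a program that parses the raw HTML never runs them.

```python
from html.parser import HTMLParser

# Hypothetical page: CSS and inline handlers block selection and copying
# in a browser, but the text is still present in the markup.
PAGE = """
<html><head>
<style>body { user-select: none; }</style>
<script>document.onselectstart = function () { return false; };</script>
</head>
<body oncopy="return false">
<p>Chapter 1. It was a dark and stormy night.</p>
</body></html>
"""


class TextExtractor(HTMLParser):
    """Collects visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside <script> or <style>
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def extract_text(html):
    """Return the page's visible text as a single space-joined string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


if __name__ == "__main__":
    print(extract_text(PAGE))
```

The script and style rules are never executed here, so they have no effect on what is extracted. Paywalls that withhold the text on the server side are a different case and are not modeled by this sketch.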
