Meta torrented & seeded an 81.7 TB dataset containing copyrighted data
Dataset scale and sourcing
- Commenters estimate 81.7 TB corresponds to many millions of books; some link it to the LibGen/Sci-Hub torrents mirrored by Anna’s Archive (~90M files averaging ~1 MB each).
- Debate over file sizes: plain-text ebooks vs large scanned PDFs with illustrations, charts, and image-only pages.
- Some note Meta’s own LLaMA paper already acknowledges using Books3 (derived from a private tracker dump) plus other copyrighted corpora.
- Internal messages quoted in the article show staff worried about torrenting from corporate IPs and configuring clients to “seed as little as possible,” prompting jokes about Meta being “leechers.”
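The ~90M-files-at-~1 MB estimate from the thread can be sanity-checked with back-of-envelope arithmetic. Both inputs below are commenter estimates, not official figures:

```python
# Back-of-envelope check of the thread's dataset-size estimate.
# Assumptions (from commenters, not confirmed): ~90M files, ~1 MB average.
num_files = 90_000_000
avg_file_bytes = 1_000_000           # ~1 MB per file (plain-text ebooks)

total_bytes = num_files * avg_file_bytes
total_tb = total_bytes / 1e12        # decimal terabytes
total_tib = total_bytes / 2**40      # binary tebibytes

print(f"{total_tb:.1f} TB ({total_tib:.1f} TiB)")
# prints "90.0 TB (81.9 TiB)"
```

90 TB decimal is ~81.9 TiB, close to the reported 81.7 — consistent with the headline figure if it is actually in TiB, which would also explain the debate over scanned PDFs versus plain text: a larger average file size would imply far fewer books.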
Legality: training vs distribution
- Thread distinguishes two issues:
  - Training on copyrighted works (still a largely unresolved legal question; often argued as potential “fair use”).
  - Downloading and seeding pirated torrents (clearly infringing distribution regardless of AI context).
- Several calculate statutory damages (US minimum of $750 per work, up to $150k per work for willful infringement) and note that, at Meta’s scale, the theoretical totals run into the trillions of dollars, far beyond any realistic judgment.
- Some expect a modest fine or settlement; others predict courts may effectively legalize large-scale training on unlicensed data to avoid “crippling” US AI competitiveness.
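The statutory-damages math cited in the thread can be sketched the same way. The per-work figures come from US law (17 U.S.C. § 504(c)); the ~90M work count is a commenter estimate and assumes every file is a separately registered work, which would not hold in practice:

```python
# Rough statutory-damages range at the scale commenters estimate.
# Assumption: ~90M distinct works (registration requirements and
# deduplication would reduce this number dramatically in reality).
works = 90_000_000
min_per_work = 750          # US statutory minimum per work
willful_cap = 150_000       # per-work cap for willful infringement

low = works * min_per_work      # $67.5 billion
high = works * willful_cap      # $13.5 trillion
print(f"${low / 1e9:.1f}B to ${high / 1e12:.1f}T")
# prints "$67.5B to $13.5T"
```

Even the statutory minimum exceeds any copyright judgment ever awarded, which is why commenters treat the trillions figure as rhetorical rather than a realistic exposure.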
Two-tier justice and Aaron Swartz
- Strong theme: contrast between Meta’s mass infringement and harsh treatment of individuals (Swartz, Megaupload, small-time torrenters).
- Many see evidence of a “two-tier” legal system where corporations with lawyers and lobbyists get settlements, while individuals face ruinous penalties or prosecution.
- Swartz’s case is revisited in detail; some stress prosecutorial overreach, others add nuance about plea deals, but most see the comparison as highlighting double standards.
Copyright, piracy, and precedent
- Multiple historical analogies: YouTube’s early TV uploads, Google’s web indexing and book scanning, Spotify/Crunchyroll bootstrapping on pirated catalogs, Uber/Airbnb ignoring regulations.
- Divided views:
  - One camp wants stricter, evenly enforced IP laws and corporate accountability.
  - Another argues current copyright terms (life + 70 years, etc.) are “insane,” largely benefit big intermediaries, and should be radically shortened or abolished.
- LibGen/Anna’s Archive are described by many as a “civilizational project”; preserving and democratizing access is seen as good, but monetizing that corpus via proprietary AI is more contentious.
Ethics of Meta and AI companies
- Some focus on alleged deception: internal references to “stealth mode,” minimizing seeding, and potential misstatements in depositions (including by leadership).
- Others argue Meta is relatively better than closed competitors because LLaMA weights are public, framing this as “software communism” versus OpenAI/Google’s proprietary models.
- A recurring proposal: if a model is trained on unlicensed copyrighted data, its weights should be forced into the public domain or non-commercial-only use.
Broader structural critiques
- Many tie this to a pattern where VC-backed firms “move fast and break laws,” then normalize their position via lobbying and settlements.
- Several advocate either strong antitrust and IP enforcement against large firms, or—at the other extreme—using corporate overreach as leverage to dismantle or radically reform copyright itself.