Blocking Internet Archive Won't Stop AI, but Will Erase Web's Historical Record

Crawling, Archiving, and Robots.txt

  • Internet Archive itself doesn’t use distributed residential crawlers; a separate group (ArchiveTeam) does via its “Warrior” system.
  • Archivists often ignore robots.txt, arguing it was meant for search engines and is misused to block legitimate archiving.
  • Some site operators say Archive.org and ArchiveTeam show little regard for robots.txt and can crawl quite aggressively, though ArchiveTeam says it slows or stops when a site is being overloaded.
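The dispute turns on the fact that robots.txt is purely advisory: it expresses per-user-agent rules that a crawler may simply ignore. A minimal illustration using Python's stdlib parser (the robots.txt content here is hypothetical; GPTBot and CCBot are real AI-crawler tokens, and `ia_archiver` is the token historically used by the Internet Archive):

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that asks AI crawlers to stay out while
# leaving everyone else (including archivists) a permissive default.
robots_txt = """\
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: *
Disallow:
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

print(parser.can_fetch("GPTBot", "/article"))       # False: matched and blocked
print(parser.can_fetch("ia_archiver", "/article"))  # True: falls through to "*"
```

Whether any crawler honors these answers is voluntary, which is exactly the archivists' point, and why publishers increasingly resort to hard blocks instead.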

News Publishers, AI Scraping, and Economics

  • Many argue publishers blocking Internet Archive are primarily trying to stop AI scraping and protect subscription/ad revenue.
  • Others say this inevitably also blocks archivists; allowing archives but blocking AI is technically and economically hard.
  • There’s skepticism that publishers would donate archives when AI firms will pay for data; archiving competes with licensing revenue.

Proposed Compromises for Archiving

  • Common suggestion: “archive now, release later” (weeks, years, or decades) to protect fresh revenue while preserving history.
  • Disagreement centers on what delay is “reasonable” and how to prevent AI firms from using archives as a backdoor.

Archive.is and Ethics of Defense Tactics

  • Some see archive.is as a necessary, resilient alternative when other archives are blocked.
  • Others strongly criticize the operator for allegedly using visitors’ traffic in a hidden DDoS to fight a doxxing attempt, calling it a betrayal of user trust.
  • Defenders argue the DDoS was a desperate response to a real threat to the operator’s anonymity; critics counter that conscripting users’ browsers is never acceptable.

AI Training Value of News Content

  • One side claims news text is a tiny fraction of web training data and not especially important to LLMs.
  • Others argue that high‑quality journalism is disproportionately valuable compared to “junk” web content, especially for factual, real‑world knowledge.

Defending Against Scrapers

  • Operators report severe load from AI crawlers and describe technical defenses: JA3/JA4 TLS fingerprinting, TCP/HTTP fingerprinting, per‑UA rules, and IP blocking.
  • Concerns that increasingly sophisticated evasion (randomized fingerprints, human‑like browser automation) will make blocking nearly impossible.
  • Suggestions include mTLS or signed crawler requests so known archivists can bypass generic bot blocks.
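TLS fingerprinting works because a crawler's TLS stack announces the same ClientHello parameters on every connection, regardless of its claimed User-Agent. A minimal sketch of the JA3 idea (the field values below are illustrative, not from any real client):

```python
import hashlib

def ja3_fingerprint(tls_version, ciphers, extensions, curves, point_formats):
    """Compute a JA3-style fingerprint from ClientHello fields.

    JA3 joins each numeric list with '-', joins the five fields with ',',
    and takes the MD5 hex digest of the resulting string.
    """
    fields = [
        str(tls_version),
        "-".join(map(str, ciphers)),
        "-".join(map(str, extensions)),
        "-".join(map(str, curves)),
        "-".join(map(str, point_formats)),
    ]
    ja3_string = ",".join(fields)
    return hashlib.md5(ja3_string.encode()).hexdigest()

# Two connections from the same TLS stack yield the same digest, which is
# what lets an operator block an entire crawler fleet with one rule --
# and why randomized fingerprints defeat this kind of matching.
fp = ja3_fingerprint(771, [4865, 4866, 4867], [0, 11, 10, 35], [29, 23, 24], [0])
```

The evasion concern in the thread follows directly: once crawlers randomize these fields per connection, the fingerprint stops being a stable key to block on.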

Future of the Public Web

  • Some think AI scraping is ultimately unstoppable; anything publicly served will be archived and reused.
  • Proposed responses:
    • Move sensitive/valuable content off the open web into private or DRM‑protected spaces.
    • Use content‑addressable storage or P2P systems so scrapers hit shared caches instead of origin servers.
    • Accept that scraping is the “price” of an open web, and focus on reducing load rather than stopping it.

Critiques of EFF, Archives, and Tech Ideals

  • Several commenters say the EFF understates the role archives have already played in enabling commercial LLM training and prior copyright conflicts (e.g., e‑book lawsuits).
  • Some argue tech “utopian” projects (open source, public archives) repeatedly become free input for extractive business models, eroding privacy and creator control.
  • Others accuse news organizations of exaggerating harms to justify restricting access, or note that archives also enable paywall bypassing, which publishers understandably dislike.

Media Power, History, and Erasure

  • Commenters both criticize major outlets (e.g., for past war coverage and alignment with state narratives) and still insist their archives are historically crucial.
  • One view: by blocking archiving, outlets risk self‑erasure from the historical record and long‑term irrelevance.
  • Another view: given propaganda and capture, perhaps it’s not obvious that preserving every article forever is unambiguously good.