Blocking Internet Archive Won't Stop AI, but Will Erase Web's Historical Record
Crawling, Archiving, and Robots.txt
- Internet Archive itself doesn’t use distributed residential crawlers; a separate group (ArchiveTeam) does via its “Warrior” system.
- Archivists often ignore robots.txt, arguing it was meant for search engines and is misused to block legitimate archiving.
- Some site operators say Archive.org and ArchiveTeam show little regard for robots.txt and can be quite aggressive, though ArchiveTeam says it slows or stops crawling when it is overloading a site (a minimal robots.txt sketch follows below).
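For context on the convention being argued over, here is a minimal sketch, assuming a hypothetical robots.txt and illustrative user-agent names, of how a crawler that respects robots.txt decides what it may fetch (using Python's standard-library urllib.robotparser). A crawler that ignores robots.txt simply skips this check.

```python
# Minimal sketch of a robots.txt check. The robots.txt contents and the
# user-agent names below are hypothetical, for illustration only.
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /drafts/

User-agent: ia_archiver
Disallow: /

User-agent: *
Crawl-delay: 10
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# A search engine is allowed to fetch articles but not drafts...
print(parser.can_fetch("Googlebot", "https://example.com/articles/1"))  # True
print(parser.can_fetch("Googlebot", "https://example.com/drafts/1"))    # False

# ...while the archive crawler is blocked site-wide. A crawler that
# ignores robots.txt (as some archivists argue is legitimate) never
# calls can_fetch() at all.
print(parser.can_fetch("ia_archiver", "https://example.com/articles/1"))  # False
```

The per-agent groups are exactly what commenters dispute: a file written with search engines in mind can end up blocking archivists site-wide, which is why some archivists choose to ignore it.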
News Publishers, AI Scraping, and Economics
- Many argue publishers blocking Internet Archive are primarily trying to stop AI scraping and protect subscription/ad revenue.
- Others say this inevitably also blocks archivists; allowing archives but blocking AI is technically and economically hard.
- There’s skepticism that publishers would donate archives when AI firms will pay for data; archiving competes with licensing revenue.
Proposed Compromises for Archiving
- Common suggestion: “archive now, release later” (after weeks, years, or decades) to protect revenue from fresh content while still preserving history.
- Disagreement centers on what delay is “reasonable” and how to prevent AI firms from using archives as a backdoor.
Archive.is and Ethics of Defense Tactics
- Some see archive.is as a necessary, resilient alternative when other archives are blocked.
- Others strongly criticize the operator for allegedly using visitors’ traffic in a hidden DDoS to fight a doxxing attempt, calling it a betrayal of user trust.
- Defenders argue the DDoS was a desperate response to a real threat to the operator’s anonymity; critics counter that conscripting users is never acceptable.
AI Training Value of News Content
- One side claims news text is a tiny fraction of web data and not that important for LLMs.
- Others argue that high‑quality journalism is disproportionately valuable compared to “junk” web content, especially for factual, real‑world knowledge.
Defending Against Scrapers
- Operators report severe load from AI crawlers and describe technical defenses: JA3/JA4 TLS fingerprinting, TCP/HTTP fingerprinting, per‑UA rules, and IP blocking.
- Concerns that increasingly sophisticated evasion (randomized fingerprints, human‑like browser automation) will make blocking nearly impossible.
- Suggestions include mTLS or signed crawler requests so known archivists can bypass generic bot blocks (a minimal signing sketch follows below).
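One way to realize the signed-request idea is an HMAC over each request, verified at the origin before the usual bot rules apply. This is only a sketch under assumptions: the header names, the out-of-band shared key, and the signing scheme are hypothetical, not an existing standard.

```python
# Minimal sketch of signed crawler requests: a trusted archivist signs each
# request with a shared secret, and the origin server verifies the signature
# before exempting the request from generic bot blocking. Header names,
# the key-distribution step, and the timestamp window are all assumptions.
import hmac
import hashlib
import time

SHARED_KEY = b"key-issued-out-of-band-to-a-known-archivist"  # hypothetical

def sign_request(method: str, path: str, timestamp: int, key: bytes) -> str:
    message = f"{method}\n{path}\n{timestamp}".encode()
    return hmac.new(key, message, hashlib.sha256).hexdigest()

def verify_request(method: str, path: str, timestamp: int,
                   signature: str, key: bytes, max_skew: int = 300) -> bool:
    if abs(time.time() - timestamp) > max_skew:      # reject stale/replayed requests
        return False
    expected = sign_request(method, path, timestamp, key)
    return hmac.compare_digest(expected, signature)  # constant-time comparison

# Crawler side: attach hypothetical headers to the outgoing request.
ts = int(time.time())
headers = {
    "X-Crawler-Timestamp": str(ts),
    "X-Crawler-Signature": sign_request("GET", "/article/123", ts, SHARED_KEY),
}

# Server side: verify before skipping the usual bot-blocking rules.
ok = verify_request("GET", "/article/123",
                    int(headers["X-Crawler-Timestamp"]),
                    headers["X-Crawler-Signature"], SHARED_KEY)
print("bypass bot blocks" if ok else "apply normal bot rules")
```

In practice the hard part is issuing and rotating keys for crawlers you trust; mTLS pushes the same check down into the TLS handshake instead of an application-level header.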
Future of the Public Web
- Some think AI scraping is ultimately unstoppable; anything publicly served will be archived and reused.
- Proposed responses:
  - Move sensitive/valuable content off the open web into private or DRM‑protected spaces.
  - Use content‑addressable storage or P2P systems so scrapers hit shared caches instead of origin servers (see the sketch after this list).
  - Accept that scraping is the “price” of an open web, and focus on reducing load rather than stopping it.
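The content‑addressable idea can be sketched as follows: each document is stored and served under the hash of its bytes, so any shared cache or peer holding that hash can answer the request instead of the origin. The in-memory dict below stands in for a real cache or P2P layer; that substitution is an assumption for illustration.

```python
# Minimal sketch of content-addressable serving: content is keyed by the
# SHA-256 of its bytes, so identical content resolves to the same address
# and can be served from any shared cache or peer rather than the origin.
# The in-memory dict stands in for a real cache/P2P layer (an assumption).
import hashlib

store: dict[str, bytes] = {}

def put(content: bytes) -> str:
    """Store content under its hash and return the content address."""
    address = hashlib.sha256(content).hexdigest()
    store[address] = content
    return address

def get(address: str) -> bytes | None:
    """Fetch by content address; verify integrity before returning."""
    content = store.get(address)
    if content is not None and hashlib.sha256(content).hexdigest() != address:
        return None  # corrupted copy: a real system would try another peer
    return content

article = b"<html>...archived article bytes...</html>"
addr = put(article)
print(addr[:16], get(addr) == article)  # same address for same bytes -> cache hit
```

The point of the design is that identical bytes always map to the same address, so any cache that has seen the content once can serve it without touching the origin server.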
Critiques of EFF, Archives, and Tech Ideals
- Several commenters say the EFF understates the role archives have already played in enabling commercial LLM training and prior copyright conflicts (e.g., e‑book lawsuits).
- Some argue tech “utopian” projects (open source, public archives) repeatedly become free input for extractive business models, eroding privacy and creator control.
- Others accuse news organizations of exaggerating harms to justify restricting access, or note that archives also enable paywall bypassing, which publishers understandably dislike.
Media Power, History, and Erasure
- Commenters criticize major outlets (e.g., for past war coverage and alignment with state narratives) yet still insist their archives are historically crucial.
- One view: by blocking archiving, outlets risk self‑erasure from the historical record and long‑term irrelevance.
- Another view: given propaganda and capture, perhaps it’s not obvious that preserving every article forever is unambiguously good.