News publishers limit Internet Archive access due to AI scraping concerns
Motives for Blocking the Internet Archive & Crawlers
- Publishers are increasingly blocking the Internet Archive (IA) and Common Crawl, especially large news sites; one estimate cited ~20% of major outlets, with smaller sites blocking at lower rates.
- Stated reasons: AI training without consent/compensation, cost of serving heavy bot traffic, and protection of paywalls and syndication/archives businesses.
- Several commenters argue AI is a convenient scapegoat; the real driver is paywall circumvention and preserving paid research/archive products sold to libraries.
Effectiveness and Unintended Consequences
- Multiple people note that IA and “good” bots honor robots.txt, while determined AI scrapers will simply impersonate humans or use residential proxies, so blocks mainly hurt archivists and the public.
- Some site owners report being hammered by poorly engineered AI scrapers (thousands of requests per second, repeated recrawls of unchanged pages), prompting blanket AI blocks.
- Others argue blocking IA will push AI companies to scrape sites individually anyway, increasing load; the “common man” loses access while well-capitalized actors adapt.
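The asymmetry above comes down to robots.txt being purely advisory: compliant crawlers identify themselves and honor disallow rules, while bad actors ignore the file entirely. A minimal sketch of the kind of blanket block publishers deploy (the user-agent tokens shown are commonly documented ones for IA, Common Crawl, and OpenAI's crawler; any given bot may announce itself differently or not at all):

```
# Block archive and AI-training crawlers that identify themselves.
User-agent: ia_archiver
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: GPTBot
Disallow: /

# Everyone else may crawl normally.
User-agent: *
Disallow:
```

Because enforcement is voluntary, this stops the Wayback Machine and Common Crawl but does nothing against a scraper spoofing a browser user-agent from residential IPs, which is exactly the complaint in the thread.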
Impact on History, Science, and Compliance
- Strong concern about eroding the public record: loss of news archives harms historians, legal evidence, accountability, and scientific reproducibility.
- Examples: missing government guidance, ToS, and API docs where the Wayback Machine has been critical for audits and SOC 2–style compliance.
- Some suggest legal requirements or fair-use carveouts for archival of publicly available content; others reply that serving bots costs real money and some content is paywalled by design.
Preservation vs Privacy and ‘Slop’
- A minority welcomes less archiving, seeing permanent records as dangerous for individuals in changing political climates.
- Others counter that most content is harmless and that societies can’t “learn from history” if it’s constantly erased.
- There’s debate over whether preserving today’s largely AI- and clickbait-filled web is worth the storage; some predict pre-AI-era web snapshots will become especially valuable.
Alternative Archival Models & Tools
- Suggestions include:
  - Crowd-sourced archiving via browser extensions that save pages users actually visit.
  - Volunteer projects (ArchiveTeam), self-hosted tools (ArchiveBox, Linkwarden), and hash-addressed or decentralized systems (IPFS, Nostr).
- Challenges raised: privacy (fingerprinting in pages), verifying unmodified copies, and potential ToS violations.
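The hash-addressed approach speaks to the "verifying unmodified copies" challenge: store a cryptographic digest alongside the archived bytes, and anyone holding a copy can check it against the recorded digest. A minimal sketch in Python (function names are illustrative, not from any particular archiving tool):

```python
import hashlib


def content_hash(page_bytes: bytes) -> str:
    """Return a hex SHA-256 digest usable as a content address."""
    return hashlib.sha256(page_bytes).hexdigest()


def verify_copy(recorded_hash: str, candidate_bytes: bytes) -> bool:
    """Check that a mirrored copy matches the originally recorded hash."""
    return content_hash(candidate_bytes) == recorded_hash


# An archiver records the hash at capture time...
snapshot = b"<html><body>Example archived page</body></html>"
recorded = content_hash(snapshot)

# ...and any later holder of a copy can verify it independently.
print(verify_copy(recorded, snapshot))         # unmodified copy
print(verify_copy(recorded, snapshot + b" "))  # tampered copy
```

This is essentially what IPFS does at the protocol level (content identifiers are derived from hashes), though it does not by itself solve the privacy or ToS concerns raised above.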
Business Models, Copyright, and AI
- Many see this as a business-model clash: AI systems capture value without linking back or sharing revenue, unlike traditional search engines.
- Ideas floated: embargoed public archiving (e.g., after weeks/months), academic-only archives, or paid/licensed AI access to news back catalogs.
- Others argue that if a business model depends on banning legal scraping of public pages, it may be unsustainable.