News publishers limit Internet Archive access due to AI scraping concerns
Motives for Blocking the Internet Archive & Crawlers
- Publishers are increasingly blocking the Internet Archive (IA) and Common Crawl, especially large news sites; one estimate cited ~20% of major outlets, with smaller sites blocking at lower rates.
- Stated reasons: AI training without consent/compensation, cost of serving heavy bot traffic, and protection of paywalls and syndication/archives businesses.
- Several commenters argue AI is a convenient scapegoat; the real driver is paywall circumvention and preserving paid research/archive products sold to libraries.
Effectiveness and Unintended Consequences
- Multiple people note that IA and “good” bots honor robots.txt, while determined AI scrapers will simply impersonate humans or use residential proxies, so blocks mainly hurt archivists and the public.
- Some site owners report being hammered by poorly engineered AI scrapers (thousands of requests per second, repeated recrawls of unchanged pages), prompting blanket AI blocks.
- Others argue blocking IA will push AI companies to scrape sites individually anyway, increasing load; the “common man” loses access while well-capitalized actors adapt.
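The asymmetry above comes down to robots.txt being purely advisory: compliant crawlers identify themselves and honor disallow rules, while bad actors ignore the file entirely. A minimal sketch of the kind of blanket block publishers deploy (the user-agent tokens shown are commonly documented ones for IA, Common Crawl, and OpenAI's crawler; any given bot may announce itself differently or not at all):

```
# Block archive and AI-training crawlers that identify themselves.
User-agent: ia_archiver
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: GPTBot
Disallow: /

# Everyone else may crawl normally.
User-agent: *
Disallow:
```

Because enforcement is voluntary, this stops the Wayback Machine and Common Crawl but does nothing against a scraper spoofing a browser user-agent from residential IPs, which is exactly the complaint in the thread.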
Impact on History, Science, and Compliance
- Strong concern about eroding the public record: loss of news archives harms historians, legal evidence, accountability, and scientific reproducibility.
- Examples: missing government guidance, ToS, and API docs where the Wayback Machine has been critical for audits and SOC 2–style compliance.
- Some suggest legal requirements or fair-use carveouts for archival of publicly available content; others reply that serving bots costs real money and some content is paywalled by design.
Preservation vs Privacy and ‘Slop’
- A minority welcomes less archiving, seeing permanent records as dangerous for individuals in changing political climates.
- Others counter that most content is harmless and that societies can’t “learn from history” if it’s constantly erased.
- There’s debate over whether preserving today’s largely AI- and clickbait-filled web is worth the storage; some predict pre-AI-era web snapshots will become especially valuable.
Alternative Archival Models & Tools
- Suggestions include:
  - Crowd-sourced archiving via browser extensions that save pages users actually visit.
  - Volunteer projects (ArchiveTeam), self-hosted tools (ArchiveBox, Linkwarden), and hash-addressed or decentralized systems (IPFS, Nostr).
- Challenges raised: privacy (fingerprinting in pages), verifying unmodified copies, and potential ToS violations.
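The hash-addressed approach speaks to the "verifying unmodified copies" challenge: store a cryptographic digest alongside the archived bytes, and anyone holding a copy can check it against the recorded digest. A minimal sketch in Python (function names are illustrative, not from any particular archiving tool):

```python
import hashlib


def content_hash(page_bytes: bytes) -> str:
    """Return a hex SHA-256 digest usable as a content address."""
    return hashlib.sha256(page_bytes).hexdigest()


def verify_copy(recorded_hash: str, candidate_bytes: bytes) -> bool:
    """Check that a mirrored copy matches the originally recorded hash."""
    return content_hash(candidate_bytes) == recorded_hash


# An archiver records the hash at capture time...
snapshot = b"<html><body>Example archived page</body></html>"
recorded = content_hash(snapshot)

# ...and any later holder of a copy can verify it independently.
print(verify_copy(recorded, snapshot))         # unmodified copy
print(verify_copy(recorded, snapshot + b" "))  # tampered copy
```

This is essentially what IPFS does at the protocol level (content identifiers are derived from hashes), though it does not by itself solve the privacy or ToS concerns raised above.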
Business Models, Copyright, and AI
- Many see this as a business-model clash: AI systems capture value without linking back or sharing revenue, unlike traditional search engines.
- Ideas floated: embargoed public archiving (e.g., after weeks/months), academic-only archives, or paid/licensed AI access to news back catalogs.
- Others argue that if a business model depends on banning legal scraping of public pages, it may be unsustainable.