News publishers limit Internet Archive access due to AI scraping concerns

Motives for Blocking the Internet Archive & Crawlers

  • Publishers, especially large news sites, are increasingly blocking the Internet Archive (IA) and Common Crawl; one estimate cited ~20% of major outlets, with smaller sites blocking at lower rates.
  • Stated reasons: AI training without consent or compensation, the cost of serving heavy bot traffic, and protection of paywalls and syndication/archive businesses.
  • Several commenters argue AI is a convenient scapegoat; the real drivers are preventing paywall circumvention and protecting paid research/archive products sold to libraries.

Effectiveness and Unintended Consequences

  • Multiple people note that IA and “good” bots honor robots.txt, while determined AI scrapers will simply impersonate humans or use residential proxies, so blocks mainly hurt archivists and the public (see the robots.txt sketch after this list).
  • Some site owners report being hammered by poorly engineered AI scrapers (thousands of requests per second, repeated recrawls of unchanged pages), prompting blanket AI blocks.
  • Others argue blocking IA will push AI companies to scrape sites individually anyway, increasing load; the “common man” loses access while well-capitalized actors adapt.
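
As a rough sketch of the mechanism under discussion, a robots.txt file asks specific crawlers to stay away by user-agent token; the tokens below (ia_archiver, CCBot, GPTBot) are commonly cited identifiers for the Wayback Machine, Common Crawl, and OpenAI's crawler, but they are assumptions to verify against current documentation. Compliance is voluntary, which is the commenters' point: a scraper that presents a browser User-Agent never consults the file.

    # Hypothetical robots.txt for a news site blocking archival and AI crawlers
    User-agent: ia_archiver
    Disallow: /

    User-agent: CCBot
    Disallow: /

    User-agent: GPTBot
    Disallow: /

    # All other user agents remain unrestricted (empty Disallow = allow everything)
    User-agent: *
    Disallow: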

Impact on History, Science, and Compliance

  • Strong concern about eroding the public record: loss of news archives harms historians, legal evidence, accountability, and scientific reproducibility.
  • Examples include vanished government guidance, terms of service, and API documentation, cases where the Wayback Machine has been critical for audits and SOC 2–style compliance.
  • Some suggest legal requirements or fair-use carve-outs for archiving publicly available content; others reply that serving bots costs real money and some content is paywalled by design.

Preservation vs Privacy and ‘Slop’

  • A minority welcomes less archiving, seeing permanent records as dangerous for individuals in changing political climates.
  • Others counter that most content is harmless and that societies can’t “learn from history” if it’s constantly erased.
  • There’s debate over whether preserving today’s largely AI- and clickbait-filled web is worth the storage; some predict pre-AI-era web snapshots will become especially valuable.

Alternative Archival Models & Tools

  • Suggestions include:
    • Crowd-sourced archiving via browser extensions that save pages users actually visit.
    • Volunteer projects (ArchiveTeam), self-hosted tools (ArchiveBox, Linkwarden), and hash-addressed or decentralized systems (IPFS, Nostr).
  • Challenges raised: privacy (fingerprinting embedded in saved pages), verifying unmodified copies, and potential ToS violations; a minimal content-hashing sketch follows below.
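
A minimal sketch of the hash-addressed idea raised above, assuming only a local directory and Python's standard library: each captured page is stored under its SHA-256 digest, so anyone holding the digest can later confirm the copy is unmodified, the same property IPFS-style systems provide at network scale.

    import hashlib
    from pathlib import Path

    ARCHIVE_DIR = Path("archive")  # hypothetical local snapshot store

    def store_snapshot(page_bytes: bytes) -> str:
        """Store a captured page under its SHA-256 digest (content addressing)."""
        digest = hashlib.sha256(page_bytes).hexdigest()
        ARCHIVE_DIR.mkdir(exist_ok=True)
        (ARCHIVE_DIR / digest).write_bytes(page_bytes)
        return digest

    def verify_snapshot(digest: str) -> bool:
        """Re-hash the stored copy and confirm it still matches its address."""
        stored = (ARCHIVE_DIR / digest).read_bytes()
        return hashlib.sha256(stored).hexdigest() == digest

    # Example: archive a page body, then prove the stored copy was not altered.
    key = store_snapshot(b"<html>...captured page...</html>")
    assert verify_snapshot(key)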

Business Models, Copyright, and AI

  • Many see this as a business-model clash: AI systems capture value without linking back or sharing revenue, unlike traditional search engines.
  • Ideas floated: embargoed public archiving (e.g., after weeks/months), academic-only archives, or paid/licensed AI access to news back catalogs.
  • Others argue that if a business model depends on banning legal scraping of public pages, it may be unsustainable.