Perplexity is using stealth, undeclared crawlers to evade no-crawl directives

What Perplexity Is Alleged to Be Doing

  • Cloudflare claims Perplexity bypasses both robots.txt and explicit IP/user‑agent blocks by:
    • Using undeclared user agents that impersonate Chrome on macOS.
    • Rotating through IPs outside its published ranges.
  • Cloudflare’s honeypot test: it created brand‑new domains, blocked Perplexity’s declared bots and all other crawlers via robots.txt (see the sketch after this list), then asked Perplexity about those URLs, and says Perplexity returned detailed page content anyway.
  • Some commenters argue the evidence is ambiguous: the screenshots look like on‑demand fetching of a single URL, not broad crawling; others note Perplexity’s own docs say “Perplexity‑User” generally ignores robots.txt for user‑initiated fetches.
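
For concreteness, the honeypot setup Cloudflare describes amounts to a blanket robots.txt denial plus explicit entries for Perplexity’s declared agents. A honeypot domain would serve something like this (the agent names are the ones Perplexity documents; the rest is illustrative):

    # Honeypot robots.txt: disallow Perplexity's declared agents
    # explicitly, and every other crawler as well.
    User-agent: PerplexityBot
    Disallow: /

    User-agent: Perplexity-User
    Disallow: /

    User-agent: *
    Disallow: /

Under these rules no self‑identifying Perplexity agent should touch the site at all, so content surfacing in answers anyway is what Cloudflare presents as evidence of undeclared agents.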

Robots.txt, Crawling vs. Fetching

  • One camp: robots.txt is specifically for recursive crawlers; if a human asks an AI “what’s on this URL?” and it fetches only that page, that’s not a “crawler” and robots.txt doesn’t apply.
  • Counter‑camp: robots.txt has long been used as a general “rules for robots” convention; any automated agent, even one fetching a single URL, should obey it if the site owner asks (a per‑URL check is cheap; see the sketch after this list).
  • Concern: on‑demand fetches can still be cached, indexed, and folded into training pipelines, effectively becoming stealth crawling.
  • Additional worry: if AI agents hit many pages per query in parallel, the distinction between “fetcher” and “crawler” collapses at scale.
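
The per‑URL check the counter‑camp wants is easy to implement; Python’s standard library has shipped a robots.txt parser for years. A minimal sketch, with illustrative URL and agent strings:

    # Minimal per-URL robots.txt check using only the standard library.
    # Even a one-off, user-initiated fetch can honor the site's rules.
    import urllib.robotparser

    rp = urllib.robotparser.RobotFileParser()
    rp.set_url("https://example.com/robots.txt")
    rp.read()  # fetch and parse the live robots.txt

    # Is this single page allowed for this agent?
    if rp.can_fetch("Perplexity-User", "https://example.com/article"):
        print("allowed: fetch the page")
    else:
        print("disallowed: the site asks this agent not to fetch")

The catch, of course, is that can_fetch() is purely advisory: nothing forces a client to call it, which is the crux of the dispute.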

Consent, Control, and Property Rights

  • Many site operators assert: “It’s my server, I set the terms.” They want to:
    • Deny specific user agents (LLMs, ad‑blockers, etc.); a minimal server‑side sketch follows this list.
    • Prevent their content from being used for training or summarized without credit or upsell.
  • Others push back: once content is public, people (and their tools) can read and transform it; robots.txt is a courtesy, not access control.
  • Strong distrust of AI companies: repeated norm‑breaking around copyright and training leads to an assumption that any fetched content will be stored and reused.
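
As a sketch of the “my server, my terms” position, here is a toy stdlib HTTP handler that refuses a list of user agents outright. The blocked strings are illustrative, and real deployments usually enforce this at a reverse proxy or CDN rather than in application code:

    # Toy application-layer user-agent denial. The agent substrings
    # below are examples, not an authoritative blocklist.
    from http.server import BaseHTTPRequestHandler, HTTPServer

    BLOCKED_AGENTS = ("PerplexityBot", "Perplexity-User", "GPTBot")

    class Handler(BaseHTTPRequestHandler):
        def do_GET(self):
            ua = self.headers.get("User-Agent", "")
            if any(bot in ua for bot in BLOCKED_AGENTS):
                self.send_error(403, "Automated agent not permitted")
                return
            self.send_response(200)
            self.send_header("Content-Type", "text/plain")
            self.end_headers()
            self.wfile.write(b"Hello, human reader.\n")

    HTTPServer(("", 8080), Handler).serve_forever()

Note the obvious limit: this only stops agents that identify themselves, which is exactly why the spoofed‑Chrome allegation matters.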

Impact on Infrastructure and the “Open Web”

  • Several operators report large, costly traffic from AI scrapers, sometimes to the point of partial outages or having to take sites offline entirely.
  • Tools like Anubis, custom rate limiting, and IP blocking are being deployed; these measures often harm legitimate human users more than determined scrapers (a toy rate‑limiter sketch follows this list).
  • Some argue this will push valuable content behind logins/paywalls, shrinking the open web and leaving public space full of “AI slop.”
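
“Custom rate limiting” in this context typically means something like a per‑client token bucket. A toy version, with arbitrary capacity and refill numbers:

    # Toy per-client token bucket: each IP may burst up to `capacity`
    # requests, refilled at `rate` tokens per second. The numbers are
    # arbitrary illustrations, not recommendations.
    import time
    from collections import defaultdict

    class TokenBucket:
        def __init__(self, capacity=10, rate=1.0):
            self.capacity = capacity
            self.rate = rate
            self.state = defaultdict(lambda: (capacity, time.monotonic()))

        def allow(self, client_ip):
            tokens, last = self.state[client_ip]
            now = time.monotonic()
            tokens = min(self.capacity, tokens + (now - last) * self.rate)
            if tokens < 1:
                self.state[client_ip] = (tokens, now)
                return False  # over budget: serve a 429 instead
            self.state[client_ip] = (tokens - 1, now)
            return True

This illustrates the collateral‑damage problem the thread raises: keying on IP throttles humans behind shared addresses (offices, CGNAT) while scrapers rotating residential IPs sail through.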

Cloudflare’s Role and Motives

  • Supportive view: Cloudflare is responding to real customer pain (bandwidth bills, DoS‑like crawling) and trying to enforce norms and “rules of the road.”
  • Skeptical view: this is marketing for Cloudflare’s anti‑AI and “pay‑per‑crawl” products, positioning itself as toll‑collector and gatekeeper over a large slice of the web.
  • Additional criticism: Cloudflare already blocks many benign human requests and pressures people toward JS‑heavy, tracking‑friendly browsers.

Monetization and Future Models

  • Broad agreement that the current ad‑funded, SEO‑driven web is fragile; AI summarization further undermines pageview‑based revenue.
  • Proposed alternatives:
    • Micropayments or HTTP 402‑style “pay per page” or “pay per crawl” (a hypothetical exchange is sketched at the end of this section).
    • Spotify‑like “pay per citation” from LLMs to sources.
    • More content moving to subscriptions, newsletters, private communities.
  • Disagreement over whether AI might eventually destroy the attention‑ad model in a way that yields a better ecosystem or simply accelerates enclosure (walled gardens, DRM, remote attestation).
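
HTTP 402 Payment Required has been reserved since HTTP/1.1 but never standardized in detail, so any “pay per page” flow is necessarily speculative. A hypothetical client‑side exchange, where the header names (X-Price, X-Payment-Token) and the pay() callback are invented for illustration:

    # Hypothetical HTTP 402 "pay per page" client. Header names and the
    # pay() callback are invented; no standard flow exists yet.
    import urllib.request
    from urllib.error import HTTPError

    def fetch_with_payment(url, pay):
        try:
            return urllib.request.urlopen(url).read()
        except HTTPError as err:
            if err.code != 402:
                raise
            # The server quoted a price; settle it, then retry with proof.
            token = pay(err.headers.get("X-Price"))
            req = urllib.request.Request(url, headers={"X-Payment-Token": token})
            return urllib.request.urlopen(req).read()

Whether the payer is the end user, the LLM vendor, or an intermediary like Cloudflare is exactly what the “toll‑collector” debate above is about.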