Perplexity is using stealth, undeclared crawlers to evade no-crawl directives
What Perplexity Is Alleged to Be Doing
- Cloudflare claims Perplexity bypasses both robots.txt and explicit IP/user‑agent blocks by:
  - Using undeclared user agents that impersonate Chrome on macOS.
  - Rotating through IPs outside its published ranges.
- Cloudflare’s honeypot test: they created brand‑new domains, blocked Perplexity’s declared bots and all other crawlers, then asked Perplexity about those URLs; Cloudflare says it returned detailed page content anyway (a minimal sketch of this kind of setup follows the list).
- Some commenters argue the evidence is ambiguous: the screenshots look like on‑demand fetching of a single URL, not broad crawling; others note Perplexity’s own docs say “Perplexity‑User” generally ignores robots.txt for user‑initiated fetches.
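To make the honeypot mechanics concrete, here is a minimal sketch of that kind of test, not Cloudflare's actual tooling: a throwaway server whose robots.txt disallows everything, and whose request log records each visitor's IP and User-Agent. The port and the canary page content are illustrative assumptions.

```python
# Honeypot sketch: serve a never-linked domain whose robots.txt disallows
# everything, and log every request. A content fetch arriving with a generic
# browser User-Agent (rather than a declared bot UA) suggests an undeclared
# crawler; the page text is a canary you can later ask an AI service about.
from http.server import BaseHTTPRequestHandler, HTTPServer

ROBOTS_TXT = b"User-agent: *\nDisallow: /\n"

class HoneypotHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # The log line is the actual test result: who fetched what, as whom.
        print(f"{self.client_address[0]} {self.path} "
              f"UA={self.headers.get('User-Agent', '-')}")
        if self.path == "/robots.txt":
            self.send_response(200)
            self.send_header("Content-Type", "text/plain")
            self.end_headers()
            self.wfile.write(ROBOTS_TXT)
        else:
            self.send_response(200)
            self.send_header("Content-Type", "text/html")
            self.end_headers()
            # Unique canary text: if it surfaces in an AI answer, something
            # fetched this page despite the blanket Disallow.
            self.wfile.write(b"<html><body>canary-7f3a2c</body></html>")

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8080), HoneypotHandler).serve_forever()
```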
Robots.txt, Crawling vs. Fetching
- One camp: robots.txt is specifically for recursive crawlers; if a human asks an AI “what’s on this URL?” and it fetches only that page, that’s not a “crawler” and robots.txt doesn’t apply.
- Counter‑camp: robots.txt has long served as a general “rules for robots” convention; any automated agent, even one fetching a single URL, should obey it if the site owner asks (see the polite‑fetcher sketch after this list).
- Concern: on‑demand fetches can still be cached, indexed, and folded into training pipelines, effectively becoming stealth crawling.
- Additional worry: if AI agents hit many pages per query in parallel, the distinction between “fetcher” and “crawler” collapses at scale.
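The counter‑camp's norm is cheap to honor in code. Below is a minimal sketch of a polite single‑URL fetcher using Python's standard urllib.robotparser; the agent name and example URL are assumptions for illustration.

```python
# Polite per-URL fetcher: consult robots.txt before fetching even one page.
import urllib.robotparser
import urllib.request
from urllib.parse import urlparse

USER_AGENT = "ExampleAssistant/1.0"  # hypothetical agent name

def polite_fetch(url: str) -> bytes | None:
    """Fetch one URL, but only if the site's robots.txt allows our agent."""
    parts = urlparse(url)
    rp = urllib.robotparser.RobotFileParser()
    rp.set_url(f"{parts.scheme}://{parts.netloc}/robots.txt")
    rp.read()  # download and parse robots.txt

    if not rp.can_fetch(USER_AGENT, url):
        return None  # the site owner said no; honor it even for a single page

    req = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
    with urllib.request.urlopen(req) as resp:
        return resp.read()

if __name__ == "__main__":
    body = polite_fetch("https://example.com/some/page")
    print("blocked by robots.txt" if body is None else f"fetched {len(body)} bytes")
```

The first camp would simply skip the can_fetch check for user‑initiated fetches; the whole disagreement is over that one branch.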
Consent, Control, and Property Rights
- Many site operators assert: “It’s my server, I set the terms.” They want to:
  - Deny specific user agents (LLMs, ad‑blockers, etc.); a sketch of such a filter follows this list.
  - Prevent their content from being used for training or summarized without credit or upsell.
- Others push back: once content is public, people (and their tools) can read and transform it; robots.txt is a courtesy, not access control.
- Strong distrust of AI companies: repeated norm‑breaking around copyright and training leads to an assumption that any fetched content will be stored and reused.
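As a concrete version of the “my server, I set the terms” stance, here is a minimal sketch of a User‑Agent denylist as WSGI middleware. The denylist entries are examples, and as the thread’s whole premise shows, this only stops agents that identify themselves honestly.

```python
# UA-denylist middleware: refuse requests whose User-Agent contains an
# operator-maintained bot token. Purely illustrative.
DENYLIST = ("PerplexityBot", "Perplexity-User", "GPTBot")  # example entries

def ua_filter(app):
    def middleware(environ, start_response):
        ua = environ.get("HTTP_USER_AGENT", "")
        if any(token in ua for token in DENYLIST):
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Automated access denied by site policy.\n"]
        return app(environ, start_response)
    return middleware

def hello_app(environ, start_response):
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"Hello, human.\n"]

if __name__ == "__main__":
    from wsgiref.simple_server import make_server
    make_server("", 8000, ua_filter(hello_app)).serve_forever()
```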
Impact on Infrastructure and the “Open Web”
- Several operators report large, costly traffic from AI scrapers, sometimes to the point of partial outages or having to take sites offline entirely.
- Tools like Anubis, custom rate limiting, and IP blocking are being deployed; these often harm legitimate human users more than determined scrapers (a rate‑limiter sketch follows this list).
- Some argue this will push valuable content behind logins/paywalls, shrinking the open web and leaving public space full of “AI slop.”
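The custom rate limiting mentioned above is often a per‑IP token bucket like the sketch below. The capacity and refill rate are illustrative assumptions, and the structure shows why such tooling fails against rotating‑IP scrapers: every fresh IP starts with a full bucket, while humans behind a shared NAT can drain one quickly.

```python
# Per-IP token-bucket rate limiter (illustrative parameters).
import time
from collections import defaultdict

CAPACITY = 10         # maximum burst per IP
REFILL_PER_SEC = 1.0  # sustained requests per second per IP

class TokenBucket:
    def __init__(self):
        self.tokens = float(CAPACITY)
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at CAPACITY.
        self.tokens = min(CAPACITY, self.tokens + (now - self.last) * REFILL_PER_SEC)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

# Each previously unseen IP gets a brand-new, full bucket.
buckets: dict[str, TokenBucket] = defaultdict(TokenBucket)

def should_serve(client_ip: str) -> bool:
    """Call per request; False means respond with 429 Too Many Requests."""
    return buckets[client_ip].allow()
```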
Cloudflare’s Role and Motives
- Supportive view: Cloudflare is responding to real customer pain (bandwidth bills, DoS‑like crawling) and trying to enforce norms and “rules of the road.”
- Skeptical view: this is marketing for Cloudflare’s anti‑AI and “pay‑per‑crawl” products, positioning itself as toll‑collector and gatekeeper over a large slice of the web.
- Additional criticism: Cloudflare already blocks many benign human requests and pressures people toward JS‑heavy, tracking‑friendly browsers.
Monetization and Future Models
- Broad agreement that the current ad‑funded, SEO‑driven web is fragile; AI summarization further undermines pageview‑based revenue.
- Proposed alternatives:
  - Micropayments or HTTP 402‑style “pay per page” or “pay per crawl” (see the sketch at the end of this section).
  - Spotify‑like “pay per citation” paid by LLM providers to sources.
  - More content moving to subscriptions, newsletters, and private communities.
- Disagreement over whether AI might eventually destroy the attention‑ad model in a way that yields a better ecosystem or simply accelerates enclosure (walled gardens, DRM, remote attestation).
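To illustrate the HTTP 402 idea from the list above: 402 Payment Required is a real, long‑reserved status code, but there is no standard payment handshake behind it. The sketch below invents an X‑Payment‑Token / X‑Price header pair purely for illustration; a real system would verify tokens against a payment processor.

```python
# "Pay per page" sketch built on HTTP 402. The header names and the token
# check are hypothetical; only the 402 status code itself is standard.
from http.server import BaseHTTPRequestHandler, HTTPServer

PRICE_PER_PAGE = "0.002 USD"  # illustrative price

class PayPerPageHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        token = self.headers.get("X-Payment-Token")  # hypothetical header
        if token is None:
            self.send_response(402)  # Payment Required
            self.send_header("Content-Type", "text/plain")
            self.send_header("X-Price", PRICE_PER_PAGE)  # hypothetical header
            self.end_headers()
            self.wfile.write(b"Payment required to read this page.\n")
            return
        # A real implementation would verify the token before serving.
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.end_headers()
        self.wfile.write(b"<html><body>The article text.</body></html>")

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8080), PayPerPageHandler).serve_forever()
```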