Perplexity is using stealth, undeclared crawlers to evade no-crawl directives
What Perplexity Is Alleged to Be Doing
- Cloudflare claims Perplexity bypasses both robots.txt and explicit IP/user‑agent blocks by:
  - Using undeclared user agents that impersonate Chrome on macOS.
  - Rotating through IPs outside its published ranges.
- Cloudflare’s honeypot test: they created brand‑new domains, blocked Perplexity’s declared bots and all other crawlers, then asked Perplexity about those URLs; Cloudflare says it returned detailed page content anyway (a minimal sketch of this kind of setup follows the list).
- Some commenters argue the evidence is ambiguous: the screenshots look like on‑demand fetching of a single URL, not broad crawling; others note Perplexity’s own docs say “Perplexity‑User” generally ignores robots.txt for user‑initiated fetches.
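To make the honeypot mechanics concrete, here is a minimal sketch of that kind of test, not Cloudflare's actual tooling: a throwaway server whose robots.txt disallows everything, and whose request log records each visitor's IP and User-Agent. The port and the canary page content are illustrative assumptions.

```python
# Honeypot sketch: serve a never-linked domain whose robots.txt disallows
# everything, and log every request. A content fetch arriving with a generic
# browser User-Agent (rather than a declared bot UA) suggests an undeclared
# crawler; the page text is a canary you can later ask an AI service about.
from http.server import BaseHTTPRequestHandler, HTTPServer

ROBOTS_TXT = b"User-agent: *\nDisallow: /\n"

class HoneypotHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # The log line is the actual test result: who fetched what, as whom.
        print(f"{self.client_address[0]} {self.path} "
              f"UA={self.headers.get('User-Agent', '-')}")
        if self.path == "/robots.txt":
            self.send_response(200)
            self.send_header("Content-Type", "text/plain")
            self.end_headers()
            self.wfile.write(ROBOTS_TXT)
        else:
            self.send_response(200)
            self.send_header("Content-Type", "text/html")
            self.end_headers()
            # Unique canary text: if it surfaces in an AI answer, something
            # fetched this page despite the blanket Disallow.
            self.wfile.write(b"<html><body>canary-7f3a2c</body></html>")

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8080), HoneypotHandler).serve_forever()
```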
Robots.txt, Crawling vs. Fetching
- One camp: robots.txt is specifically for recursive crawlers; if a human asks an AI “what’s on this URL?” and it fetches only that page, that’s not a “crawler” and robots.txt doesn’t apply.
- Counter‑camp: robots.txt has long served as a general “rules for robots” convention; any automated agent, even one fetching a single URL, should obey it if the site owner asks (see the polite‑fetcher sketch after this list).
- Concern: on‑demand fetches can still be cached, indexed, and folded into training pipelines, effectively becoming stealth crawling.
- Additional worry: if AI agents hit many pages per query in parallel, the distinction between “fetcher” and “crawler” collapses at scale.
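The counter‑camp's norm is cheap to honor in code. Below is a minimal sketch of a polite single‑URL fetcher using Python's standard urllib.robotparser; the agent name and example URL are assumptions for illustration.

```python
# Polite per-URL fetcher: consult robots.txt before fetching even one page.
import urllib.robotparser
import urllib.request
from urllib.parse import urlparse

USER_AGENT = "ExampleAssistant/1.0"  # hypothetical agent name

def polite_fetch(url: str) -> bytes | None:
    """Fetch one URL, but only if the site's robots.txt allows our agent."""
    parts = urlparse(url)
    rp = urllib.robotparser.RobotFileParser()
    rp.set_url(f"{parts.scheme}://{parts.netloc}/robots.txt")
    rp.read()  # download and parse robots.txt

    if not rp.can_fetch(USER_AGENT, url):
        return None  # the site owner said no; honor it even for a single page

    req = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
    with urllib.request.urlopen(req) as resp:
        return resp.read()

if __name__ == "__main__":
    body = polite_fetch("https://example.com/some/page")
    print("blocked by robots.txt" if body is None else f"fetched {len(body)} bytes")
```

The first camp would simply skip the can_fetch check for user‑initiated fetches; the whole disagreement is over that one branch.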
Consent, Control, and Property Rights
- Many site operators assert: “It’s my server, I set the terms.” They want to:
  - Deny specific user agents (LLMs, ad‑blockers, etc.); a sketch of such a filter follows this list.
  - Prevent their content from being used for training or summarized without credit or upsell.
- Others push back: once content is public, people (and their tools) can read and transform it; robots.txt is a courtesy, not access control.
- Strong distrust of AI companies: repeated norm‑breaking around copyright and training leads to an assumption that any fetched content will be stored and reused.
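As a concrete version of the “my server, I set the terms” stance, here is a minimal sketch of a User‑Agent denylist as WSGI middleware. The denylist entries are examples, and as the thread’s whole premise shows, this only stops agents that identify themselves honestly.

```python
# UA-denylist middleware: refuse requests whose User-Agent contains an
# operator-maintained bot token. Purely illustrative.
DENYLIST = ("PerplexityBot", "Perplexity-User", "GPTBot")  # example entries

def ua_filter(app):
    def middleware(environ, start_response):
        ua = environ.get("HTTP_USER_AGENT", "")
        if any(token in ua for token in DENYLIST):
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Automated access denied by site policy.\n"]
        return app(environ, start_response)
    return middleware

def hello_app(environ, start_response):
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"Hello, human.\n"]

if __name__ == "__main__":
    from wsgiref.simple_server import make_server
    make_server("", 8000, ua_filter(hello_app)).serve_forever()
```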
Impact on Infrastructure and the “Open Web”
- Several operators report large, costly traffic from AI scrapers, sometimes to the point of partial outages or having to take sites offline entirely.
- Tools like Anubis, custom rate limiting, and IP blocking are being deployed; these often harm legitimate human users more than determined scrapers (a rate‑limiter sketch follows this list).
- Some argue this will push valuable content behind logins/paywalls, shrinking the open web and leaving public space full of “AI slop.”
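The custom rate limiting mentioned above is often a per‑IP token bucket like the sketch below. The capacity and refill rate are illustrative assumptions, and the structure shows why such tooling fails against rotating‑IP scrapers: every fresh IP starts with a full bucket, while humans behind a shared NAT can drain one quickly.

```python
# Per-IP token-bucket rate limiter (illustrative parameters).
import time
from collections import defaultdict

CAPACITY = 10         # maximum burst per IP
REFILL_PER_SEC = 1.0  # sustained requests per second per IP

class TokenBucket:
    def __init__(self):
        self.tokens = float(CAPACITY)
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at CAPACITY.
        self.tokens = min(CAPACITY, self.tokens + (now - self.last) * REFILL_PER_SEC)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

# Each previously unseen IP gets a brand-new, full bucket.
buckets: dict[str, TokenBucket] = defaultdict(TokenBucket)

def should_serve(client_ip: str) -> bool:
    """Call per request; False means respond with 429 Too Many Requests."""
    return buckets[client_ip].allow()
```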
Cloudflare’s Role and Motives
- Supportive view: Cloudflare is responding to real customer pain (bandwidth bills, DoS‑like crawling) and trying to enforce norms and “rules of the road.”
- Skeptical view: this is marketing for Cloudflare’s anti‑AI and “pay‑per‑crawl” products, positioning itself as toll‑collector and gatekeeper over a large slice of the web.
- Additional criticism: Cloudflare already blocks many benign human requests and pressures people toward JS‑heavy, tracking‑friendly browsers.
Monetization and Future Models
- Broad agreement that the current ad‑funded, SEO‑driven web is fragile; AI summarization further undermines pageview‑based revenue.
- Proposed alternatives:
  - Micropayments or HTTP 402‑style “pay per page” or “pay per crawl” (see the sketch at the end of this section).
  - Spotify‑like “pay per citation” paid by LLM providers to sources.
  - More content moving to subscriptions, newsletters, and private communities.
- Disagreement over whether AI might eventually destroy the attention‑ad model in a way that yields a better ecosystem or simply accelerates enclosure (walled gardens, DRM, remote attestation).
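To illustrate the HTTP 402 idea from the list above: 402 Payment Required is a real, long‑reserved status code, but there is no standard payment handshake behind it. The sketch below invents an X‑Payment‑Token / X‑Price header pair purely for illustration; a real system would verify tokens against a payment processor.

```python
# "Pay per page" sketch built on HTTP 402. The header names and the token
# check are hypothetical; only the 402 status code itself is standard.
from http.server import BaseHTTPRequestHandler, HTTPServer

PRICE_PER_PAGE = "0.002 USD"  # illustrative price

class PayPerPageHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        token = self.headers.get("X-Payment-Token")  # hypothetical header
        if token is None:
            self.send_response(402)  # Payment Required
            self.send_header("Content-Type", "text/plain")
            self.send_header("X-Price", PRICE_PER_PAGE)  # hypothetical header
            self.end_headers()
            self.wfile.write(b"Payment required to read this page.\n")
            return
        # A real implementation would verify the token before serving.
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.end_headers()
        self.wfile.write(b"<html><body>The article text.</body></html>")

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8080), PayPerPageHandler).serve_forever()
```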