Cloudflare crawl endpoint
Scope and capabilities
- New /crawl endpoint uses Cloudflare’s Browser Rendering (headless Chrome) to fetch and render pages, including JS-heavy SPAs.
- Can crawl any publicly accessible site, not just Cloudflare-hosted ones.
- Main advantage cited: abstracts away browser lifecycle headaches (Puppeteer/Playwright cold starts, context reuse, timeouts).
- Useful outputs mentioned: structured JSON, HTML, markdown; potential for synthetic monitoring, agents, and archival-style mirroring.
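The structured-output idea above can be sketched as a crawl-job request. This is a minimal illustration only: the endpoint URL, field names, and defaults below are assumptions for the sketch, not Cloudflare's documented API.

```python
import json

# Placeholder endpoint; NOT Cloudflare's real API path.
CRAWL_ENDPOINT = "https://api.example.com/v1/crawl"

def build_crawl_request(start_url: str, max_pages: int = 100,
                        formats: tuple = ("json", "html", "markdown")) -> dict:
    """Assemble a hypothetical crawl-job payload requesting structured outputs."""
    return {
        "url": start_url,
        "limit": max_pages,        # pages per crawl job
        "formats": list(formats),  # desired output formats
        "render": True,            # run headless Chrome so JS-heavy SPAs render
    }

payload = build_crawl_request("https://example.com", max_pages=50)
body = json.dumps(payload)  # would be POSTed to CRAWL_ENDPOint in a real client
```

The point of the sketch is the shape of the request: one job, a page cap, and multiple output formats, with rendering delegated to the service instead of a self-managed Puppeteer/Playwright fleet.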
Robots.txt, bot protection, and identification
- Cloudflare states the crawler honors robots.txt, including crawl-delay, and is subject to the same Bot Management/WAF/Turnstile rules as other traffic.
- Requests come from Cloudflare ASN with identifying headers; origin owners can block or rate-limit based on those.
- Some worry that the ability to set an arbitrary User-Agent undermines the “well-behaved bot” claim, forcing sites to rely on Cloudflare’s identifying headers instead.
- There is confusion over documentation links about bypassing bot protection (a referenced FAQ anchor appears missing).
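The robots.txt behavior claimed above (honoring disallow rules and crawl-delay) can be demonstrated with Python's standard-library parser; the robots.txt content and user-agent name are made up for the example.

```python
from urllib.robotparser import RobotFileParser

# Example robots.txt for a hypothetical site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 10
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# A robots-respecting crawler checks both fetch permission and pacing.
allowed = parser.can_fetch("ExampleCrawler", "https://example.com/public/page")
blocked = parser.can_fetch("ExampleCrawler", "https://example.com/private/page")
delay = parser.crawl_delay("ExampleCrawler")  # seconds to wait between requests
```

Whether a hosted crawler actually applies the delay is the service's promise to keep; the check itself is cheap enough that there is little excuse to skip it.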
Centralization, power, and “protection racket” concerns
- Multiple comments argue Cloudflare is “selling both the wall and the ladder”: offering anti-scraping and then a paid scraping channel, potentially creating scarcity they control.
- Fears that this could become the de facto way to crawl Cloudflare-protected sites, disadvantaging smaller players and centralizing access to web content and AI training data.
- Others point to Cloudflare’s “Pay Per Crawl” for site owners as part of a broader gatekeeper model.
- Counterargument: bot protection is mainly about availability (preventing origin overload and fraud), not secrecy, and a robots-respecting crawler is fundamentally different from abusive AI scrapers.
Technical limits, performance, and gaps
- Documented caps noted: 5 crawl jobs/day and 100 pages per crawl (effectively ~500 pages/day), plus time-based browsing quotas.
- Some find that too small for “serious” crawling; others see it as reasonable for many use cases.
- The crawler intentionally does live browser fetches instead of using CDN cache, which some see as a missed efficiency opportunity.
- Requests to add web-archiving features (e.g., WARC output) and a site-admin-facing “nicely-crawled mirror” endpoint.
- Several report it still fails on some Cloudflare- or Azure-protected pages, and that third-party services (like Firecrawl) sometimes perform better.
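The quota arithmetic behind the “too small for serious crawling” complaint is simple to make concrete. Assuming the caps cited above (5 jobs/day, 100 pages/job):

```python
import math

JOBS_PER_DAY = 5
PAGES_PER_JOB = 100
MAX_PAGES_PER_DAY = JOBS_PER_DAY * PAGES_PER_JOB  # the ~500 pages/day figure

def days_to_cover(total_pages: int) -> int:
    """Days needed to visit every page of a site at the daily cap."""
    return math.ceil(total_pages / MAX_PAGES_PER_DAY)
```

At that rate a 10,000-page site takes 20 days for a single full pass, which explains why the limits look fine for spot checks and synthetic monitoring but tight for large-scale crawls.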
Broader web and AI implications
- Some see structured crawl endpoints as a natural evolution beyond raw robots.txt/sitemaps, potentially reducing wasteful crawling.
- Others warn about dual content (different for humans vs bots) enabling manipulation or supply-chain attacks.
- There is tension between enabling efficient, respectful crawling and reinforcing a two-tier internet where well-funded actors buy privileged access.
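The dual-content concern above is detectable in principle: fetch the same URL once as a browser and once as a declared bot, then compare content fingerprints. The sketch below only shows the comparison step; how the two bodies are fetched (and the function names) are assumptions for the example.

```python
import hashlib

def content_fingerprint(body: bytes) -> str:
    """Stable hash of a response body for cheap equality checks."""
    return hashlib.sha256(body).hexdigest()

def serves_dual_content(human_body: bytes, bot_body: bytes) -> bool:
    """True if the origin returned different content to human vs bot agents."""
    return content_fingerprint(human_body) != content_fingerprint(bot_body)
```

In practice a naive byte comparison over-triggers on pages with timestamps, CSRF tokens, or ads, so a real check would normalize or diff rendered text rather than raw bytes; the hash version is only the starting point.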