I was wrong about robots.txt
Role and Limits of robots.txt
- Many argue robots.txt only affects “good” bots; abusive scrapers and many AI crawlers ignore it, so it doesn’t solve resource or abuse problems.
- RFC 9309 and older docs are cited: robots.txt is advisory, not access control. It was created to reduce server load and steer crawlers away from problematic areas (infinite trees, CGI endpoints with side effects), not as an authorization mechanism; a minimal example follows this list.
- Using robots.txt as a security or privacy barrier is seen as a mistake; sensitive content should be behind authentication.
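For context, a minimal robots.txt showing the kind of advisory directives RFC 9309 describes; the paths are placeholders, and nothing here is enforced against a crawler that chooses to ignore the file:

```
# Advisory crawl rules per RFC 9309; compliant crawlers honor these,
# but nothing enforces them -- abusive bots can simply ignore the file.
User-agent: *
Disallow: /cgi-bin/    # endpoints with side effects
Disallow: /calendar/   # example of an "infinite tree" of generated pages
Crawl-delay: 10        # non-standard hint, recognized by only some crawlers

Sitemap: https://example.com/sitemap.xml
```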
AI Crawlers, Bandwidth, and the Open Web
- Several operators report bot traffic outnumbering humans 10:1, especially from LLM-related crawlers hitting deep archives and destroying cache hit rates.
- Complaints that AI companies ignore existing dumps (e.g., Wikipedia) and instead hammer sites repeatedly.
- Some see blocking AI bots as necessary self-defense (a robots.txt sketch follows this list); others fear it accelerates the “death of the open web,” where only large actors still get access.
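As a sketch of that self-defense, a robots.txt group targeting a few AI-crawler user agents whose tokens the operators publicly document; the list is illustrative, changes over time, and only binds cooperative bots:

```
# Opt out of AI-related crawlers that respect robots.txt.
# These user-agent tokens are documented by their operators, but the
# list is incomplete and non-compliant scrapers ignore it entirely.
User-agent: GPTBot
User-agent: ClaudeBot
User-agent: CCBot
User-agent: Google-Extended
Disallow: /
```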
Bot Blocking, Cloudflare, and Collateral Damage
- Cloudflare and similar services use CAPTCHAs, browser fingerprinting, and behavioral checks; this often breaks RSS feeds, APIs, and even government open-data sites.
- Privacy tools (VPNs, Brave, uBlock, cookie clearing) and non-mainstream user agents frequently trigger bot defenses, degrading UX for real users.
Honeypots, Tarpits, and Tools
- A popular tactic: declare a path such as /honeypot as disallowed in robots.txt, hide a link to it on the page, and ban any IP that fetches it anyway (a sketch follows this list). Concerns were raised about accidentally trapping assistive technology.
- AI “tarpits” and tools like Anubis are mentioned: serve infinite or useless content to AI scrapers that ignore robots.txt, wasting their resources. Effectiveness may drop as bots adopt headless rendering and CSS awareness.
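A minimal sketch of the honeypot idea as a Flask app; the path name, in-memory ban set, and 403 response are illustrative choices rather than anything specified in the discussion:

```python
# Sketch: ban any client that requests a path robots.txt explicitly disallows.
# Assumes robots.txt contains "Disallow: /honeypot/" and that a link to the
# path is hidden from human visitors (e.g. via CSS) -- details are illustrative.
from flask import Flask, abort, request

app = Flask(__name__)
banned_ips = set()  # in production this would live in Redis, a firewall, etc.

@app.before_request
def reject_banned_clients():
    # Refuse all further requests from IPs that have tripped the honeypot.
    if request.remote_addr in banned_ips:
        abort(403)

@app.route("/honeypot/")
def honeypot():
    # Only a crawler that ignored robots.txt (or followed the hidden link)
    # should ever reach this handler.
    banned_ips.add(request.remote_addr)
    abort(403)
```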
SEO, Indexing, and Previews
- Blocking Google in robots.txt can leave pages in the index with no snippet before they eventually disappear; removing already-indexed pages needs noindex, not just robots.txt (see the snippet after this list).
- Social link previews (LinkedIn, Facebook, etc.) rely on OG tags fetched by each platform’s own crawlers; blocking those crawlers breaks previews and sharing. Some suggest allowing at least homepages or specific preview bots.
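For reference, the two mechanisms those bullets point to: a noindex directive (which the crawler must be allowed to fetch in order to ever see) and basic Open Graph tags read by preview crawlers; all values are placeholders:

```html
<!-- De-indexing: the page must NOT be blocked in robots.txt,
     otherwise the crawler never sees this directive. -->
<meta name="robots" content="noindex">
<!-- For non-HTML resources, the equivalent HTTP header is:
     X-Robots-Tag: noindex -->

<!-- Open Graph tags read by LinkedIn/Facebook preview crawlers;
     titles, text, and URLs are placeholders. -->
<meta property="og:title" content="Example page title">
<meta property="og:description" content="Short description shown in the preview.">
<meta property="og:image" content="https://example.com/preview.png">
<meta property="og:url" content="https://example.com/page">
```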
Identity vs Purpose-Based Control
- Current control is user-agent based, which forces site owners to whitelist big platforms individually.
- Several propose a standard for declaring allowed purposes (“AI training”, “search indexing”, “OpenGraph previews”, “archival”), backed by legal weight, so dual-use crawlers could be selectively blocked; a purely hypothetical sketch follows this list.
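No such standard exists; purely to illustrate what commenters are asking for, a purpose-based syntax might look like the following, in which every directive name is invented:

```
# HYPOTHETICAL syntax -- not part of RFC 9309 or any shipping standard.
# The idea: declare what a crawler may do with the content,
# independent of which company operates the crawler.
Allow-Purpose: search-indexing
Allow-Purpose: opengraph-preview
Allow-Purpose: archival
Disallow-Purpose: ai-training
```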
Trust, Norms, and Reception of the Article
- Ongoing tension between “trust by default” and “assume any unknown crawler is malicious,” given thousands of marginal bots that bring little benefit to sites.
- Some commenters find the author’s realization obvious; others value the concrete example of how overbroad blocking breaks legitimate integrations and triggers a deeper robots.txt rethink.