I was wrong about robots.txt
Role and Limits of robots.txt
- Many argue robots.txt only affects “good” bots; abusive scrapers and many AI crawlers ignore it, so it doesn’t solve resource or abuse problems.
- RFC 9309 and older docs are cited: robots.txt is advisory, not access control. It was created to reduce server load and steer crawlers away from problematic areas (infinite trees, CGI endpoints with side effects), not as an authorization mechanism; a minimal example follows this list.
- Using robots.txt as a security or privacy barrier is seen as a mistake; sensitive content should be behind authentication.
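For context, a minimal robots.txt showing the kind of advisory directives RFC 9309 describes; the paths are placeholders, and nothing here is enforced against a crawler that chooses to ignore the file:

```
# Advisory crawl rules per RFC 9309; compliant crawlers honor these,
# but nothing enforces them -- abusive bots can simply ignore the file.
User-agent: *
Disallow: /cgi-bin/    # endpoints with side effects
Disallow: /calendar/   # example of an "infinite tree" of generated pages
Crawl-delay: 10        # non-standard hint, recognized by only some crawlers

Sitemap: https://example.com/sitemap.xml
```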
AI Crawlers, Bandwidth, and the Open Web
- Several operators report bot traffic outnumbering humans 10:1, especially from LLM-related crawlers hitting deep archives and destroying cache hit rates.
- Complaints that AI companies ignore existing dumps (e.g., Wikipedia) and instead hammer sites repeatedly.
- Some see blocking AI bots as necessary self-defense (a robots.txt sketch follows this list); others fear it accelerates the “death of the open web,” where only large actors still get access.
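As a sketch of that self-defense, a robots.txt group targeting a few AI-crawler user agents whose tokens the operators publicly document; the list is illustrative, changes over time, and only binds cooperative bots:

```
# Opt out of AI-related crawlers that respect robots.txt.
# These user-agent tokens are documented by their operators, but the
# list is incomplete and non-compliant scrapers ignore it entirely.
User-agent: GPTBot
User-agent: ClaudeBot
User-agent: CCBot
User-agent: Google-Extended
Disallow: /
```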
Bot Blocking, Cloudflare, and Collateral Damage
- Cloudflare and similar services use CAPTCHAs, browser fingerprinting, and behavioral checks; this often breaks RSS feeds, APIs, and even government open-data sites.
- Privacy tools (VPNs, Brave, uBlock, cookie clearing) and non-mainstream user agents frequently trigger bot defenses, degrading UX for real users.
Honeypots, Tarpits, and Tools
- A popular tactic: declare a path such as /honeypot as disallowed in robots.txt, hide a link to it on the page, and ban any IP that fetches it anyway (a sketch follows this list). Concerns were raised about accidentally trapping assistive technology.
- AI “tarpits” and tools like Anubis are mentioned: serve infinite or useless content to AI scrapers that ignore robots.txt, wasting their resources. Effectiveness may drop as bots adopt headless rendering and CSS awareness.
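A minimal sketch of the honeypot idea as a Flask app; the path name, in-memory ban set, and 403 response are illustrative choices rather than anything specified in the discussion:

```python
# Sketch: ban any client that requests a path robots.txt explicitly disallows.
# Assumes robots.txt contains "Disallow: /honeypot/" and that a link to the
# path is hidden from human visitors (e.g. via CSS) -- details are illustrative.
from flask import Flask, abort, request

app = Flask(__name__)
banned_ips = set()  # in production this would live in Redis, a firewall, etc.

@app.before_request
def reject_banned_clients():
    # Refuse all further requests from IPs that have tripped the honeypot.
    if request.remote_addr in banned_ips:
        abort(403)

@app.route("/honeypot/")
def honeypot():
    # Only a crawler that ignored robots.txt (or followed the hidden link)
    # should ever reach this handler.
    banned_ips.add(request.remote_addr)
    abort(403)
```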
SEO, Indexing, and Previews
- Blocking Google in robots.txt can leave pages in the index with no snippet before they eventually disappear; removing already-indexed pages needs noindex, not just robots.txt (see the snippet after this list).
- Social link previews (LinkedIn, Facebook, etc.) rely on OG tags fetched by each platform’s own crawlers; blocking those crawlers breaks previews and sharing. Some suggest allowing at least homepages or specific preview bots.
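For reference, the two mechanisms those bullets point to: a noindex directive (which the crawler must be allowed to fetch in order to ever see) and basic Open Graph tags read by preview crawlers; all values are placeholders:

```html
<!-- De-indexing: the page must NOT be blocked in robots.txt,
     otherwise the crawler never sees this directive. -->
<meta name="robots" content="noindex">
<!-- For non-HTML resources, the equivalent HTTP header is:
     X-Robots-Tag: noindex -->

<!-- Open Graph tags read by LinkedIn/Facebook preview crawlers;
     titles, text, and URLs are placeholders. -->
<meta property="og:title" content="Example page title">
<meta property="og:description" content="Short description shown in the preview.">
<meta property="og:image" content="https://example.com/preview.png">
<meta property="og:url" content="https://example.com/page">
```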
Identity vs Purpose-Based Control
- Current control is user-agent based, which forces site owners to whitelist big platforms individually.
- Several propose a standard for declaring allowed purposes (“AI training”, “search indexing”, “OpenGraph previews”, “archival”), backed by legal weight, so dual-use crawlers could be selectively blocked; a purely hypothetical sketch follows this list.
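No such standard exists; purely to illustrate what commenters are asking for, a purpose-based syntax might look like the following, in which every directive name is invented:

```
# HYPOTHETICAL syntax -- not part of RFC 9309 or any shipping standard.
# The idea: declare what a crawler may do with the content,
# independent of which company operates the crawler.
Allow-Purpose: search-indexing
Allow-Purpose: opengraph-preview
Allow-Purpose: archival
Disallow-Purpose: ai-training
```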
Trust, Norms, and Reception of the Article
- Ongoing tension between “trust by default” and “assume any unknown crawler is malicious,” given thousands of marginal bots that bring little benefit to sites.
- Some commenters find the author’s realization obvious; others value the concrete example of how overbroad blocking breaks legitimate integrations and triggers a deeper robots.txt rethink.