I was wrong about robots.txt

Role and Limits of robots.txt

  • Many argue robots.txt only affects “good” bots; abusive scrapers and many AI crawlers ignore it, so it doesn’t solve resource or abuse problems.
  • RFC 9309 and the original 1990s robots-exclusion convention are cited: robots.txt is advisory, not access control. It was created to reduce server load and keep crawlers out of problematic areas (infinite URL trees, CGI scripts with side effects), not as an authorization mechanism (see the example after this list).
  • Using robots.txt as a security or privacy barrier is seen as a mistake; sensitive content should be behind authentication.
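
  For reference, a minimal robots.txt in the advisory spirit RFC 9309 describes might look like the following; the paths are placeholders, and nothing here stops a non-compliant client from fetching them anyway.

      User-agent: *
      Disallow: /cgi-bin/    # side-effecting scripts crawlers should skip
      Disallow: /calendar/   # an "infinite tree" of generated pages
      Crawl-delay: 10        # politeness hint; not part of RFC 9309, honored by only some crawlers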

AI Crawlers, Bandwidth, and the Open Web

  • Several operators report bot traffic outnumbering humans 10:1, especially from LLM-related crawlers hitting deep archives and destroying cache hit rates.
  • Complaints that AI companies ignore readily available bulk dumps (e.g., Wikipedia's database exports) and instead hammer the live sites repeatedly.
  • Some see blocking AI bots as necessary self-defense (a robots.txt sketch follows this list); others fear it accelerates the “death of the open web,” where only large actors still get access.
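
  One self-defense step that keeps coming up is listing published AI-crawler user-agent tokens explicitly. The tokens below are ones the vendors currently document (GPTBot, ClaudeBot, CCBot, Google-Extended), but the list changes often and, per the thread, non-compliant scrapers ignore the file entirely.

      # Opt some AI/LLM crawlers out by their published user-agent tokens
      User-agent: GPTBot
      User-agent: ClaudeBot
      User-agent: CCBot
      User-agent: Google-Extended
      Disallow: /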

Bot Blocking, Cloudflare, and Collateral Damage

  • Cloudflare and similar services use CAPTCHAs, browser fingerprinting, and behavioral checks; this often breaks RSS feeds, APIs, and even government open-data sites.
  • Privacy tools (VPNs, Brave, uBlock, cookie clearing) and non-mainstream user agents frequently trigger bot defenses, degrading UX for real users.

Honeypots, Tarpits, and Tools

  • A popular tactic: declare /honeypot as Disallowed in robots.txt, hide a link to it in page markup, and ban any IP that fetches it (a minimal sketch follows this list). Commenters raise concerns about accidentally trapping assistive technology.
  • AI “tarpits” (which serve infinite or useless content to scrapers that ignore robots.txt) and proof-of-work gatekeepers like Anubis are mentioned as ways to waste abusive crawlers’ resources; effectiveness may drop as bots adopt headless rendering and CSS awareness.
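
  A minimal sketch of the honeypot-ban tactic, using only Python's standard library; the path and the in-memory ban list are illustrative assumptions, and real setups usually enforce the ban at the proxy or firewall rather than in the application.

      # robots.txt companion for this sketch:
      #   User-agent: *
      #   Disallow: /honeypot
      from http.server import BaseHTTPRequestHandler, HTTPServer

      HONEYPOT_PATH = "/honeypot"   # hidden link on pages points here
      banned_ips = set()            # in-memory only; resets on restart

      class HoneypotHandler(BaseHTTPRequestHandler):
          def do_GET(self):
              ip = self.client_address[0]
              if ip in banned_ips:
                  self.send_error(403, "Banned")
                  return
              if self.path == HONEYPOT_PATH:
                  # Only clients that ignore robots.txt and follow hidden links
                  # should land here -- but note the assistive-tech caveat above
                  # before banning outright.
                  banned_ips.add(ip)
                  self.send_error(403, "Banned")
                  return
              self.send_response(200)
              self.send_header("Content-Type", "text/plain")
              self.end_headers()
              self.wfile.write(b"hello\n")

      if __name__ == "__main__":
          HTTPServer(("0.0.0.0", 8080), HoneypotHandler).serve_forever()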

SEO, Indexing, and Previews

  • Blocking Google in robots.txt can leave pages in the index with no snippet before they eventually disappear; removing already indexed pages requires a noindex signal that Googlebot can actually crawl and see, not just a Disallow rule (see the snippet after this list).
  • Social link previews (LinkedIn, Facebook, etc.) rely on Open Graph tags fetched by the platforms’ own crawlers; blocking those crawlers breaks previews and sharing. Some suggest allowing at least homepages or the specific preview bots.
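
  The noindex point is easy to get backwards: a Disallow rule stops the crawler from fetching the page, so it never sees the removal signal. A sketch with placeholder paths and commonly documented preview-bot tokens (worth verifying against each platform's documentation):

      # robots.txt: the page must stay crawlable or Googlebot never sees noindex
      User-agent: *
      Allow: /old-page.html

      # Let link-preview fetchers through even when other bots are blocked
      User-agent: facebookexternalhit
      User-agent: Twitterbot
      User-agent: LinkedInBot
      Allow: /

      <!-- in the page itself, or X-Robots-Tag: noindex as an HTTP response header -->
      <meta name="robots" content="noindex">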

Identity vs Purpose-Based Control

  • Current control is user-agent based, which forces site owners to whitelist big platforms individually.
  • Several propose a standard to declare allowed purposes (“AI training”, “search indexing”, “OpenGraph previews”, “archival”) plus legal backing, so that dual-use crawlers could be selectively blocked (an illustrative sketch follows this list).
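
  Nothing like this exists as a standard today; purely to illustrate what commenters are asking for, a purpose-based declaration might look like the hypothetical syntax below (every directive name here is made up, not part of RFC 9309 or any draft).

      # HYPOTHETICAL purpose-based syntax -- not an existing standard
      Allow-purpose: search-indexing
      Allow-purpose: opengraph-previews
      Allow-purpose: archival
      Disallow-purpose: ai-training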

Trust, Norms, and Reception of the Article

  • Ongoing tension between “trust by default” and “assume any unknown crawler is malicious,” given the thousands of marginal bots that bring little benefit to the sites they crawl.
  • Some commenters find the author’s realization obvious; others value the concrete example of how overbroad blocking breaks legitimate integrations and triggers a deeper robots.txt rethink.