Nepenthes is a tarpit to catch AI web crawlers

How Nepenthes Works & Intended Goal

  • Generates infinite Markov-chain pages and recursive links to trap crawlers in a “maze.”
  • Intended specifically for AI crawlers that ignore robots.txt, but affects all crawlers.
  • The author warns that using it can make a whole site effectively disappear from search results, since crawlers may blacklist the domain.
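The trap mechanism described above can be sketched in a few lines. This is a minimal illustration, not Nepenthes' actual implementation: the seed text, link count, and URL scheme are all invented. The key idea is that every URL deterministically yields a gibberish page whose links lead to yet more generated pages, so the crawler's frontier never empties.

```python
import hashlib
import random

# Invented seed corpus for the bigram model; a real deployment would
# use a larger text so the output looks superficially like prose.
SEED_TEXT = (
    "the quick brown fox jumps over the lazy dog while the dog "
    "watches the fox jump over the quick brown log"
).split()

def build_bigrams(words):
    """Map each word to the words that follow it in the seed text."""
    model = {}
    for a, b in zip(words, words[1:]):
        model.setdefault(a, []).append(b)
    return model

def generate_text(model, n_words, rng):
    """Random-walk the bigram model to produce Markov gibberish."""
    word = rng.choice(list(model))
    out = [word]
    for _ in range(n_words - 1):
        word = rng.choice(model.get(word, list(model)))
        out.append(word)
    return " ".join(out)

def render_page(path: str) -> str:
    # Seed the RNG from the path so the same URL always yields the
    # same page, while every fresh link leads somewhere "new".
    seed = int(hashlib.sha256(path.encode()).hexdigest(), 16)
    rng = random.Random(seed)
    model = build_bigrams(SEED_TEXT)
    body = generate_text(model, 80, rng)
    links = "".join(
        f'<a href="{path}/{rng.randrange(10**6)}">more</a> '
        for _ in range(5)  # five recursive links per page, arbitrarily
    )
    return f"<html><body><p>{body}</p>{links}</body></html>"
```

Because pages are derived from the URL rather than stored, the maze is effectively infinite at near-zero memory cost on the server side.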

Effectiveness Against Web & AI Crawlers

  • Several posters argue competent crawlers already:
    • Limit crawl depth or pages per domain.
    • Use priority queues favoring external or popular pages.
    • Detect “infinite” sites, slow responses, and low-content pages and downrank/ban them.
  • Some think it still has value if:
    • Deployed on a sub-route, so that greedy AI bots which wander into it end up deprioritizing or blocking the whole site.
    • Used as a “fun” or symbolic resistance even if trivially filterable.
  • Others note real AI training pipelines filter low-quality / gibberish text anyway, so it won’t poison models much.
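The per-domain caps that posters say competent crawlers already apply can be sketched as a toy frontier scheduler. The specific limits, the priority scheme, and the `fetch_links` stub are assumptions for illustration; real crawlers are far more elaborate, but the effect is the same: a tarpit domain exhausts its budget and the crawler moves on.

```python
import heapq
from collections import defaultdict
from urllib.parse import urlparse

MAX_PAGES_PER_DOMAIN = 100  # illustrative budget per domain
MAX_DEPTH = 5               # illustrative crawl-depth cap

def crawl(seeds, fetch_links):
    """Toy BFS-ish crawler; fetch_links(url) -> outgoing URLs (stubbed)."""
    frontier = []  # (priority, url, depth); lower priority pops sooner
    for url in seeds:
        heapq.heappush(frontier, (0, url, 0))
    pages_per_domain = defaultdict(int)
    seen = set()
    visited = []
    while frontier:
        prio, url, depth = heapq.heappop(frontier)
        domain = urlparse(url).netloc
        if url in seen or depth > MAX_DEPTH:
            continue
        if pages_per_domain[domain] >= MAX_PAGES_PER_DOMAIN:
            continue  # budget exhausted: an "infinite" site stops here
        seen.add(url)
        pages_per_domain[domain] += 1
        visited.append(url)
        for link in fetch_links(url):
            # Penalize same-domain links so external pages win ties,
            # mirroring the "favor external pages" point above.
            same = urlparse(link).netloc == domain
            heapq.heappush(frontier, (prio + (2 if same else 1), link, depth + 1))
    return visited
```

Against an endlessly branching site, the depth cap alone bounds the crawl: with two links per page and a depth limit of 5, at most 63 pages of one domain are ever fetched.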

Risks, Self‑DoS, and Practical Limits

  • Slow-response tarpits can exhaust the server’s resources more than the crawler’s, becoming a self‑inflicted DoS.
  • Attackers could deliberately hammer the tarpit to overload the host.
  • Well-designed crawlers already cap per-domain requests; only poorly built or “cobbled together” AI bots will get seriously trapped.
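One way to see the self-DoS point: what a stalled connection costs depends on the server model. A thread-per-connection tarpit pays a whole thread for each dripping response, while an async handler pays only a coroutine. A hypothetical slow-drip handler (the byte rate, markup, and port are invented) might look like:

```python
import asyncio

PAGE = b"<html><body>trapped</body></html>"

async def handle(reader, writer, delay=1.0):
    """Send the response one byte at a time, `delay` seconds per byte."""
    # Note: this sketch never even reads the request.
    writer.write(b"HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n")
    try:
        for byte in PAGE:
            writer.write(bytes([byte]))
            await writer.drain()
            await asyncio.sleep(delay)  # crawler waits; we do almost no work
    finally:
        writer.close()

async def main(port=8080):
    # Each incoming connection becomes one coroutine, not one thread.
    server = await asyncio.start_server(handle, "127.0.0.1", port)
    async with server:
        await server.serve_forever()
```

Even in the async version, each open socket still holds a file descriptor and kernel buffers, so the point raised in the thread stands: an attacker deliberately opening many connections can exhaust the host, and the tarpit itself needs rate limits.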

Alternative Defenses & Related Tools

  • Common suggestions:
    • Use robots.txt plus IP/UA blocking or Cloudflare bot protections.
    • Honeypot links disallowed in robots.txt; fetching them triggers auto-bans.
    • Priority-based throttling, rate limiting, or random early drop instead of heavy Markov generation.
    • Serve nonsense or slightly corrupted content only to suspected AI bots (data poisoning), while serving clean content to humans.
  • Other similar projects and “spider traps” / “bot motels” have existed for years.
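The honeypot-link suggestion above is simple to sketch: robots.txt disallows a path that no human-facing page visibly links to, and any client that fetches it anyway gets banned. The path name (`/trap/`) and the toy handler are illustrative, not from any particular tool:

```python
ROBOTS_TXT = """\
User-agent: *
Disallow: /trap/
"""

BANNED = set()  # in practice: a firewall rule or fail2ban jail, not a set

def handle_request(ip: str, path: str) -> tuple[int, str]:
    """Toy request handler: ban any IP that touches the trap path."""
    if ip in BANNED:
        return 403, "banned"
    if path == "/robots.txt":
        return 200, ROBOTS_TXT
    if path.startswith("/trap/"):
        # Only clients that ignore robots.txt ever reach this branch.
        BANNED.add(ip)
        return 403, "banned"
    return 200, "normal page"
```

Unlike a tarpit, this costs the server almost nothing per offending request, which is why several posters prefer it to heavy Markov generation.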

Legal, Ethical, and Economic Angles

  • Some propose contractual bans on AI training via EULAs; others respond that “legal traps” and penalty clauses are often unenforceable or require huge litigation budgets.
  • Complaints that aggressive crawlers (including AI and big-company bots) can consume bandwidth, raise hosting costs, and contribute to rising data-center energy use.
  • Debate over whether blocking AI crawlers is self-harming, given LLMs increasingly function as discovery mechanisms like search engines.