Nepenthes is a tarpit to catch AI web crawlers
How Nepenthes Works & Intended Goal
- Generates an effectively endless maze of Markov-chain gibberish pages, each linking recursively to more generated pages, to trap crawlers.
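The page-generation trick can be sketched in a few lines of Python. This is an illustrative toy, not Nepenthes' actual implementation — the corpus, hashing scheme, and link format are all invented for the example:

```python
import hashlib
import random

# Tiny seed corpus; a real tarpit would use a much larger text source.
CORPUS = ("the quick brown fox jumps over the lazy dog while "
          "crawlers follow links into pages that never end").split()

def build_chain(words):
    """Map each word to the list of words observed to follow it."""
    chain = {}
    for a, b in zip(words, words[1:]):
        chain.setdefault(a, []).append(b)
    return chain

CHAIN = build_chain(CORPUS)

def babble(seed, n_words=50):
    """Generate deterministic Markov gibberish for a given seed."""
    rng = random.Random(seed)  # same path -> same page, so it looks "real"
    word = rng.choice(CORPUS)
    out = [word]
    for _ in range(n_words - 1):
        word = rng.choice(CHAIN.get(word, CORPUS))
        out.append(word)
    return " ".join(out)

def render_page(path, n_links=5):
    """One maze page: Markov text plus links leading deeper into the maze."""
    seed = hashlib.sha256(path.encode()).hexdigest()
    links = "".join(
        f'<a href="{path.rstrip("/")}/{seed[i * 4:i * 4 + 4]}">more</a> '
        for i in range(n_links)
    )
    return f"<html><body><p>{babble(seed)}</p>{links}</body></html>"
```

Because each page is derived deterministically from its URL, the maze needs no storage: every "deeper" link resolves to a fresh, consistent-looking page on demand.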
- Intended specifically for AI crawlers that ignore robots.txt, but affects all crawlers.
- Author warns that using it can cause a whole site to effectively disappear from search results, since crawlers may blacklist the domain.
Effectiveness Against Web & AI Crawlers
- Several posters argue competent crawlers already:
- Limit crawl depth or pages per domain.
- Use priority queues favoring external or popular pages.
- Detect “infinite” sites, slow responses, and low-content pages and downrank/ban them.
- Some think it still has value if:
  - Deployed on a sub-route, so greedy AI bots that wander in end up deprioritizing or blocking the whole site.
  - Used as a “fun” or symbolic act of resistance, even if trivially filterable.
- Others note real AI training pipelines filter low-quality / gibberish text anyway, so it won’t poison models much.
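The crawler-side heuristics in the list above can be sketched as a toy crawl frontier. The class name and thresholds are invented assumptions for illustration, not any real crawler's internals:

```python
from collections import defaultdict

class Frontier:
    """Toy crawl frontier enforcing the caps posters describe.
    All limits here are invented example values."""

    def __init__(self, max_depth=10, max_pages=1000):
        self.max_depth = max_depth            # cap on link depth per site
        self.max_pages = max_pages            # cap on pages per domain
        self.pages_fetched = defaultdict(int)

    def allow_fetch(self, domain, depth):
        """Budget check before issuing a request."""
        return (depth <= self.max_depth
                and self.pages_fetched[domain] < self.max_pages)

    def record(self, domain, body_text):
        """Count a fetched page; return False if it looks like
        low-content gibberish (a candidate for downranking)."""
        self.pages_fetched[domain] += 1
        words = body_text.split()
        unique_ratio = len(set(words)) / len(words) if words else 0.0
        return unique_ratio >= 0.2            # crude repetitiveness check
```

Even this crude version caps the damage a maze can do: a tarpit domain burns through its page budget quickly and then stops being fetched at all.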
Risks, Self‑DoS, and Practical Limits
- Slow-response tarpits can exhaust the server’s resources more than the crawler’s, becoming a self‑inflicted DoS.
- Attackers could deliberately hammer the tarpit to overload the host.
- Well-designed crawlers already cap per-domain requests; only poorly built or “cobbled together” AI bots will get seriously trapped.
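The self-DoS risk comes down to resource asymmetry: a thread-per-connection server pins a whole thread for every trapped crawler, while an async handler holds each open connection with only a coroutine. A minimal slow-drip handler using Python's asyncio streams (the port, drip interval, and payload are arbitrary choices for the sketch):

```python
import asyncio

async def tarpit(reader, writer, delay=2.0):
    """Drip an HTTP response out a few bytes at a time.
    Each trapped client costs the server only one coroutine."""
    writer.write(b"HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n")
    try:
        for chunk in (b"<p>", b"wait", b" for", b" it", b"</p>"):
            await asyncio.sleep(delay)   # keep the crawler waiting
            writer.write(chunk)
            await writer.drain()
    finally:
        writer.close()

async def main():
    server = await asyncio.start_server(tarpit, "127.0.0.1", 8080)
    async with server:
        await server.serve_forever()

# asyncio.run(main())  # commented out: serves forever
```

Even written this way, the host still pays for open sockets and bandwidth, so the original point stands: a determined attacker can hammer the tarpit itself.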
Alternative Defenses & Related Tools
- Common suggestions:
- Use robots.txt plus IP/UA blocking or Cloudflare bot protections.
- Honeypot links disallowed in robots.txt; fetching them triggers auto-bans.
- Priority-based throttling, rate limiting, or random early drop instead of heavy Markov generation.
- Serve nonsense or slightly corrupted content only to suspected AI bots (data poisoning), while serving clean content to humans.
- Other similar projects and “spider traps” / “bot motels” have existed for years.
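The honeypot-link suggestion from the list above can be sketched as follows. The `/trap` path and in-memory ban set are invented for the example; a real deployment would ban at the firewall or CDN layer rather than in application code:

```python
# /trap is disallowed in robots.txt and linked invisibly from pages.
# Any client that fetches it anyway is ignoring robots.txt -> ban it.
ROBOTS_TXT = "User-agent: *\nDisallow: /trap\n"
banned_ips = set()

def handle_request(client_ip, path):
    """Return (status, body) for a minimal honeypot-aware server."""
    if client_ip in banned_ips:
        return 403, "banned"
    if path == "/robots.txt":
        return 200, ROBOTS_TXT
    if path.startswith("/trap"):
        banned_ips.add(client_ip)        # fetched a disallowed URL
        return 403, "banned"
    # Real pages carry a trap link that humans never see or click:
    body = '<a href="/trap" style="display:none"></a><p>content</p>'
    return 200, body
```

Compared with a Markov maze, this approach costs the server almost nothing per request, which is why several posters prefer it over heavy content generation.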
Legal, Ethical, and Economic Angles
- Some propose contractual bans on AI training via EULAs; others respond that “legal traps” and penalty clauses are often unenforceable or require huge litigation budgets.
- Complaints that aggressive crawlers (including AI and big-company bots) can consume bandwidth, raise hosting costs, and contribute to rising data-center energy use.
- Debate over whether blocking AI crawlers is self-harming, given LLMs increasingly function as discovery mechanisms like search engines.