Nepenthes is a tarpit to catch AI web crawlers
How Nepenthes Works & Intended Goal
- Generates an effectively endless maze of Markov-chain gibberish pages, each linking recursively to more generated pages, to trap crawlers.
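The page-generation trick can be sketched in a few lines of Python. This is an illustrative toy, not Nepenthes' actual implementation — the corpus, hashing scheme, and link format are all invented for the example:

```python
import hashlib
import random

# Tiny seed corpus; a real tarpit would use a much larger text source.
CORPUS = ("the quick brown fox jumps over the lazy dog while "
          "crawlers follow links into pages that never end").split()

def build_chain(words):
    """Map each word to the list of words observed to follow it."""
    chain = {}
    for a, b in zip(words, words[1:]):
        chain.setdefault(a, []).append(b)
    return chain

CHAIN = build_chain(CORPUS)

def babble(seed, n_words=50):
    """Generate deterministic Markov gibberish for a given seed."""
    rng = random.Random(seed)  # same path -> same page, so it looks "real"
    word = rng.choice(CORPUS)
    out = [word]
    for _ in range(n_words - 1):
        word = rng.choice(CHAIN.get(word, CORPUS))
        out.append(word)
    return " ".join(out)

def render_page(path, n_links=5):
    """One maze page: Markov text plus links leading deeper into the maze."""
    seed = hashlib.sha256(path.encode()).hexdigest()
    links = "".join(
        f'<a href="{path.rstrip("/")}/{seed[i * 4:i * 4 + 4]}">more</a> '
        for i in range(n_links)
    )
    return f"<html><body><p>{babble(seed)}</p>{links}</body></html>"
```

Because each page is derived deterministically from its URL, the maze needs no storage: every "deeper" link resolves to a fresh, consistent-looking page on demand.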
- Intended specifically for AI crawlers that ignore robots.txt, but affects all crawlers.
- Author warns that using it can cause a whole site to effectively disappear from search results, since crawlers may blacklist the domain.
Effectiveness Against Web & AI Crawlers
- Several posters argue competent crawlers already:
- Limit crawl depth or pages per domain.
- Use priority queues favoring external or popular pages.
- Detect “infinite” sites, slow responses, and low-content pages and downrank/ban them.
- Some think it still has value if:
  - Deployed on a sub-route, so greedy AI bots that wander in end up deprioritizing or blocking the whole site.
  - Used as a “fun” or symbolic act of resistance, even if trivially filterable.
- Others note real AI training pipelines filter low-quality / gibberish text anyway, so it won’t poison models much.
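The crawler-side heuristics in the list above can be sketched as a toy crawl frontier. The class name and thresholds are invented assumptions for illustration, not any real crawler's internals:

```python
from collections import defaultdict

class Frontier:
    """Toy crawl frontier enforcing the caps posters describe.
    All limits here are invented example values."""

    def __init__(self, max_depth=10, max_pages=1000):
        self.max_depth = max_depth            # cap on link depth per site
        self.max_pages = max_pages            # cap on pages per domain
        self.pages_fetched = defaultdict(int)

    def allow_fetch(self, domain, depth):
        """Budget check before issuing a request."""
        return (depth <= self.max_depth
                and self.pages_fetched[domain] < self.max_pages)

    def record(self, domain, body_text):
        """Count a fetched page; return False if it looks like
        low-content gibberish (a candidate for downranking)."""
        self.pages_fetched[domain] += 1
        words = body_text.split()
        unique_ratio = len(set(words)) / len(words) if words else 0.0
        return unique_ratio >= 0.2            # crude repetitiveness check
```

Even this crude version caps the damage a maze can do: a tarpit domain burns through its page budget quickly and then stops being fetched at all.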
Risks, Self‑DoS, and Practical Limits
- Slow-response tarpits can exhaust the server’s resources more than the crawler’s, becoming a self‑inflicted DoS.
- Attackers could deliberately hammer the tarpit to overload the host.
- Well-designed crawlers already cap per-domain requests; only poorly built or “cobbled together” AI bots will get seriously trapped.
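The self-DoS risk comes down to resource asymmetry: a thread-per-connection server pins a whole thread for every trapped crawler, while an async handler holds each open connection with only a coroutine. A minimal slow-drip handler using Python's asyncio streams (the port, drip interval, and payload are arbitrary choices for the sketch):

```python
import asyncio

async def tarpit(reader, writer, delay=2.0):
    """Drip an HTTP response out a few bytes at a time.
    Each trapped client costs the server only one coroutine."""
    writer.write(b"HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n")
    try:
        for chunk in (b"<p>", b"wait", b" for", b" it", b"</p>"):
            await asyncio.sleep(delay)   # keep the crawler waiting
            writer.write(chunk)
            await writer.drain()
    finally:
        writer.close()

async def main():
    server = await asyncio.start_server(tarpit, "127.0.0.1", 8080)
    async with server:
        await server.serve_forever()

# asyncio.run(main())  # commented out: serves forever
```

Even written this way, the host still pays for open sockets and bandwidth, so the original point stands: a determined attacker can hammer the tarpit itself.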
Alternative Defenses & Related Tools
- Common suggestions:
- Use robots.txt plus IP/UA blocking or Cloudflare bot protections.
- Honeypot links disallowed in robots.txt; fetching them triggers auto-bans.
- Priority-based throttling, rate limiting, or random early drop instead of heavy Markov generation.
- Serve nonsense or slightly corrupted content only to suspected AI bots (data poisoning), while serving clean content to humans.
- Other similar projects and “spider traps” / “bot motels” have existed for years.
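The honeypot-link suggestion from the list above can be sketched as follows. The `/trap` path and in-memory ban set are invented for the example; a real deployment would ban at the firewall or CDN layer rather than in application code:

```python
# /trap is disallowed in robots.txt and linked invisibly from pages.
# Any client that fetches it anyway is ignoring robots.txt -> ban it.
ROBOTS_TXT = "User-agent: *\nDisallow: /trap\n"
banned_ips = set()

def handle_request(client_ip, path):
    """Return (status, body) for a minimal honeypot-aware server."""
    if client_ip in banned_ips:
        return 403, "banned"
    if path == "/robots.txt":
        return 200, ROBOTS_TXT
    if path.startswith("/trap"):
        banned_ips.add(client_ip)        # fetched a disallowed URL
        return 403, "banned"
    # Real pages carry a trap link that humans never see or click:
    body = '<a href="/trap" style="display:none"></a><p>content</p>'
    return 200, body
```

Compared with a Markov maze, this approach costs the server almost nothing per request, which is why several posters prefer it over heavy content generation.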
Legal, Ethical, and Economic Angles
- Some propose contractual bans on AI training via EULAs; others respond that “legal traps” and penalty clauses are often unenforceable or require huge litigation budgets.
- Complaints that aggressive crawlers (including AI and big-company bots) can consume bandwidth, raise hosting costs, and contribute to rising data-center energy use.
- Debate over whether blocking AI crawlers is self-harming, given LLMs increasingly function as discovery mechanisms like search engines.