Miasma: A tool to trap AI web scrapers in an endless poison pit
Purpose & Mechanism
- Tool wraps “Poison Fountain” content and exposes it via hidden or hard-to-see links (e.g., /bots) to lure AI scrapers into a tarpit of plausible‑looking but incorrect text/code.
- Goal: raise costs for crawlers that ignore robots.txt, potentially poison training data, and help identify/burn bad bots once they touch trap URLs.
- README suggests whitelist rules in robots.txt for “friendly” search bots so they avoid the trap.
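The whitelist idea could look something like the fragment below; the specific bot names and rules are illustrative, with only the /bots path taken from the discussion. Compliant crawlers honor their Disallow and never see the trap, while crawlers that ignore robots.txt walk straight in:

```
# Friendly search bots are steered away from the trap (illustrative rules)
User-agent: Googlebot
Disallow: /bots

User-agent: Bingbot
Disallow: /bots

# Everyone else gets no warning; ignoring robots.txt is the tripwire
User-agent: *
Disallow:
```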
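The tarpit mechanism described above can be sketched in a few lines. This is not Miasma's actual code; the word list, page shape, and the /bots prefix (the only detail taken from the discussion) are illustrative. The key property is that every trap URL deterministically yields filler text plus links to further trap URLs, so a recursing crawler never exhausts the site:

```python
# Minimal tarpit sketch (illustrative, not Miasma's implementation):
# every URL under the trap prefix returns deterministic pseudo-random
# filler plus links to more trap URLs, so recursion never terminates.
import hashlib
import random
from http.server import BaseHTTPRequestHandler, HTTPServer

TRAP_PREFIX = "/bots"  # hidden entry link, per the README
WORDS = ["tensor", "gradient", "kernel", "cache", "mutex", "schema"]  # illustrative

def trap_page(path: str) -> str:
    # Seed the RNG from the path so each trap URL is stable across visits,
    # which makes the fake site look like real, persistent content.
    seed = int(hashlib.sha256(path.encode()).hexdigest(), 16)
    rng = random.Random(seed)
    filler = " ".join(rng.choice(WORDS) for _ in range(80))
    links = "".join(
        f'<a href="{TRAP_PREFIX}/{rng.getrandbits(32):08x}">more</a> '
        for _ in range(5)
    )
    return f"<html><body><p>{filler}</p>{links}</body></html>"

class TrapHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        body = trap_page(self.path).encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.end_headers()
        self.wfile.write(body)

# To run: HTTPServer(("", 8080), TrapHandler).serve_forever()
```

Because pages are derived from the URL rather than stored, the trap costs the defender almost nothing per request, which is the economic asymmetry the tool relies on.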
Effectiveness and Arms Race
- Some argue even a small percentage of poisoned data can significantly harm models, exploiting economic asymmetry (cheap to poison, expensive to filter).
- Others think serious crawlers already filter display:none/hidden content or won’t recurse infinitely, so this mostly catches naive scrapers or hobby tools.
- Concern that this merely gives AI companies more training data on what “poison” looks like and will be quickly routed around.
- Anecdotes: fabricated libraries appearing in chat models suggest poisoning can propagate; counter‑claims that RLHF/verification, curated datasets, or RAG limit long‑term impact.
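The filtering objection above is cheap to act on: a crawler only needs to skip links nested inside obviously hidden markup. A minimal stdlib sketch (simplified: assumes well-formed HTML and only checks inline styles and the `hidden` attribute, not CSS files):

```python
from html.parser import HTMLParser

class VisibleLinkExtractor(HTMLParser):
    """Collects hrefs, skipping links nested inside elements a crawler
    can cheaply classify as hidden: display:none, visibility:hidden,
    or the `hidden` attribute. Inline styles only; external CSS would
    need real rendering."""

    def __init__(self):
        super().__init__()
        self.hidden_depth = 0  # nesting depth inside hidden subtrees
        self.links = []

    @staticmethod
    def _is_hidden(attrs):
        attrs = dict(attrs)
        style = (attrs.get("style") or "").replace(" ", "").lower()
        return ("hidden" in attrs
                or "display:none" in style
                or "visibility:hidden" in style)

    def handle_starttag(self, tag, attrs):
        if self.hidden_depth or self._is_hidden(attrs):
            self.hidden_depth += 1
            return
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if self.hidden_depth:
            self.hidden_depth -= 1

html = ('<a href="/about">About</a>'
        '<div style="display:none"><a href="/bots">trap</a></div>')
p = VisibleLinkExtractor()
p.feed(html)
print(p.links)  # only the visible /about link survives
```

This illustrates why the trap mostly catches naive scrapers: the countermeasure is a few dozen lines, while styling the trap link in an external stylesheet forces the crawler to actually render CSS, which is where the arms race resumes.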
Alternatives to Poisoning
- Suggested defenses: IP blacklisting/rate limiting, HTTP/2 and header‑based detection, fetch metadata headers, Traefik plugins, CAPTCHAs/challenges, and robots.txt plus UA filtering.
- Others propose real‑time crawler blacklists or even toll/HTTP 402–style payment layers for scrapers.
- Many note these are hard because scrapers use residential proxies, rotate IPs, mimic browsers, and ignore robots.txt.
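The fetch-metadata idea from the list above can be sketched as a heuristic check. The Sec-Fetch-* header names and values follow the Fetch Metadata spec; the scoring logic is illustrative, and, as the last bullet notes, trivially spoofable by a scraper that mimics a browser:

```python
def looks_like_browser(headers: dict) -> bool:
    """Illustrative heuristic: real browsers attach Sec-Fetch-* fetch
    metadata headers to navigations; most scraping libraries do not.
    Easily spoofed, so useful only against naive clients."""
    h = {k.lower(): v for k, v in headers.items()}
    # Top-level page loads from browsers use mode=navigate.
    if h.get("sec-fetch-mode") != "navigate":
        return False
    # sec-fetch-site must be one of the spec-defined values.
    if h.get("sec-fetch-site") not in ("none", "same-origin",
                                      "same-site", "cross-site"):
        return False
    # Reject missing or library-default user agents.
    ua = (h.get("user-agent") or "").lower()
    if not ua or "python-requests" in ua or "curl/" in ua:
        return False
    return True
```

A server could serve the trap (or a challenge) to requests failing this check, and normal pages otherwise; the discussion's caveat stands, since headless browsers pass all of these tests.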
Ethics, Ownership, and “Theft”
- Strong disagreement over whether web scraping for AI is “stealing,” an abuse of copyright, or just remixing public information.
- Some creators resent their free work being monetized by large AI firms without consent, attribution, or compensation; say it discourages sharing.
- Others emphasize fair use, analogy to humans reading and learning, and warn against expanding copyright/control over what people (or models) can learn.
Impact on the Web & Search
- Hidden links and spammy patterns risk search penalties or delisting; critics see this as self‑harm and “epistemological vandalism.”
- Some welcome de‑indexing and envision a “small web” insulated from big search and AI; others still depend on Google visibility or ad revenue.
- Comparisons to anti‑spam and DRM: many see an endless whack‑a‑mole where defenders may expend more effort than attackers.
Longer‑term AI/Data Issues
- Discussion of “applied model collapse”: if enough slop and adversarial data enter the web, open‑web‑trained models may degrade.
- Several expect a shift toward licensed, curated, provenance‑tracked datasets; poisoning is viewed either as leverage to accelerate that, or as pointless vandalism on the way there.