Miasma: A tool to trap AI web scrapers in an endless poison pit
Purpose & Mechanism
- Tool wraps “Poison Fountain” content and exposes it via hidden or hard-to-see links (e.g., /bots) to lure AI scrapers into a tarpit of plausible‑looking but incorrect text/code.
- Goal: raise costs for crawlers that ignore robots.txt, potentially poison training data, and help identify/burn bad bots once they touch trap URLs.
- README suggests whitelist rules in robots.txt for “friendly” search bots so they avoid the trap.
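The whitelist idea could look something like the fragment below; the specific bot names and rules are illustrative, with only the /bots path taken from the discussion. Compliant crawlers honor their Disallow and never see the trap, while crawlers that ignore robots.txt walk straight in:

```
# Friendly search bots are steered away from the trap (illustrative rules)
User-agent: Googlebot
Disallow: /bots

User-agent: Bingbot
Disallow: /bots

# Everyone else gets no warning; ignoring robots.txt is the tripwire
User-agent: *
Disallow:
```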
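The tarpit mechanism described above can be sketched in a few lines. This is not Miasma's actual code; the word list, page shape, and the /bots prefix (the only detail taken from the discussion) are illustrative. The key property is that every trap URL deterministically yields filler text plus links to further trap URLs, so a recursing crawler never exhausts the site:

```python
# Minimal tarpit sketch (illustrative, not Miasma's implementation):
# every URL under the trap prefix returns deterministic pseudo-random
# filler plus links to more trap URLs, so recursion never terminates.
import hashlib
import random
from http.server import BaseHTTPRequestHandler, HTTPServer

TRAP_PREFIX = "/bots"  # hidden entry link, per the README
WORDS = ["tensor", "gradient", "kernel", "cache", "mutex", "schema"]  # illustrative

def trap_page(path: str) -> str:
    # Seed the RNG from the path so each trap URL is stable across visits,
    # which makes the fake site look like real, persistent content.
    seed = int(hashlib.sha256(path.encode()).hexdigest(), 16)
    rng = random.Random(seed)
    filler = " ".join(rng.choice(WORDS) for _ in range(80))
    links = "".join(
        f'<a href="{TRAP_PREFIX}/{rng.getrandbits(32):08x}">more</a> '
        for _ in range(5)
    )
    return f"<html><body><p>{filler}</p>{links}</body></html>"

class TrapHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        body = trap_page(self.path).encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.end_headers()
        self.wfile.write(body)

# To run: HTTPServer(("", 8080), TrapHandler).serve_forever()
```

Because pages are derived from the URL rather than stored, the trap costs the defender almost nothing per request, which is the economic asymmetry the tool relies on.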
Effectiveness and Arms Race
- Some argue even a small percentage of poisoned data can significantly harm models, exploiting economic asymmetry (cheap to poison, expensive to filter).
- Others think serious crawlers already filter display:none/hidden content or won’t recurse infinitely, so this mostly catches naive scrapers or hobby tools.
- Concern that this merely gives AI companies more training data on what “poison” looks like and will be quickly routed around.
- Anecdotes: fabricated libraries appearing in chat models suggest poisoning can propagate; counter‑claims that RLHF/verification, curated datasets, or RAG limit long‑term impact.
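The filtering objection above is cheap to act on: a crawler only needs to skip links nested inside obviously hidden markup. A minimal stdlib sketch (simplified: assumes well-formed HTML and only checks inline styles and the `hidden` attribute, not CSS files):

```python
from html.parser import HTMLParser

class VisibleLinkExtractor(HTMLParser):
    """Collects hrefs, skipping links nested inside elements a crawler
    can cheaply classify as hidden: display:none, visibility:hidden,
    or the `hidden` attribute. Inline styles only; external CSS would
    need real rendering."""

    def __init__(self):
        super().__init__()
        self.hidden_depth = 0  # nesting depth inside hidden subtrees
        self.links = []

    @staticmethod
    def _is_hidden(attrs):
        attrs = dict(attrs)
        style = (attrs.get("style") or "").replace(" ", "").lower()
        return ("hidden" in attrs
                or "display:none" in style
                or "visibility:hidden" in style)

    def handle_starttag(self, tag, attrs):
        if self.hidden_depth or self._is_hidden(attrs):
            self.hidden_depth += 1
            return
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if self.hidden_depth:
            self.hidden_depth -= 1

html = ('<a href="/about">About</a>'
        '<div style="display:none"><a href="/bots">trap</a></div>')
p = VisibleLinkExtractor()
p.feed(html)
print(p.links)  # only the visible /about link survives
```

This illustrates why the trap mostly catches naive scrapers: the countermeasure is a few dozen lines, while styling the trap link in an external stylesheet forces the crawler to actually render CSS, which is where the arms race resumes.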
Alternatives to Poisoning
- Suggested defenses: IP blacklisting/rate limiting, HTTP/2 and header‑based detection, fetch metadata headers, Traefik plugins, CAPTCHAs/challenges, and robots.txt plus UA filtering.
- Others propose real‑time crawler blacklists or even toll/HTTP 402–style payment layers for scrapers.
- Many note these are hard because scrapers use residential proxies, rotate IPs, mimic browsers, and ignore robots.txt.
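The fetch-metadata idea from the list above can be sketched as a heuristic check. The Sec-Fetch-* header names and values follow the Fetch Metadata spec; the scoring logic is illustrative, and, as the last bullet notes, trivially spoofable by a scraper that mimics a browser:

```python
def looks_like_browser(headers: dict) -> bool:
    """Illustrative heuristic: real browsers attach Sec-Fetch-* fetch
    metadata headers to navigations; most scraping libraries do not.
    Easily spoofed, so useful only against naive clients."""
    h = {k.lower(): v for k, v in headers.items()}
    # Top-level page loads from browsers use mode=navigate.
    if h.get("sec-fetch-mode") != "navigate":
        return False
    # sec-fetch-site must be one of the spec-defined values.
    if h.get("sec-fetch-site") not in ("none", "same-origin",
                                      "same-site", "cross-site"):
        return False
    # Reject missing or library-default user agents.
    ua = (h.get("user-agent") or "").lower()
    if not ua or "python-requests" in ua or "curl/" in ua:
        return False
    return True
```

A server could serve the trap (or a challenge) to requests failing this check, and normal pages otherwise; the discussion's caveat stands, since headless browsers pass all of these tests.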
Ethics, Ownership, and “Theft”
- Strong disagreement over whether web scraping for AI is “stealing,” an abuse of copyright, or just remixing public information.
- Some creators resent their free work being monetized by large AI firms without consent, attribution, or compensation; say it discourages sharing.
- Others emphasize fair use, analogy to humans reading and learning, and warn against expanding copyright/control over what people (or models) can learn.
Impact on the Web & Search
- Hidden links and spammy patterns risk search penalties or delisting; critics see this as self‑harm and “epistemological vandalism.”
- Some welcome de‑indexing and envision a “small web” insulated from big search and AI; others still depend on Google visibility or ad revenue.
- Comparisons to anti‑spam and DRM: many see an endless whack‑a‑mole where defenders may expend more effort than attackers.
Longer‑term AI/Data Issues
- Discussion of “applied model collapse”: if enough slop and adversarial data enter the web, open‑web‑trained models may degrade.
- Several expect a shift toward licensed, curated, provenance‑tracked datasets; poisoning is viewed either as leverage to accelerate that, or as pointless vandalism on the way there.