Miasma: A tool to trap AI web scrapers in an endless poison pit

Purpose & Mechanism

  • The tool wraps “Poison Fountain” content and exposes it via hidden or hard‑to‑see links (e.g., /bots) to lure AI scrapers into a tarpit of plausible‑looking but incorrect text and code.
  • Goal: raise costs for crawlers that ignore robots.txt, potentially poison training data, and help identify/burn bad bots once they touch trap URLs.
  • The README suggests robots.txt rules that steer “friendly” search bots away from the trap; since the trap URLs are only reachable via hidden links, compliant crawlers never see them and only robots.txt‑ignoring bots fall in.
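
The whitelisting approach described above can be sketched as a robots.txt fragment. The /bots/ path is the example trap URL from the README; everything else here is illustrative, not the tool’s actual configuration:

```text
# robots.txt — sketch of the trap-exclusion idea (paths illustrative).
# Compliant crawlers of any kind read this and stay out of the trap;
# scrapers that ignore robots.txt follow the hidden link on the page,
# e.g.  <a href="/bots/" style="display:none">.</a>,  straight into it.
User-agent: *
Disallow: /bots/
```

Because the trap is never linked visibly, a Disallow rule alone separates well‑behaved bots from the ones the tool is designed to catch.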

Effectiveness and Arms Race

  • Some argue even a small percentage of poisoned data can significantly harm models, exploiting economic asymmetry (cheap to poison, expensive to filter).
  • Others think serious crawlers already filter display:none/hidden content or won’t recurse infinitely, so this mostly catches naive scrapers or hobby tools.
  • Concern that this merely gives AI companies more training data on what “poison” looks like and will be quickly routed around.
  • Anecdotes: fabricated libraries appearing in chat models suggest poisoning can propagate; counter‑claims that RLHF/verification, curated datasets, or RAG limit long‑term impact.
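
The “serious crawlers already filter hidden content” argument can be made concrete with a small sketch. This is a hypothetical heuristic in Python’s standard-library HTML parser, not any real crawler’s logic; the style hints and attribute checks are assumptions:

```python
# Sketch: how a crawler might skip links a browser would not render,
# defeating display:none-style traps. Heuristics are illustrative only.
from html.parser import HTMLParser

HIDDEN_STYLE_HINTS = ("display:none", "display: none",
                      "visibility:hidden", "visibility: hidden")

class VisibleLinkExtractor(HTMLParser):
    """Collect hrefs, ignoring links hidden via style or the hidden attribute."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        d = dict(attrs)
        style = (d.get("style") or "").lower()
        if "hidden" in d or any(h in style for h in HIDDEN_STYLE_HINTS):
            return  # invisible to humans: likely a trap, so skip it
        if d.get("href"):
            self.links.append(d["href"])

p = VisibleLinkExtractor()
p.feed('<a href="/about">About</a>'
       '<a href="/bots/" style="display:none">.</a>')
print(p.links)  # → ['/about']
```

A real crawler would need CSS and JavaScript awareness to do this reliably, which is exactly the cost asymmetry the poisoning side is betting on.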

Alternatives to Poisoning

  • Suggested defenses: IP blacklisting/rate limiting, HTTP/2 and header‑based detection, fetch metadata headers, Traefik plugins, CAPTCHAs/challenges, and robots.txt plus UA filtering.
  • Others propose real‑time crawler blacklists or even toll/HTTP 402–style payment layers for scrapers.
  • Many note these are hard because scrapers use residential proxies, rotate IPs, mimic browsers, and ignore robots.txt.
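
One of the header-based checks mentioned above can be sketched briefly: modern browsers send Fetch Metadata headers (Sec-Fetch-Mode and friends), which many scrapers omit even while spoofing a browser User-Agent. The function and thresholds below are illustrative, and as the commenters note, a determined scraper can forge all of it:

```python
# Sketch of header-based bot screening (illustrative, easily evaded).
BROWSER_UA_HINTS = ("Mozilla/", "AppleWebKit", "Gecko")

def looks_suspicious(headers: dict) -> bool:
    """Flag requests that claim to be a browser but lack Fetch Metadata."""
    ua = headers.get("User-Agent", "")
    claims_browser = any(h in ua for h in BROWSER_UA_HINTS)
    has_fetch_metadata = "Sec-Fetch-Mode" in headers
    # A "browser" that never sends Sec-Fetch-* headers is likely automated.
    return claims_browser and not has_fetch_metadata

print(looks_suspicious({"User-Agent": "Mozilla/5.0 (X11; Linux x86_64)"}))
# → True
print(looks_suspicious({"User-Agent": "Mozilla/5.0 (X11; Linux x86_64)",
                        "Sec-Fetch-Mode": "navigate"}))
# → False
```

Checks like this catch naive scrapers cheaply, but headless browsers send the real headers, which is why the thread treats all of these defenses as layers rather than solutions.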

Ethics, Ownership, and “Theft”

  • Strong disagreement over whether web scraping for AI is “stealing,” an abuse of copyright, or just remixing public information.
  • Some creators resent their free work being monetized by large AI firms without consent, attribution, or compensation, and say this discourages sharing.
  • Others emphasize fair use, analogy to humans reading and learning, and warn against expanding copyright/control over what people (or models) can learn.

Impact on the Web & Search

  • Hidden links and spammy patterns risk search penalties or delisting; critics see this as self‑harm and “epistemological vandalism.”
  • Some welcome de‑indexing and envision a “small web” insulated from big search and AI; others still depend on Google visibility or ad revenue.
  • Comparisons to anti‑spam and DRM: many see an endless whack‑a‑mole where defenders may expend more effort than attackers.

Longer‑term AI/Data Issues

  • Discussion of “applied model collapse”: if enough slop and adversarial data enter the web, open‑web‑trained models may degrade.
  • Several expect a shift toward licensed, curated, provenance‑tracked datasets; poisoning is viewed either as leverage to accelerate that, or as pointless vandalism on the way there.