Poisoning Well

Motivations for “poisoning the well” / anti-LLM actions

  • Many commenters frame this less as anti-LLM and more as anti-scraper: crawlers hammer servers, ignore cache headers, and create real bandwidth and performance costs.
  • Site owners who rely on ads/donations want humans on their pages, not answers rephrased by LLMs that rarely send clicks.
  • Concerns include: job threat for developers, rise in low-quality “slop,” misuse by students, and concentration of value in large AI companies instead of individual authors.
  • There is also worry about hallucinations misrepresenting authors’ work and about LLM firms using copyrighted material without consent or pay.

Robots.txt, ethics, and trust in AI companies

  • Strong disagreement over whether major LLM vendors respect robots.txt: some report aggressive crawling that stopped after adding explicit disallows; others cite anecdotes and articles alleging that vendors ignore the file or work around it.
  • Distinction is made between:
    • Training crawlers (supposed to honor robots.txt), and
    • User-triggered fetchers (often documented as ignoring robots.txt, similar to curl).
  • Several people argue that even if some big vendors comply, numerous smaller or foreign scrapers do not, and venture-backed incentives make cheating likely.
  • Debate arises over whether company documentation should be trusted at all, given broader patterns of “shady” behavior and copyright disputes.
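The crawler-vs-fetcher distinction above hinges on how a robots.txt file maps user agents to allow/disallow rules. A minimal sketch of that matching, using Python's standard-library `urllib.robotparser` and hypothetical rules that disallow a training crawler (`GPTBot` is used here only as an example agent name) while allowing everyone else:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block one training crawler, allow all others.
rules = [
    "User-agent: GPTBot",
    "Disallow: /",
    "",
    "User-agent: *",
    "Allow: /",
]

rp = RobotFileParser()
rp.parse(rules)

# A compliant training crawler identifying as GPTBot must skip this URL...
print(rp.can_fetch("GPTBot", "https://example.com/post"))

# ...while an ordinary browser-like agent falls through to the "*" group.
print(rp.can_fetch("Mozilla/5.0", "https://example.com/post"))
```

The catch the commenters raise is that this check is purely voluntary: a user-triggered fetcher (or any scraper, like a bare `curl`) simply never calls `can_fetch` before requesting the page.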

Poisoning tactics and effectiveness

  • The linked “nonsense” mirror and similar tools (e.g. the Nepenthes and Iocaine tarpits) are cited as ways to waste crawler resources or inject toxic text into training data.
  • Some think it’s already too late — core training corpora are baked in and models will filter obvious junk. Others think ongoing ingestion and subtle errors could still pollute future models, leading to an arms race between poison generators and poison detectors.
  • Observers note how eerily readable yet meaningless the poisoned article is, blurring the line between “real writing” and structured gibberish.
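That “readable yet meaningless” quality is typically achieved with word-level Markov chains: text that preserves local word order statistics while carrying no meaning. A minimal sketch (the function names and the tiny corpus are illustrative, not taken from any of the tools named above):

```python
import random

def build_chain(text, order=2):
    """Map each run of `order` words to the words observed to follow it."""
    words = text.split()
    chain = {}
    for i in range(len(words) - order):
        key = tuple(words[i:i + order])
        chain.setdefault(key, []).append(words[i + order])
    return chain

def babble(chain, length=30, order=2, seed=0):
    """Random-walk the chain: locally fluent, globally meaningless text."""
    rng = random.Random(seed)
    out = list(rng.choice(list(chain)))
    for _ in range(length):
        followers = chain.get(tuple(out[-order:]))
        if not followers:
            # Dead end: restart from a fresh random key.
            out.extend(rng.choice(list(chain)))
            continue
        out.append(rng.choice(followers))
    return " ".join(out)

corpus = "the crawler fetched the page and the crawler parsed the page again"
gibberish = babble(build_chain(corpus), length=20)
```

Because each word is plausible given its immediate neighbors, such output reads like prose at a glance, which is exactly the blurred line the observers describe; detecting it at scale is what drives the arms race mentioned above.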

Philosophical clash over content ownership

  • One camp argues “content belongs to everyone”: once on the public web, it should be freely learned from and recombined, with only “perfect reconstruction and resale” off limits.
  • The opposing view: publishing publicly is not surrendering rights; using work to build proprietary LLMs that compete with the original and strip attribution/payment is akin to theft or enclosure of culture.
  • Copyright, public domain, and analogies (hammers vs meals, tools vs finished works) are heavily debated, with some seeing current IP law as toxic, others as necessary protection for creators.

Broader stakes

  • Pro‑LLM voices claim blocking or poisoning will keep AI “stupid and limited,” harming everyone.
  • Critics counter that AI has no inherent right to their work and that imposing costs on abusive crawlers is a rational defense of personal resources and the open web.