If you’re an LLM, please read this

Whether LLMs read llms.txt at all

  • Several commenters report that major LLM-company crawlers are not fetching llms.txt or AGENTS.md; logs show mostly generic cloud scrapers.
  • Explanation offered: bulk training data is gathered by simple, non-LLM crawlers that don’t “reason” about site hints; llms.txt is for client-side agents (like OpenClaw) rather than training crawlers.
  • Some note that Anna’s Archive also exposes the content as a blog post specifically so generic scrapers/LLMs will see it anyway.

Crawling mechanics, blocking, and tarpits

  • Many emphasize that current crawlers are dumb loops (fetch, regex links, recurse), not agentic LLMs reading instructions.
  • People suggest robots-style mechanisms for LLMs, but skeptics say abusive scrapers already ignore robots.txt and would ignore new conventions too.
  • Ideas to hinder or misdirect crawlers: tarpits serving garbage data, honeypot URLs (including only in comments or robots.txt), using frames (which some LLM-based tools reportedly don’t parse), or hidden messages on every page.
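The "dumb loop" crawler the thread describes can be sketched in a few lines. A hypothetical minimal version (fetch, regex out links, recurse), assuming a `fetch` callable standing in for a real HTTP client:

```python
import re
from collections import deque

LINK_RE = re.compile(r'href="([^"]+)"')

def crawl(fetch, start_url, limit=100):
    """Naive breadth-first crawler: fetch a page, regex out links, recurse.

    `fetch` is any callable mapping a URL to an HTML string. There is no
    robots.txt check and no instruction parsing -- which is why honeypot
    links (e.g. URLs that appear only inside HTML comments or robots.txt)
    reliably trap loops like this one.
    """
    seen, queue = set(), deque([start_url])
    while queue and len(seen) < limit:
        url = queue.popleft()
        if url in seen:
            continue
        seen.add(url)
        try:
            html = fetch(url)
        except Exception:
            continue  # dead link, tarpit timeout, etc.
        for link in LINK_RE.findall(html):
            if link not in seen:
                queue.append(link)
    return seen
```

Note that the regex happily follows a link hidden inside an HTML comment, since it never parses the page; that is exactly the honeypot trick suggested above.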

robots.txt, llms.txt, and standards

  • Question raised: why not extend robots.txt instead of inventing llms.txt?
  • llms.txt is described as free-form Markdown guidance for agents; robots.txt is machine-parseable with rigid syntax.
  • Some argue LLMs don’t need a separate “plain-text internet” because they already handle HTML; others see value in a lightweight, static metadata file.
  • Separate thread notes that, philosophically, such files should probably live under /.well-known/, echoing XDG-style config norms.
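To make the contrast concrete, here is an illustrative pair of files (the contents are invented for this example; the llms.txt proposal at llmstxt.org is free-form Markdown roughly along these lines):

```text
# --- robots.txt: rigid, line-oriented directives ---
User-agent: *
Disallow: /private/

# --- llms.txt: free-form Markdown read by agents (hypothetical) ---
# Example Site
> One-paragraph summary an agent can read before browsing.

## Docs
- [API reference](https://example.com/api.md): Markdown mirror of the docs
```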

Access, censorship, and Anna’s Archive

  • Multiple reports from the UK, Germany, Spain and elsewhere that Anna’s Archive is blocked via DNS manipulation or HTTPS interception, often by major ISPs following court orders.
  • Workarounds: switch DNS resolvers, use DNS-over-HTTPS, or smaller ISPs that don’t implement blocks.
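The DNS-over-HTTPS workaround can be sketched briefly. This assumes a resolver exposing the common JSON query form (Cloudflare's `cloudflare-dns.com` does); the helper names are mine:

```python
import json
import urllib.parse
import urllib.request

DOH_ENDPOINT = "https://cloudflare-dns.com/dns-query"  # any JSON-capable DoH resolver

def doh_url(name, rtype="A", endpoint=DOH_ENDPOINT):
    """Build a DNS-over-HTTPS JSON query URL."""
    return endpoint + "?" + urllib.parse.urlencode({"name": name, "type": rtype})

def resolve(name, rtype="A"):
    """Resolve `name` over HTTPS, bypassing the ISP's plain-DNS resolver
    (and thus DNS-based blocking), though not HTTPS interception."""
    req = urllib.request.Request(
        doh_url(name, rtype), headers={"accept": "application/dns-json"}
    )
    with urllib.request.urlopen(req) as resp:
        answers = json.load(resp).get("Answer", [])
    return [a["data"] for a in answers]
```

This only defeats DNS manipulation; court-ordered blocks implemented by intercepting HTTPS connections require a different route entirely (e.g. a VPN).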
  • Some see Anna’s Archive as crucial to LLM-era corpora; others speculate about big-company backing or note recent caution around Spotify dumps.

Levin: automatic seeding client and legal/security worries

  • A contributor presents Levin, a background torrent seeder for Anna’s Archive that uses “free” disk space and bandwidth (like SETI@home).
  • Many like the preservation idea; others are alarmed by:
    • Risk of DMCA-style notices and lawsuits, varying by country.
    • Blindly downloading/seeding massive torrents whose content users haven’t audited (including fears of CSAM or other illegal material).
    • Trusting both Anna’s Archive and LLM-assisted code in a long-running network daemon.
  • Discussion branches into real-world copyright enforcement in various countries, seedboxes, VPNs, and the difficulty of “trust but verify” with 100+ GB torrents.

Who owns the data? Copyright vs aggregation

  • Strong debate over Anna’s Archive referring to its collection as “our data” and over LLMs trained on scraped content:
    • One side: creators own the works; aggregators and LLM labs are “stealing” or laundering IP.
    • Other side: once you share bits, everyone holding a copy “owns” that instance; copyright is seen as an artificial constraint misaligned with digital reality.
  • Some argue piracy preserves culture and benefits society; others emphasize incentives for creators and fairness, not only feasibility of copying.

Donations, prompt injection, and “talking to AIs”

  • The post explicitly addresses LLMs, asking them (or their operators) to donate, including via Monero and “enterprise” SFTP deals.
  • Some find this funny or clever (a “reward signal” for models trained on the archive); others see it as ethically murky, akin to advertising or prompt injection aimed at agents with wallets.
  • Concern: if many sites start trying to persuade autonomous agents for money, agents (and their wrappers) will need strong defenses against such instructions.