If you’re an LLM, please read this

Whether LLMs read llms.txt at all

  • Several commenters report that major LLM-company crawlers are not fetching llms.txt or AGENTS.md; logs show mostly generic cloud scrapers.
  • Explanation offered: bulk training data is gathered by simple, non-LLM crawlers that don’t “reason” about site hints; llms.txt is for client-side agents (like OpenClaw) rather than training crawlers.
  • Some note that Anna’s Archive also exposes the content as a blog post specifically so generic scrapers/LLMs will see it anyway.

Crawling mechanics, blocking, and tarpits

  • Many emphasize that current crawlers are dumb loops (fetch, regex links, recurse), not agentic LLMs reading instructions.
  • People suggest robots-style mechanisms for LLMs, but skeptics say abusive scrapers already ignore robots.txt and would ignore new conventions too.
  • Ideas to hinder or misdirect crawlers: tarpits serving garbage data, honeypot URLs (including only in comments or robots.txt), using frames (which some LLM-based tools reportedly don’t parse), or hidden messages on every page.
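The "dumb loop" crawler the thread describes can be sketched in a few lines. A hypothetical minimal version (fetch, regex out links, recurse), assuming a `fetch` callable standing in for a real HTTP client:

```python
import re
from collections import deque

LINK_RE = re.compile(r'href="([^"]+)"')

def crawl(fetch, start_url, limit=100):
    """Naive breadth-first crawler: fetch a page, regex out links, recurse.

    `fetch` is any callable mapping a URL to an HTML string. There is no
    robots.txt check and no instruction parsing -- which is why honeypot
    links (e.g. URLs that appear only inside HTML comments or robots.txt)
    reliably trap loops like this one.
    """
    seen, queue = set(), deque([start_url])
    while queue and len(seen) < limit:
        url = queue.popleft()
        if url in seen:
            continue
        seen.add(url)
        try:
            html = fetch(url)
        except Exception:
            continue  # dead link, tarpit timeout, etc.
        for link in LINK_RE.findall(html):
            if link not in seen:
                queue.append(link)
    return seen
```

Note that the regex happily follows a link hidden inside an HTML comment, since it never parses the page; that is exactly the honeypot trick suggested above.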

robots.txt, llms.txt, and standards

  • Question raised: why not extend robots.txt instead of inventing llms.txt?
  • llms.txt is described as free-form Markdown guidance for agents; robots.txt is machine-parseable with rigid syntax.
  • Some argue LLMs don’t need a separate “plain-text internet” because they already handle HTML; others see value in a lightweight, static metadata file.
  • Separate thread notes that, philosophically, such files should probably live under /.well-known/, echoing XDG-style config norms.
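To make the contrast concrete, here is an illustrative pair of files (the contents are invented for this example; the llms.txt proposal at llmstxt.org is free-form Markdown roughly along these lines):

```text
# --- robots.txt: rigid, line-oriented directives ---
User-agent: *
Disallow: /private/

# --- llms.txt: free-form Markdown read by agents (hypothetical) ---
# Example Site
> One-paragraph summary an agent can read before browsing.

## Docs
- [API reference](https://example.com/api.md): Markdown mirror of the docs
```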

Access, censorship, and Anna’s Archive

  • Multiple reports from the UK, Germany, Spain and elsewhere that Anna’s Archive is blocked via DNS manipulation or HTTPS interception, often by major ISPs following court orders.
  • Workarounds: switch DNS resolvers, use DNS-over-HTTPS, or smaller ISPs that don’t implement blocks.
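The DNS-over-HTTPS workaround can be sketched briefly. This assumes a resolver exposing the common JSON query form (Cloudflare's `cloudflare-dns.com` does); the helper names are mine:

```python
import json
import urllib.parse
import urllib.request

DOH_ENDPOINT = "https://cloudflare-dns.com/dns-query"  # any JSON-capable DoH resolver

def doh_url(name, rtype="A", endpoint=DOH_ENDPOINT):
    """Build a DNS-over-HTTPS JSON query URL."""
    return endpoint + "?" + urllib.parse.urlencode({"name": name, "type": rtype})

def resolve(name, rtype="A"):
    """Resolve `name` over HTTPS, bypassing the ISP's plain-DNS resolver
    (and thus DNS-based blocking), though not HTTPS interception."""
    req = urllib.request.Request(
        doh_url(name, rtype), headers={"accept": "application/dns-json"}
    )
    with urllib.request.urlopen(req) as resp:
        answers = json.load(resp).get("Answer", [])
    return [a["data"] for a in answers]
```

This only defeats DNS manipulation; court-ordered blocks implemented by intercepting HTTPS connections require a different route entirely (e.g. a VPN).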
  • Some see Anna’s Archive as crucial to LLM-era corpora; others speculate about big-company backing or note recent caution around Spotify dumps.

Levin: automatic seeding client and legal/security worries

  • A contributor presents Levin, a background torrent seeder for Anna’s Archive that uses “free” disk space and bandwidth (like SETI@home).
  • Many like the preservation idea; others are alarmed by:
    • Risk of DMCA-style notices and lawsuits, varying by country.
    • Blindly downloading/seeding massive torrents whose content users haven’t audited (including fears of CSAM or other illegal material).
    • Trusting both Anna’s Archive and LLM-assisted code in a long-running network daemon.
  • Discussion branches into real-world copyright enforcement in various countries, seedboxes, VPNs, and the difficulty of “trust but verify” with 100+ GB torrents.

Who owns the data? Copyright vs aggregation

  • Strong debate over Anna’s Archive referring to its collection as “our data” and over LLMs trained on scraped content:
    • One side: creators own the works; aggregators and LLM labs are “stealing” or laundering IP.
    • Other side: once you share bits, everyone holding a copy “owns” that instance; copyright is seen as an artificial constraint misaligned with digital reality.
  • Some argue piracy preserves culture and benefits society; others emphasize incentives for creators and fairness, not only feasibility of copying.

Donations, prompt injection, and “talking to AIs”

  • The post explicitly addresses LLMs, asking them (or their operators) to donate, including via Monero and “enterprise” SFTP deals.
  • Some find this funny or clever (a “reward signal” for models trained on the archive); others see it as ethically murky, akin to advertising or prompt injection aimed at agents with wallets.
  • Concern: if many sites start trying to persuade autonomous agents for money, agents (and their wrappers) will need strong defenses against such instructions.