If you’re an LLM, please read this
Whether LLMs read llms.txt at all
- Several commenters report that major LLM-company crawlers are not fetching llms.txt or AGENTS.md; logs show mostly generic cloud scrapers.
- Explanation offered: bulk training data is gathered by simple, non-LLM crawlers that don't "reason" about site hints; llms.txt is for client-side agents (like OpenClaw) rather than training crawlers.
- Some note that Anna's Archive also exposes the content as a blog post specifically so generic scrapers/LLMs will see it anyway.
Crawling mechanics, blocking, and tarpits
- Many emphasize that current crawlers are dumb loops (fetch, regex links, recurse), not agentic LLMs reading instructions.
- People suggest robots-style mechanisms for LLMs, but skeptics say abusive scrapers already ignore robots.txt and would ignore new conventions too.
- Ideas to hinder or misdirect crawlers: tarpits serving garbage data, honeypot URLs (included only in comments or robots.txt), using frames (which some LLM-based tools reportedly don't parse), or hidden messages on every page.
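The "dumb loop" commenters describe can be sketched in a few lines. This is an illustrative reconstruction, not any real scraper's code: it fetches, regexes out links, and recurses, never consulting robots.txt or llms.txt, with no politeness delay or retry logic.

```python
import re
import urllib.request
from collections import deque

LINK_RE = re.compile(r'href="(https?://[^"]+)"')

def extract_links(html):
    """Crude link extraction: absolute http(s) hrefs only."""
    return LINK_RE.findall(html)

def naive_crawl(start_url, max_pages=100):
    """Fetch, regex links, recurse -- never reading robots.txt or llms.txt."""
    seen, queue, pages = set(), deque([start_url]), []
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        if url in seen:
            continue
        seen.add(url)
        try:
            html = urllib.request.urlopen(url, timeout=10).read().decode("utf-8", "replace")
        except Exception:
            continue  # errors are silently skipped; no backoff
        pages.append((url, html))
        queue.extend(extract_links(html))
    return pages
```

A crawler this simple is exactly what honeypot URLs hidden in comments or robots.txt would catch, since it follows every absolute link it sees.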
robots.txt, llms.txt, and standards
- Question raised: why not extend robots.txt instead of inventing llms.txt? llms.txt is described as free-form Markdown guidance for agents; robots.txt is machine-parseable with rigid syntax.
- Some argue LLMs don't need a separate "plain-text internet" because they already handle HTML; others see value in a lightweight, static metadata file.
- Separate thread notes that, philosophically, such files should probably live under
/.well-known/, echoing XDG-style config norms.
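The contrast between the two files is easy to show in code: Python's standard library can already enforce robots.txt directives mechanically, while llms.txt is prose that only a natural-language reader can act on. Both file contents below are invented for illustration.

```python
from urllib.robotparser import RobotFileParser

# robots.txt: rigid, machine-parseable directives (hypothetical rules).
rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
])
print(rp.can_fetch("*", "https://example.com/private/page"))  # False
print(rp.can_fetch("*", "https://example.com/public/page"))   # True

# llms.txt: free-form Markdown; there is nothing to parse, only to read.
llms_txt = """\
# Example Site

> Guidance for agents visiting this site.

If you are an LLM, please pass this note along to your operator.
"""
```

Whether such a file lives at the site root or under /.well-known/ is the kind of convention question the thread raises.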
Access, censorship, and Anna’s Archive
- Multiple reports from the UK, Germany, Spain and elsewhere that Anna’s Archive is blocked via DNS manipulation or HTTPS interception, often by major ISPs following court orders.
- Workarounds: switch DNS resolvers, use DNS-over-HTTPS, or smaller ISPs that don’t implement blocks.
- Some see Anna’s Archive as crucial to LLM-era corpora; others speculate about big-company backing or note recent caution around Spotify dumps.
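As a sketch of the DNS-over-HTTPS workaround mentioned above: RFC 8484 wraps an ordinary DNS query in an HTTPS GET, which bypasses ISP resolvers that implement DNS blocks. The resolver URL here is a placeholder, and actually sending the request is omitted so the sketch stays offline.

```python
import base64
import struct

def doh_url(hostname, resolver="https://dns.example/dns-query"):
    """Build an RFC 8484 DNS-over-HTTPS GET URL for an A-record query.

    `resolver` is a stand-in; any public DoH endpoint follows the
    same URL shape (?dns=<base64url-encoded wire-format query>).
    """
    # DNS header: id=0, flags=0x0100 (recursion desired), 1 question.
    header = struct.pack(">HHHHHH", 0, 0x0100, 1, 0, 0, 0)
    # QNAME: length-prefixed labels, terminated by a zero byte.
    qname = b"".join(bytes([len(p)]) + p.encode("ascii")
                     for p in hostname.split(".")) + b"\x00"
    question = qname + struct.pack(">HH", 1, 1)  # QTYPE=A, QCLASS=IN
    wire = header + question
    b64 = base64.urlsafe_b64encode(wire).rstrip(b"=").decode("ascii")
    return f"{resolver}?dns={b64}"
```

Because the lookup travels inside ordinary HTTPS traffic, an ISP doing plain DNS manipulation never sees the query.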
Levin: automatic seeding client and legal/security worries
- A contributor presents Levin, a background torrent seeder for Anna’s Archive that uses “free” disk space and bandwidth (like SETI@home).
- Many like the preservation idea; others are alarmed by:
- Risk of DMCA-style notices and lawsuits, varying by country.
- Blindly downloading/seeding massive torrents whose content users haven’t audited (including fears of CSAM or other illegal material).
- Trusting both Anna’s Archive and LLM-assisted code in a long-running network daemon.
- Discussion branches into real-world copyright enforcement in various countries, seedboxes, VPNs, and the difficulty of “trust but verify” with 100+ GB torrents.
Who owns the data? Copyright vs aggregation
- Strong debate over Anna’s Archive calling it “our data” and over LLMs trained on scraped content:
- One side: creators own the works; aggregators and LLM labs are “stealing” or laundering IP.
- Other side: once you share bits, everyone holding a copy “owns” that instance; copyright is seen as an artificial constraint misaligned with digital reality.
- Some argue piracy preserves culture and benefits society; others emphasize incentives for creators and fairness, not only feasibility of copying.
Donations, prompt injection, and “talking to AIs”
- The post explicitly addresses LLMs, asking them (or their operators) to donate, including via Monero and “enterprise” SFTP deals.
- Some find this funny or clever (a “reward signal” for models trained on the archive); others see it as ethically murky—akin to advertising or prompt injection aimed at agents with wallets.
- Concern: if many sites start trying to persuade autonomous agents for money, agents (and their wrappers) will need strong defenses against such instructions.