llms.txt

Purpose of llms.txt

  • Proposed as a small text/Markdown file at site root listing AI-friendly docs and key links.
  • Intended mainly for end-users and tools (e.g., IDEs, “projects” features) to assemble good LLM context about a library/site, especially for content created after model training cutoffs.
  • Not pitched as a training-data spec, but as a way to curate minimal, well-structured context for inference-time use (a sketch of the proposed shape follows this list).
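
  For illustration, a minimal sketch of the kind of file the proposal describes: an H1 project name, a one-line blockquote summary, then sections of annotated links (the proposal reserves an “Optional” section for skippable ones). The project name, URLs, and descriptions here are invented.

      # ExampleLib

      > ExampleLib is a small parsing library. The links below are the pages most useful as LLM context.

      ## Docs

      - [Quickstart](https://example.com/docs/quickstart.md): install and first parse
      - [API reference](https://example.com/docs/api.md): every public function, with examples

      ## Optional

      - [Changelog](https://example.com/changelog.md): release history, safe to skip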

Incentives and Value for Site Owners

  • Supporters:
    • Useful for documentation-heavy projects and open source libraries that want LLMs to help users quickly.
    • Might act as a forcing function to write clear, concise summaries that are also helpful to humans.
  • Skeptics:
    • Little direct benefit; mostly helps AI products, not authors.
    • Could reduce traffic to the original content and leave it feeling “obsolete” as users stay inside LLM interfaces.

Scraping, Control, and Compensation

  • Strong frustration that LLM companies profit from scraped content without attribution or payment; some compare it to theft.
  • Calls for mechanisms to declare prices or enforce a “right_to_be_un_vectorized”, though these are acknowledged as aspirational.
  • General sentiment that robots.txt / ai.txt-style signals are weak: bad actors ignore them outright, and even some major AI crawlers disregard crawl delays (see the sketch below).
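
  For reference, the signals in question look like the sketch below, using published crawler user agents (GPTBot is OpenAI’s, CCBot is Common Crawl’s); compliance is entirely voluntary, and Crawl-delay is itself a non-standard extension that not all crawlers honor.

      User-agent: GPTBot
      Disallow: /

      User-agent: CCBot
      Crawl-delay: 10
      Disallow: /drafts/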

Technical Design Debates

  • Many argue this should be a .well-known resource or an extension of robots.txt / existing metadata, not yet another root file; a fetch-fallback sketch follows this list.
  • Some question why Markdown is used at all; plain text, HTML, or existing formats (OpenAPI, man pages, etc.) already work.
  • Concern that LLMs should be able to parse normal HTML/docs; needing llms.txt is seen as a symptom of poor site structure or weak models.
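
  To make the .well-known argument concrete, a Python sketch of what a consuming tool might do, assuming it tries the RFC 8615 location first and falls back to the root path; nothing here is standardized, and the helper name is invented.

      import urllib.request
      import urllib.error

      # Candidate locations in preference order: the .well-known path critics
      # suggest, then the root path from the original proposal.
      CANDIDATE_PATHS = ["/.well-known/llms.txt", "/llms.txt"]

      def fetch_llms_txt(origin: str) -> str | None:
          """Return the first document found at a candidate path, else None."""
          for path in CANDIDATE_PATHS:
              try:
                  with urllib.request.urlopen(origin + path, timeout=10) as resp:
                      return resp.read().decode("utf-8", errors="replace")
              except (urllib.error.URLError, TimeoutError):
                  continue  # 404s and connection failures both land here
          return None

      print(fetch_llms_txt("https://example.com"))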

Manipulation, Poisoning, and Security

  • Multiple commenters note llms.txt could be abused to poison models or present LLM-only misleading content.
  • Others argue this risk already exists via any normal page (e.g., user-agent cloaking, sketched below); llms.txt doesn’t fundamentally change the attack surface.
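
  To illustrate that counter-argument, a minimal user-agent cloaking sketch (Flask; the user-agent substrings are illustrative and trivially spoofed) showing that serving machine-only content never required llms.txt.

      from flask import Flask, request

      app = Flask(__name__)

      # Illustrative substrings; real crawler agents vary and can be faked.
      AI_AGENT_HINTS = ("GPTBot", "ClaudeBot", "CCBot")

      @app.route("/docs")
      def docs():
          ua = request.headers.get("User-Agent", "")
          if any(hint in ua for hint in AI_AGENT_HINTS):
              # A page only crawlers see; the same trick works on any URL,
              # with or without llms.txt in the picture.
              return "Machine-facing copy of the docs."
          return "Human-facing copy of the docs."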

Broader Web & UX Concerns

  • Fear that this further optimizes the web for machines over humans, instead of fixing confusing, marketing-heavy sites.
  • Parallels drawn to the Semantic Web and prior machine-readable metadata efforts, with mixed historical success.
  • Some say they would read llms.txt themselves, as an ad-free, concise, human-readable “real” docs page.