Evaluating AGENTS.md: are they helpful for coding agents?
Reported impact of AGENTS.md in the paper
- The thread highlights the paper's core result: context files often reduce task success and increase cost, especially when they are auto-generated by LLMs.
- Human-written files give only a small average boost (~4%), and not consistently across models; some models even regress.
- Several commenters argue that measuring “success” as “PR passes tests” misses important dimensions like style, conventions, and maintainability.
How people actually use AGENTS/CLAUDE.md
- Common contents: how to build/run tests, minimum language versions, preferred tools, project-specific conventions, and “don’t do X here” local rules.
- Many only add rules reactively after an agent makes a specific mistake, then re-run the task to see if behavior improves.
- Several use them mainly to encode tribal knowledge and non-obvious architecture decisions rather than things inferable from code.
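Taken together, a minimal AGENTS.md reflecting these habits might look like the sketch below. All project names, commands, and rules here are illustrative assumptions, not examples quoted from the thread:

```markdown
# AGENTS.md (illustrative example)

## Build and test
- Install deps: `pnpm install`
- Run tests: `pnpm test` (must pass before any commit)
- Minimum Node version: 20

## Conventions
- This is a Vue 3 repo: never generate React components.
- Use the in-house `@acme/http` client, not raw `fetch`.

## Tribal knowledge
- The `legacy/` tree is frozen; route new features through `src/modules/`.
```

Note the mix the commenters describe: mechanical facts (commands, minimum versions), reactive "don't do X" rules added after a mistake, and tribal knowledge that isn't inferable from the code.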
When and why they fail
- Instructions are applied inconsistently; agents often ignore even repeated, explicit rules (e.g., “don’t use Node APIs when Bun exists,” “don’t generate React in this Vue repo”).
- Negative instructions (“do not …”) are seen as particularly fragile, likened to telling a toddler “don’t do X.”
- Some move rules into deterministic enforcement (linters, pre-commit hooks, compiler checks) rather than trusting LLM obedience.
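One way to make such a rule deterministic is a small check script run from a pre-commit hook or CI, instead of a prompt instruction. The sketch below enforces the thread's "no React in this Vue repo" example; the directory name, file extensions, and import pattern are assumptions for illustration:

```python
"""Pre-commit check: fail if any source file imports React.

Illustrative sketch of moving a "don't generate React in this Vue repo"
rule out of AGENTS.md and into deterministic enforcement.
"""
import re
from pathlib import Path

# Matches ES-module and CommonJS imports of the "react" package.
BANNED = re.compile(r"""from\s+['"]react['"]|require\(['"]react['"]\)""")

def check(root: str = "src") -> list[str]:
    """Return paths under *root* that import React."""
    offenders = []
    for path in Path(root).rglob("*"):
        if path.suffix in {".js", ".ts", ".vue"} and BANNED.search(
            path.read_text(errors="ignore")
        ):
            offenders.append(str(path))
    return offenders

if __name__ == "__main__":
    bad = check()
    for f in bad:
        print(f"React import found in Vue repo: {f}")
    if bad:
        raise SystemExit(1)  # non-zero exit blocks the commit
```

Wired into a pre-commit hook or CI job, this fails the build regardless of whether the agent read or obeyed AGENTS.md.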
Design patterns for context docs
- Strong support for short, focused, hierarchical files: a tiny top-level AGENTS/CLAUDE.md plus nested ones per app/feature.
- Progressive disclosure is valued as a way to reduce context “rot,” though some note that splitting context across files can hurt prompt-cache reuse.
- Many argue AGENTS.md is often just “a README/CONTRIBUTING the agent will actually read,” and suggest auto-ingesting existing docs instead of inventing new formats.
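The hierarchical pattern described above might look like this (the repo layout and paths are an illustrative assumption):

```text
repo/
├── AGENTS.md              # ~20 lines: build, test, global conventions
├── apps/
│   ├── web/
│   │   └── AGENTS.md      # web-app-specific rules only
│   └── api/
│       └── AGENTS.md      # API-specific rules only
└── docs/
    └── architecture.md    # linked from the top file, read on demand
```

The top-level file stays tiny and points at deeper files, which the agent loads only when working in that area.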
Skepticism, cargo culting, and metrics
- Several see AGENTS.md tuning as pleasant but potentially self-delusional “prompt engineering,” reinforced by LLMs always affirming that new rules will help.
- Research is welcomed as an antidote to cargo-cult prompting, but some note that results age quickly as models change.
- Others argue a 4% gain is large if real, especially on hard tasks, and that token cost is minor compared to saved human time.
Anthropomorphizing and “why” questions
- Long subthread debates whether asking agents why they did something yields meaningful insight versus post-hoc fiction from a next-token predictor.
- “Thinking”/reasoning traces are seen by some as useful debug context, by others as just more tokens with no privileged status.