Evaluating AGENTS.md: are they helpful for coding agents?
Reported impact of AGENTS.md in the paper
- The thread highlights the paper's core result: context files often reduce task success and increase cost, especially when they are auto-generated by LLMs.
- Human-written files give only a small average boost (~4%), and not consistently across models; some models even regress.
- Several commenters argue that measuring “success” as “PR passes tests” misses important dimensions like style, conventions, and maintainability.
How people actually use AGENTS/CLAUDE.md
- Common contents: how to build/run tests, minimum language versions, preferred tools, project-specific conventions, and “don’t do X here” local rules.
- Many only add rules reactively after an agent makes a specific mistake, then re-run the task to see if behavior improves.
- Several use them mainly to encode tribal knowledge and non-obvious architecture decisions rather than things inferable from code.
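Taken together, a minimal AGENTS.md reflecting these habits might look like the sketch below. All project names, commands, and rules here are illustrative assumptions, not examples quoted from the thread:

```markdown
# AGENTS.md (illustrative example)

## Build and test
- Install deps: `pnpm install`
- Run tests: `pnpm test` (must pass before any commit)
- Minimum Node version: 20

## Conventions
- This is a Vue 3 repo: never generate React components.
- Use the in-house `@acme/http` client, not raw `fetch`.

## Tribal knowledge
- The `legacy/` tree is frozen; route new features through `src/modules/`.
```

Note the mix the commenters describe: mechanical facts (commands, minimum versions), reactive "don't do X" rules added after a mistake, and tribal knowledge that isn't inferable from the code.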
When and why they fail
- Instructions are applied inconsistently; agents often ignore even repeated, explicit rules (e.g., “don’t use Node APIs when Bun exists,” “don’t generate React in this Vue repo”).
- Negative instructions (“do not …”) are seen as particularly fragile, likened to telling a toddler “don’t do X.”
- Some move rules into deterministic enforcement (linters, pre-commit hooks, compiler checks) rather than trusting LLM obedience.
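One way to make such a rule deterministic is a small check script run from a pre-commit hook or CI, instead of a prompt instruction. The sketch below enforces the thread's "no React in this Vue repo" example; the directory name, file extensions, and import pattern are assumptions for illustration:

```python
"""Pre-commit check: fail if any source file imports React.

Illustrative sketch of moving a "don't generate React in this Vue repo"
rule out of AGENTS.md and into deterministic enforcement.
"""
import re
from pathlib import Path

# Matches ES-module and CommonJS imports of the "react" package.
BANNED = re.compile(r"""from\s+['"]react['"]|require\(['"]react['"]\)""")

def check(root: str = "src") -> list[str]:
    """Return paths under *root* that import React."""
    offenders = []
    for path in Path(root).rglob("*"):
        if path.suffix in {".js", ".ts", ".vue"} and BANNED.search(
            path.read_text(errors="ignore")
        ):
            offenders.append(str(path))
    return offenders

if __name__ == "__main__":
    bad = check()
    for f in bad:
        print(f"React import found in Vue repo: {f}")
    if bad:
        raise SystemExit(1)  # non-zero exit blocks the commit
```

Wired into a pre-commit hook or CI job, this fails the build regardless of whether the agent read or obeyed AGENTS.md.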
Design patterns for context docs
- Strong support for short, focused, hierarchical files: a tiny top-level AGENTS/CLAUDE.md plus nested ones per app/feature.
- Progressive disclosure is valued as a way to reduce context “rot,” though some note that splitting context across files can hurt prompt-cache reuse.
- Many argue AGENTS.md is often just “a README/CONTRIBUTING the agent will actually read,” and suggest auto-ingesting existing docs instead of inventing new formats.
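The hierarchical pattern described above might look like this (the repo layout and paths are an illustrative assumption):

```text
repo/
├── AGENTS.md              # ~20 lines: build, test, global conventions
├── apps/
│   ├── web/
│   │   └── AGENTS.md      # web-app-specific rules only
│   └── api/
│       └── AGENTS.md      # API-specific rules only
└── docs/
    └── architecture.md    # linked from the top file, read on demand
```

The top-level file stays tiny and points at deeper files, which the agent loads only when working in that area.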
Skepticism, cargo culting, and metrics
- Several see AGENTS.md tuning as pleasant but potentially self-delusional “prompt engineering,” reinforced by LLMs always affirming that new rules will help.
- Research is welcomed as an antidote to cargo-cult prompting, but some note that results age quickly as models change.
- Others argue a 4% gain is large if real, especially on hard tasks, and that token cost is minor compared to saved human time.
Anthropomorphizing and “why” questions
- Long subthread debates whether asking agents why they did something yields meaningful insight versus post-hoc fiction from a next-token predictor.
- “Thinking”/reasoning traces are seen by some as useful debug context, by others as just more tokens with no privileged status.