Evaluating AGENTS.md files: are they helpful for coding agents?

Reported impact of AGENTS.md in the paper

  • The thread highlights the core result: context files often reduce task success and increase cost, especially when auto-generated by LLMs.
  • Human-written files give only a small average boost (~4%), and not consistently across models; some models even regress.
  • Several commenters argue that measuring “success” as “PR passes tests” misses important dimensions like style, conventions, and maintainability.

How people actually use AGENTS/CLAUDE.md

  • Common contents: how to build/run tests, minimum language versions, preferred tools, project-specific conventions, and “don’t do X here” local rules.
  • Many only add rules reactively after an agent makes a specific mistake, then re-run the task to see if behavior improves.
  • Several use them mainly to encode tribal knowledge and non-obvious architecture decisions rather than things inferable from code.
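The contents people describe above might look like the following minimal file; the commands, versions, and rules here are illustrative assumptions, not taken from any project in the thread:

```
# AGENTS.md (illustrative sketch)

## Build & test
- Install: `npm ci`
- Run tests: `npm test` (must pass before opening a PR)

## Environment
- Node >= 20; TypeScript strict mode is required.

## Conventions (tribal knowledge, not inferable from code)
- All network calls go through the shared HTTP wrapper, not raw `fetch`.
- Files under `generated/` are build artifacts; never edit them by hand.
```

Note the emphasis on non-obvious decisions: the wrapper rule and the `generated/` rule are exactly the kind of tribal knowledge commenters say an agent cannot infer from the code alone.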

When and why they fail

  • Instructions are applied inconsistently; agents often ignore even repeated, explicit rules (e.g., “don’t use Node APIs when Bun exists,” “don’t generate React in this Vue repo”).
  • Negative instructions (“do not …”) are seen as particularly fragile, likened to telling a toddler “don’t do X.”
  • Some move rules into deterministic enforcement (linters, pre-commit hooks, compiler checks) rather than trusting LLM obedience.
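A minimal sketch of that last idea: moving a fragile "do not ..." prompt rule into a deterministic check that can run as a pre-commit hook or CI step. The patterns, reasons, and `check` helper are illustrative assumptions, not from the discussion:

```python
# Sketch: enforce "don't do X" rules deterministically instead of trusting
# the LLM to obey them. Patterns and reasons are hypothetical examples
# matching the complaints quoted above (React-in-Vue, Node-vs-Bun).
import re
from pathlib import Path

# Map of forbidden source patterns to human-readable reasons.
FORBIDDEN = {
    r"from ['\"]react['\"]": "React import in a Vue-only repo",
    r"require\(['\"]fs['\"]\)": "Node 'fs' where a Bun API exists",
}

def check(paths):
    """Return 'file:line: reason' strings for every violation found."""
    violations = []
    for p in paths:
        text = Path(p).read_text(encoding="utf-8", errors="ignore")
        for lineno, line in enumerate(text.splitlines(), 1):
            for pattern, reason in FORBIDDEN.items():
                if re.search(pattern, line):
                    violations.append(f"{p}:{lineno}: {reason}")
    return violations
```

Unlike a prompt instruction, a check like this fails the commit every time, whether or not the agent "remembered" the rule.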

Design patterns for context docs

  • Strong support for short, focused, hierarchical files: a tiny top-level AGENTS/CLAUDE.md plus nested ones per app/feature.
  • Progressive disclosure is valued to reduce context “rot,” though it may trade off with token caching.
  • Many argue AGENTS.md is often just “a README/CONTRIBUTING the agent will actually read,” and suggest auto-ingesting existing docs instead of inventing new formats.
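The hierarchical pattern described above can be sketched as a repo layout; the paths and comments here are illustrative, not a format any tool mandates:

```
repo/
├── AGENTS.md              # tiny top-level file: build, test, global rules
├── apps/
│   ├── web/AGENTS.md      # web-app conventions, read only when working there
│   └── api/AGENTS.md      # API-service rules, likewise scoped
└── packages/shared/AGENTS.md
```

Keeping each file small and scoped is the "progressive disclosure" trade-off noted above: less context rot per task, at the possible cost of token-cache reuse across tasks.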

Skepticism, cargo culting, and metrics

  • Several see AGENTS.md tuning as pleasant but potentially self-delusional “prompt engineering,” reinforced by LLMs always affirming that new rules will help.
  • Research is welcomed as an antidote to cargo-cult prompting, but some note that results age quickly as models change.
  • Others argue a 4% gain is large if real, especially on hard tasks, and that token cost is minor compared to saved human time.

Anthropomorphizing and “why” questions

  • A long subthread debates whether asking agents why they did something yields meaningful insight versus post-hoc fiction from a next-token predictor.
  • “Thinking”/reasoning traces are seen by some as useful debug context, by others as just more tokens with no privileged status.