AGENTS.md outperforms skills in our agent evals
What AGENTS.md vs Skills Actually Capture
- Many readers found the article confusing because AGENTS.md appears to replicate what skills already do: point the model to documentation with short descriptions and progressive disclosure.
- A common view: AGENTS.md is basically “a well-designed Skill baked into the system prompt” rather than a fundamentally different concept.
- The improvement is attributed less to “AGENTS.md vs skills” per se and more to better index design, fewer indirections, and always-on context.
Why AGENTS.md Seemed to Win in Their Evals
- With AGENTS.md, doc pointers (a compressed/minified index) are always in context; there’s no decision point about whether to invoke a skill.
- Skills add an extra step: the model must decide to use the skill, then the skill must locate the right docs; this fails surprisingly often. Several users report 5–50% non-invocation rates even when the need is obvious.
- Prompts that force skill use (“if there’s even a 1% chance, you MUST use it”) or rigid activation phrases can improve adherence but remain brittle; see the sketch after this list.
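To make the invocation problem concrete, here is a hedged sketch of the kind of forcing language commenters describe adding to a skill’s metadata. It assumes the commonly documented SKILL.md layout with YAML `name`/`description` frontmatter; the skill name, paths, and wording are illustrative, not taken from the article.

```markdown
<!-- hypothetical skill; all names and paths are illustrative -->
---
name: billing-docs
description: >
  Use this skill whenever the task touches invoices, payments, or refunds.
  If there is even a 1% chance this skill is relevant, you MUST invoke it
  before answering.
---

# Billing docs skill

Read docs/billing/overview.md first, then follow the pointers it lists.
```

Even with this phrasing, the model still has to make the “should I invoke it?” decision at runtime, which is exactly the step AGENTS.md removes.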
Context, Compression, and Tradeoffs
- Directly loading lots of docs or .context folders can help small/medium projects but quickly bloats the context window, increases cost, and can degrade performance.
- The AGENTS.md index is seen as a middle ground: cheap, compressed pointers instead of full docs or probabilistic skill activation (an example index is sketched after this list).
- Some argue this is unsurprising: if you optimize for one narrow task, “static linking” (AGENTS.md) will beat “dynamic linking” (skills); skills matter more when you have many capabilities and large codebases.
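As an illustration of what such a compressed, always-on index can look like, here is a hedged sketch of an AGENTS.md table of contents; the file names and one-line summaries are hypothetical stand-ins for whatever the project actually documents.

```markdown
# AGENTS.md — doc index (keep terse; loaded into every prompt)

- docs/architecture.md — service boundaries, data flow, deployment topology
- docs/api/errors.md — error codes, retry semantics, idempotency rules
- docs/testing.md — how to run unit/integration suites, fixtures, CI quirks
- docs/style.md — naming, linting, commit conventions

Read the relevant file before editing code it covers; do not guess.
```

The point of the thread’s “static linking” framing is that these few lines cost little context but remove the activation step entirely.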
Reliability, Methodology, and Model Behavior
- Multiple commenters question the rigor of the evals: unclear number of runs, no error bars, single-model (Claude) behavior, and results that are close together.
- Others note that even with perfect context, LLM agents remain non-deterministic and flaky; production usage should treat them like unreliable distributed systems with monitoring and failover (a sketch follows this list).
- There’s broad agreement that skills underperform today partly because models haven’t been extensively trained on them; many expect future generations and RL on tool-usage traces to close the gap.
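A minimal sketch of that “unreliable distributed system” posture, assuming hypothetical `run_agent` and `validate` hooks for whatever agent framework and acceptance check a team actually uses; the retry counts, backoff, and model names are placeholders, not recommendations from the thread.

```python
import logging
import time

logger = logging.getLogger("agent-runner")

def run_with_failover(task, run_agent, validate,
                      models=("primary", "fallback"),
                      attempts_per_model=2, backoff_s=2.0):
    """Treat the agent like a flaky remote service: retry, fall back, and log.

    `run_agent(task, model=...)` and `validate(result)` are hypothetical hooks
    supplied by the caller; nothing here is tied to a specific framework.
    """
    for model in models:
        for attempt in range(1, attempts_per_model + 1):
            try:
                result = run_agent(task, model=model)
                if validate(result):
                    logger.info("task ok on %s (attempt %d)", model, attempt)
                    return result
                logger.warning("validation failed on %s (attempt %d)", model, attempt)
            except Exception:
                logger.exception("agent call failed on %s (attempt %d)", model, attempt)
            time.sleep(backoff_s * attempt)  # crude linear backoff between retries
    raise RuntimeError("all models and retries exhausted for task")
```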
Consensus Emerging in the Thread
- Best practice for now:
  - Put a compact ToC/index (AGENTS.md/CLAUDE.md) always in the system prompt.
  - Use skills/MCP/secondary models for larger or specialized capabilities.
  - Iterate with evals rather than trusting one-off “vibes” benchmarks; a minimal repeated-run eval is sketched below.
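A hedged sketch of the kind of repeated-run eval the thread asks for, addressing the “unclear number of runs, no error bars” complaint; `run_agent_task` and the task list are hypothetical hooks for whatever harness a team already has.

```python
import statistics

def eval_pass_rate(tasks, run_agent_task, runs=10):
    """Run each task several times and report mean pass rate with a spread,
    instead of trusting a single run. `run_agent_task(task) -> bool` is a
    hypothetical hook returning whether the agent's output passed checks.
    """
    per_task = {}
    for task in tasks:
        outcomes = [1.0 if run_agent_task(task) else 0.0 for _ in range(runs)]
        mean = statistics.mean(outcomes)
        stdev = statistics.stdev(outcomes) if runs > 1 else 0.0
        per_task[task] = (mean, stdev)
        print(f"{task}: pass rate {mean:.0%} ± {stdev:.2f} over {runs} runs")
    return per_task
```

Running the same comparison (AGENTS.md vs skills) through a loop like this, per model, is what would turn the article’s result from a one-off observation into something the thread would accept.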