AGENTS.md outperforms skills in our agent evals
What AGENTS.md vs Skills Actually Capture
- Many readers found the article confusing because AGENTS.md appears to replicate what skills already do: point the model to documentation with short descriptions and progressive disclosure.
- A common view: AGENTS.md is basically “a well-designed Skill baked into the system prompt” rather than a fundamentally different concept.
- The improvement is attributed less to “AGENTS.md vs skills” per se and more to better index design, fewer indirections, and always-on context.
Why AGENTS.md Seemed to Win in Their Evals
- With AGENTS.md, doc pointers (a compressed/minified index) are always in context; there’s no decision point about whether to invoke a skill.
- Skills add an extra step: the model must decide to use the skill, then the skill must locate the right docs; this fails surprisingly often. Several users report 5–50% non-invocation rates even when the need is obvious.
- Prompts that force skill use (“if there’s even a 1% chance, you MUST use it”) or rigid activation phrases can improve adherence but remain brittle; see the sketch after this list.
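To make the invocation problem concrete, here is a hedged sketch of the kind of forcing language commenters describe adding to a skill’s metadata. It assumes the commonly documented SKILL.md layout with YAML `name`/`description` frontmatter; the skill name, paths, and wording are illustrative, not taken from the article.

```markdown
<!-- hypothetical skill; all names and paths are illustrative -->
---
name: billing-docs
description: >
  Use this skill whenever the task touches invoices, payments, or refunds.
  If there is even a 1% chance this skill is relevant, you MUST invoke it
  before answering.
---

# Billing docs skill

Read docs/billing/overview.md first, then follow the pointers it lists.
```

Even with this phrasing, the model still has to make the “should I invoke it?” decision at runtime, which is exactly the step AGENTS.md removes.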
Context, Compression, and Tradeoffs
- Directly loading lots of docs or .context folders can help small/medium projects but quickly bloats the context window, increases cost, and can degrade performance.
- The AGENTS.md index is seen as a middle ground: cheap, compressed pointers instead of full docs or probabilistic skill activation (an example index is sketched after this list).
- Some argue this is unsurprising: if you optimize for one narrow task, “static linking” (AGENTS.md) will beat “dynamic linking” (skills); skills matter more when you have many capabilities and large codebases.
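As an illustration of what such a compressed, always-on index can look like, here is a hedged sketch of an AGENTS.md table of contents; the file names and one-line summaries are hypothetical stand-ins for whatever the project actually documents.

```markdown
# AGENTS.md — doc index (keep terse; loaded into every prompt)

- docs/architecture.md — service boundaries, data flow, deployment topology
- docs/api/errors.md — error codes, retry semantics, idempotency rules
- docs/testing.md — how to run unit/integration suites, fixtures, CI quirks
- docs/style.md — naming, linting, commit conventions

Read the relevant file before editing code it covers; do not guess.
```

The point of the thread’s “static linking” framing is that these few lines cost little context but remove the activation step entirely.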
Reliability, Methodology, and Model Behavior
- Multiple commenters question the rigor of the evals: unclear number of runs, no error bars, single-model (Claude) behavior, and results that are close together.
- Others note that even with perfect context, LLM agents remain non-deterministic and flaky; production usage should treat them like unreliable distributed systems with monitoring and failover (a sketch follows this list).
- There’s broad agreement that skills underperform today partly because models haven’t been extensively trained on them; many expect future generations and RL on tool-usage traces to close the gap.
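A minimal sketch of that “unreliable distributed system” posture, assuming hypothetical `run_agent` and `validate` hooks for whatever agent framework and acceptance check a team actually uses; the retry counts, backoff, and model names are placeholders, not recommendations from the thread.

```python
import logging
import time

logger = logging.getLogger("agent-runner")

def run_with_failover(task, run_agent, validate,
                      models=("primary", "fallback"),
                      attempts_per_model=2, backoff_s=2.0):
    """Treat the agent like a flaky remote service: retry, fall back, and log.

    `run_agent(task, model=...)` and `validate(result)` are hypothetical hooks
    supplied by the caller; nothing here is tied to a specific framework.
    """
    for model in models:
        for attempt in range(1, attempts_per_model + 1):
            try:
                result = run_agent(task, model=model)
                if validate(result):
                    logger.info("task ok on %s (attempt %d)", model, attempt)
                    return result
                logger.warning("validation failed on %s (attempt %d)", model, attempt)
            except Exception:
                logger.exception("agent call failed on %s (attempt %d)", model, attempt)
            time.sleep(backoff_s * attempt)  # crude linear backoff between retries
    raise RuntimeError("all models and retries exhausted for task")
```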
Consensus Emerging in the Thread
- Best practice for now:
  - Put a compact ToC/index (AGENTS.md/CLAUDE.md) always in the system prompt.
  - Use skills/MCP/secondary models for larger or specialized capabilities.
  - Iterate with evals rather than trusting one-off “vibes” benchmarks; a minimal repeated-run eval is sketched below.
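A hedged sketch of the kind of repeated-run eval the thread asks for, addressing the “unclear number of runs, no error bars” complaint; `run_agent_task` and the task list are hypothetical hooks for whatever harness a team already has.

```python
import statistics

def eval_pass_rate(tasks, run_agent_task, runs=10):
    """Run each task several times and report mean pass rate with a spread,
    instead of trusting a single run. `run_agent_task(task) -> bool` is a
    hypothetical hook returning whether the agent's output passed checks.
    """
    per_task = {}
    for task in tasks:
        outcomes = [1.0 if run_agent_task(task) else 0.0 for _ in range(runs)]
        mean = statistics.mean(outcomes)
        stdev = statistics.stdev(outcomes) if runs > 1 else 0.0
        per_task[task] = (mean, stdev)
        print(f"{task}: pass rate {mean:.0%} ± {stdev:.2f} over {runs} runs")
    return per_task
```

Running the same comparison (AGENTS.md vs skills) through a loop like this, per model, is what would turn the article’s result from a one-off observation into something the thread would accept.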