SkillsBench: Benchmarking how well agent skills work across diverse tasks

Degradation from Multi-Layered LLM Use

  • Several commenters report that stacking LLM layers (plan → design → implementation all by AI) degrades quality: the more layers delegated, the messier and less maintainable the result.
  • This is framed as an “open-loop” problem: without feedback, verification, or human steering, each layer compounds errors and vagueness.
  • Some describe a “semantic collapse” effect when LLM outputs are repeatedly re-fed (for text, code, or images), likened to a telephone game; fresh human input is needed to reset quality.
  • Others note that larger context windows and session resets don’t fully fix it; LLM-produced tokens seem to carry less signal as inputs than human-written ones, even in a fresh session.

What the Paper Actually Tests (and Why Many Find It Misleading)

  • The paper’s “self-generated skills” are created before doing the task, with no tool access, no web search, no codebase exploration, and no fresh context restart.
  • Many argue this setup is unrealistic: it forces the model to write generic how-to docs from its own latent knowledge, then “use” those same, possibly hallucinated, docs at task time.
  • Commenters stress this is not how practitioners use skills; they consider the negative result unsurprising and of limited practical relevance.

How Practitioners Really Use Skills

  • Common real-world pattern: solve or attempt a task with the model, steer it, then distill what was learned into a skill; refine that skill across future runs.
  • Skills are seen as:
    • Project- or org-specific memory: infra details, codebase patterns, internal tools, domain quirks, team preferences.
    • A compression/cache of reasoning to cut repeated exploration and token use on recurring tasks.
    • Guardrails: “what not to do”, constraints, and quality rules.
  • Self-generated skills are considered useful only when backed by new information: research results, experiments, proprietary docs, or human clarification—not just the model rephrasing what it already “knows”.
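  • As a concrete illustration of the distill-then-reuse pattern above, a skill often ends up as a short markdown doc combining org-specific memory, cached reasoning, and guardrails. The sketch below is hypothetical: the file name, frontmatter fields, paths, and tool names are all invented for illustration, not drawn from the paper or any specific product.

```markdown
<!-- Hypothetical skill file; every name and path here is illustrative. -->
---
name: deploy-staging-service
description: Distilled steps and guardrails for deploying a service to our staging cluster
---

# Deploying to staging

## Org-specific context
- Staging manifests live in `infra/staging/`; never edit generated files under `infra/build/`.
- Services authenticate via the internal `auth-sidecar`, not API keys.

## Cached procedure (learned from earlier runs)
1. Bump the image tag in `infra/staging/<service>.yaml`.
2. Open a PR; CI runs smoke tests automatically.
3. After merge, verify rollout status before closing the ticket.

## Guardrails (what not to do)
- Do not apply manifests directly from a laptop; all changes go through CI.
- Do not bump shared base-image versions in a service PR.
```

A file like this encodes exactly the three roles listed above: memory (cluster layout, auth pattern), compressed reasoning (the three-step procedure that earlier runs converged on), and constraints the model would not reliably infer on its own.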

Interpretation of the Results (Curated vs Self-Generated Skills)

  • The reported gap—self-generated skills slightly harmful (–1.3pp) vs curated skills strongly helpful (+16.2pp)—matches many practitioners’ experience: LLMs are better consumers than producers of procedural knowledge.
  • The large gains in underrepresented domains (e.g. +51.9pp in healthcare vs +4.5pp in SWE) are seen as evidence that skills matter most where model priors are weak and knowledge is specialized/proprietary.
  • Commenters suggest the “missing condition” is human–AI co-created skills with real feedback; they expect this would outperform both raw and pre-written skill setups.

Risks, Limits, and Broader Reflections

  • Uncurated self-generated docs can codify and spread bad practices, especially in code, if teams treat them as “best practices” without review.
  • Some see skills and markdown memories as a crutch until true continual learning (weight updates) is feasible; others argue notes + retrieval are economically more realistic.
  • A few view the result as a useful null: agents don’t yet self-improve just by “planning harder” or writing their own skills in a vacuum; human guidance and external signals remain crucial.