SkillsBench: Benchmarking how well agent skills work across diverse tasks
Degradation from Multi-Layered LLM Use
- Several commenters report that stacking LLM layers (plan → design → implementation all by AI) degrades quality: the more layers delegated, the messier and less maintainable the result.
- This is framed as an “open-loop” problem: without feedback, verification, or human steering, each layer compounds errors and vagueness.
- Some describe a “semantic collapse” effect when LLM outputs are repeatedly re-fed (for text, code, or images), likened to a telephone game; fresh human input is needed to reset quality.
- Others note that larger context windows and fresh sessions don’t fully fix it; LLM-produced tokens seem to be weaker inputs than human-written ones, even after a reset.
What the Paper Actually Tests (and Why Many Find It Misleading)
- The paper’s “self-generated skills” are created before doing the task, with no tool access, no web search, no codebase exploration, and no fresh context restart.
- Many argue this setup is unrealistic: it forces the model to write generic how-to docs from its own latent knowledge, then “use” those same possibly hallucinated docs during the task.
- Commenters stress this is not how practitioners use skills; they consider the negative result unsurprising and of limited practical relevance.
How Practitioners Really Use Skills
- Common real-world pattern: solve or attempt a task with the model, steer it, then distill what was learned into a skill; refine that skill across future runs.
- Skills are seen as:
- Project- or org-specific memory: infra details, codebase patterns, internal tools, domain quirks, team preferences.
- A compression/cache of reasoning to cut repeated exploration and token use on recurring tasks.
- Guardrails: “what not to do”, constraints, and quality rules.
- Self-generated skills are considered useful only when backed by new information: research results, experiments, proprietary docs, or human clarification—not just the model rephrasing what it already “knows”.
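To make the “project-specific memory” and “guardrails” roles concrete, practitioners typically store a skill as a short markdown document the agent can load on demand. The sketch below is hypothetical: the file name, frontmatter fields, and every project detail are invented for illustration, in the style of common agent-skill formats.

```markdown
---
name: deploy-staging
description: Deploy this repo to staging, including known pitfalls.
---

# Deploying to staging

- Use `make deploy-staging`; it wraps the internal deploy tool with the right flags.
- Secrets live in the team vault, not in `.env` files.

## What not to do

- Never run `terraform apply` by hand; CI owns the state locks.
- Don't bump the base image without pinning a digest.
```

Note that almost everything here is exactly the kind of org-specific, non-public information commenters say the model cannot produce on its own.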
Interpretation of the Results (Curated vs Self-Generated Skills)
- The reported gap—self-generated skills slightly harmful (–1.3pp) vs curated skills strongly helpful (+16.2pp)—matches many practitioners’ experience: LLMs are better consumers than producers of procedural knowledge.
- The large gains in underrepresented domains (e.g. +51.9pp in healthcare vs +4.5pp in SWE) are seen as evidence that skills matter most where model priors are weak and knowledge is specialized/proprietary.
- Commenters suggest the “missing condition” is human–AI co-created skills with real feedback; they expect this would outperform both raw and pre-written skill setups.
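The co-creation loop commenters describe — attempt, steer, distill, reuse — can be sketched in a few lines of Python. Everything here is a stub for illustration: `run_agent` stands in for a real LLM call, and the human notes stand in for real review and steering; this is not an actual API.

```python
def run_agent(task: str, skill_store: dict[str, str]) -> str:
    # Stub for an LLM agent call; a real system would send the task
    # plus any relevant skill docs to the model and return a transcript.
    return f"attempt at {task!r} using {len(skill_store)} skill(s)"

def distill_skill(transcript: str, human_notes: str) -> str:
    # Turn a steered, verified run into a reusable skill document.
    # The human notes carry the fresh information the model lacked.
    return (
        "# Skill: distilled from a reviewed run\n\n"
        f"## Lessons (human-verified)\n{human_notes}\n\n"
        f"## Source transcript\n{transcript}\n"
    )

def solve_and_distill(task: str, skill_store: dict[str, str],
                      human_notes: str) -> str:
    # 1. Attempt the task with whatever skills already exist.
    transcript = run_agent(task, skill_store)
    # 2. A human steers and verifies; the outcome is distilled and cached
    #    so future runs skip the repeated exploration.
    skill_store[task] = distill_skill(transcript, human_notes)
    return skill_store[task]

store: dict[str, str] = {}
skill = solve_and_distill(
    "migrate billing cron to the new scheduler",
    store,
    "Use the team's wrapper script, not raw crontab edits.",
)
```

The key point the loop makes explicit is where new information enters: in `distill_skill` via the human notes, not from the model re-stating its own priors — which is precisely the condition the paper's setup omits.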
Risks, Limits, and Broader Reflections
- Uncurated self-generated docs can codify and spread bad practices, especially in code, if teams treat them as “best practices” without review.
- Some see skills and markdown memories as a crutch until true continual learning (weight updates) is feasible; others argue notes + retrieval are economically more realistic.
- A few view the result as a useful null: agents don’t yet self-improve just by “planning harder” or writing their own skills in a vacuum; human guidance and external signals remain crucial.