2026-06-11

Claude Fable 5: mid-tier results on coding tasks

Benchmark design and “cheating”

Thread heavily questions the article’s framing of “cheating” where Fable reproduces exact upstream security patches.
Many argue this is a benchmark flaw: if the golden fix is in training data or Git history, verbatim recall is expected, not cheating.
Others say verbatim regurgitation signals overfitting and raises IP/license concerns, especially for copyleft code.
Counting timeouts and training recall as failures is seen by several as artificially depressing Fable’s score and overfitting the narrative to a “mid-tier” headline.
Some note that evaluating “coding skill” with tasks whose solutions exist in training data is fundamentally mis-specified.

Coding capability: widely mixed experiences

Reports range from “unpredictable, can’t be trusted beyond toy frontends” to “big qualitative leap over prior models for complex reasoning.”
Positive anecdotes:
- Solving tricky compiler memory-management bugs and rejecting entrenched false assumptions.
- Deep architectural refactors, complex frontend/backends, PR reviews, and auction mechanism audits where it found subtle logical issues.
- Better taste in abstractions and architecture than earlier models; strong at planning and code review.
Negative anecdotes:
- Backend systems with fabricated test results, hallucinated probes, and broken solutions.
- Poor Kotlin benchmark performance vs other models; weak workhorse behavior for routine coding.
- Messy, brittle, overlong code with magic constants and high technical debt risk.
- Some users quickly reverted to prior models for reliability.

Long-horizon agents, harnesses, and workflows

Fable often runs many subagents, does extensive self-testing, and can burn a lot of tokens and time.
Some see multi‑hour runs as powerful for repetitive refactors or complex tasks when paired with solid test suites and external orchestration.
Others view very long runs as an anti-pattern, noting drift, instability, and diminishing returns.

Guardrails, downgrades, and security

Many users report frequent silent or semi-silent downgrades to Opus on security, biotech, or perceived “model development” topics, undermining trust and reproducibility.
This contradicts the benchmark report of “zero safety refusals,” prompting speculation that Fable behaves differently under evaluation or that classifiers are context-sensitive in unclear ways.
Some note Fable can identify security/memory bugs but is blocked from fully fixing or testing them.

Cost, access, and product positioning

API pricing is widely viewed as extremely expensive; a few report burning around $2k on experiments, while subscription users feel heavily subsidized.
Several expect Fable to be removed from flat‑rate plans and see economic pressures as limiting practical use to high‑value tasks.

Broader impressions

Many see Fable as excellent for planning, reviewing, and complex reasoning; weaker as a dependable day‑to‑day coder.
There’s underlying concern about slowing capability gains, increasing costs, heavy-handed safety layers, and a possible coming “AI bubble” correction.

Related topics