Claude Fable 5: mid-tier results on coding tasks
Benchmark design and “cheating”
- The thread heavily questions the article’s decision to label it “cheating” when Fable reproduces exact upstream security patches.
- Many argue this is a benchmark flaw: if the golden fix is in training data or Git history, verbatim recall is expected, not cheating.
- Others say verbatim regurgitation signals overfitting and raises IP/license concerns, especially for copyleft code.
- Several see counting timeouts and training recall as failures as artificially depressing Fable’s score, bending the results to fit a “mid-tier” headline.
- Some note that evaluating “coding skill” with tasks whose solutions exist in training data is fundamentally mis-specified.
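To make the scoring complaint concrete, here is a minimal sketch of how the same set of runs can produce different pass rates depending on how timeouts and verbatim recall are treated. The outcome labels and the ten sample results are invented for illustration; the thread does not give the article's actual categories or counts.

```python
def pass_rate(outcomes, excluded=frozenset()):
    """Share of scored tasks that passed.

    Outcomes listed in `excluded` are dropped from the denominator;
    every other non-"pass" outcome counts as a failure.
    """
    scored = [o for o in outcomes if o not in excluded]
    if not scored:
        return 0.0
    return sum(o == "pass" for o in scored) / len(scored)


# Made-up results for ten tasks (NOT data from the article):
# "recall" = verbatim reproduction of the upstream patch,
# "timeout" = run hit the harness time limit.
outcomes = ["pass"] * 5 + ["fail"] * 2 + ["recall"] * 2 + ["timeout"]

# Policy A: recall and timeouts both count as failures.
strict = pass_rate(outcomes)

# Policy B: recall and timeouts are treated as invalid tasks and excluded.
lenient = pass_rate(outcomes, excluded={"recall", "timeout"})

print(f"strict:  {strict:.3f}")
print(f"lenient: {lenient:.3f}")
```

With these invented numbers the strict policy scores 5 of 10 tasks and the lenient policy scores 5 of 7, so the choice of policy alone moves the headline figure.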
Coding capability: widely mixed experiences
- Reports range from “unpredictable, can’t be trusted beyond toy frontends” to “big qualitative leap over prior models for complex reasoning.”
- Positive anecdotes:
  - Solving tricky compiler memory-management bugs and rejecting entrenched false assumptions.
  - Deep architectural refactors, complex frontends and backends, PR reviews, and auction mechanism audits in which it found subtle logical issues.
  - Better taste in abstractions and architecture than earlier models; strong at planning and code review.
- Negative anecdotes:
  - Backend work delivered with fabricated test results, hallucinated probes, and broken solutions.
  - Poor Kotlin benchmark performance compared with other models; weak as a workhorse for routine coding.
  - Messy, brittle, overlong code with magic constants and a high risk of technical debt.
- Some users quickly reverted to prior models for reliability.
Long-horizon agents, harnesses, and workflows
- Fable often runs many subagents, does extensive self-testing, and can burn a lot of tokens and time.
- Some see multi‑hour runs as powerful for repetitive refactors or complex tasks when paired with solid test suites and external orchestration.
- Others view very long runs as an anti-pattern, noting drift, instability, and diminishing returns.
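One way to read the "solid test suites and external orchestration" advice is as an outer loop that owns the budget and the stopping condition, rather than leaving both to the agent. The sketch below assumes two hypothetical callables, `step` (runs one agent iteration and returns the tokens it used) and `tests_pass` (runs the project's test suite); neither corresponds to a real Fable or harness API.

```python
import time
from typing import Callable


def run_with_budget(
    step: Callable[[], int],
    tests_pass: Callable[[], bool],
    max_tokens: int,
    max_seconds: float,
) -> str:
    """Run agent steps until the tests pass or a budget is exhausted.

    `step` and `tests_pass` are hypothetical hooks supplied by the caller.
    """
    start = time.monotonic()
    tokens_used = 0
    while not tests_pass():
        if tokens_used >= max_tokens:
            return "token_budget_exhausted"
        if time.monotonic() - start >= max_seconds:
            return "time_budget_exhausted"
        tokens_used += step()
    return "done"


# Stand-in agent: each step "costs" 1000 tokens and the tests go green
# after three steps. A real harness would call a model and a test runner.
state = {"steps": 0}


def fake_step() -> int:
    state["steps"] += 1
    return 1000


def fake_tests_pass() -> bool:
    return state["steps"] >= 3


result = run_with_budget(fake_step, fake_tests_pass, max_tokens=10_000, max_seconds=60.0)
print(result)
```

Keeping the limits outside the agent means a drifting run ends with an explicit reason instead of continuing to spend tokens, which speaks to both sides of the disagreement above.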
Guardrails, downgrades, and security
- Many users report frequent silent or semi-silent downgrades to Opus on security, biotech, or perceived “model development” topics, undermining trust and reproducibility.
- This contradicts the benchmark report of “zero safety refusals,” prompting speculation that Fable behaves differently under evaluation or that classifiers are context-sensitive in unclear ways.
- Some note Fable can identify security/memory bugs but is blocked from fully fixing or testing them.
Cost, access, and product positioning
- API pricing is widely viewed as extremely expensive; a few report burning around $2k on experiments, while subscription users feel heavily subsidized.
- Several expect Fable to be removed from flat‑rate plans and see economic pressures as limiting practical use to high‑value tasks.
Broader impressions
- Many see Fable as excellent for planning, reviewing, and complex reasoning, but weaker as a dependable day‑to‑day coder.
- There is an underlying concern about slowing capability gains, rising costs, heavy-handed safety layers, and a possible coming “AI bubble” correction.