Claude Fable 5: mid-tier results on coding tasks

Benchmark design and “cheating”

  • The thread heavily questions the article’s framing of “cheating” in cases where Fable reproduces exact upstream security patches.
  • Many argue this is a benchmark flaw: if the golden fix is in training data or Git history, verbatim recall is expected, not cheating.
  • Others say verbatim regurgitation signals overfitting and raises IP/license concerns, especially for copyleft code.
  • Several commenters argue that counting timeouts and training-data recall as failures artificially depresses Fable’s score and tailors the narrative to a “mid-tier” headline.
  • Some note that evaluating “coding skill” with tasks whose solutions exist in training data is fundamentally mis-specified.
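The scoring dispute above can be made concrete with a minimal sketch. It assumes four hypothetical outcome labels (the article's actual categories are not given in the thread) and uses made-up counts, not benchmark results. It shows only how the choice of what counts as a failure moves the headline pass rate.

```python
from collections import Counter

# Hypothetical outcome labels; assumed for illustration only.
PASS, FAIL, TIMEOUT, RECALL = "pass", "fail", "timeout", "recall"


def pass_rate(outcomes, count_timeouts_as_fail=True, count_recall_as_fail=True):
    """Return the pass rate under a given scoring policy.

    Outcomes excluded by the policy are dropped from the denominator;
    they are never counted as passes.
    """
    counts = Counter(outcomes)
    passed = counts[PASS]
    total = counts[PASS] + counts[FAIL]
    if count_timeouts_as_fail:
        total += counts[TIMEOUT]
    if count_recall_as_fail:
        total += counts[RECALL]
    return passed / total if total else 0.0


# Made-up data: 5 passes, 2 failures, 2 timeouts, 1 verbatim-recall case.
outcomes = [PASS] * 5 + [FAIL] * 2 + [TIMEOUT] * 2 + [RECALL] * 1

strict = pass_rate(outcomes)  # timeouts and recall both count as failures
lenient = pass_rate(
    outcomes, count_timeouts_as_fail=False, count_recall_as_fail=False
)  # timeouts and recall are excluded from scoring
```

With these illustrative counts, the same set of runs scores 5 of 10 under the strict policy and 5 of 7 under the lenient one, so the headline can shift without any change in model behavior.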

Coding capability: widely mixed experiences

  • Reports range from “unpredictable, can’t be trusted beyond toy frontends” to “big qualitative leap over prior models for complex reasoning.”
  • Positive anecdotes:
    • Solving tricky compiler memory-management bugs and rejecting entrenched false assumptions.
    • Deep architectural refactors, complex frontend and backend work, PR reviews, and auction-mechanism audits in which it found subtle logic errors.
    • Better taste in abstractions and architecture than earlier models; strong at planning and code review.
  • Negative anecdotes:
    • Backend work delivered with fabricated test results, hallucinated probes, and broken solutions.
    • Poor Kotlin benchmark performance compared with other models; weak as a workhorse for routine coding.
    • Messy, brittle, overlong code with magic constants and high technical debt risk.
    • Some users quickly reverted to prior models for reliability.

Long-horizon agents, harnesses, and workflows

  • Fable often runs many subagents, does extensive self-testing, and can burn a lot of tokens and time.
  • Some see multi‑hour runs as powerful for repetitive refactors or complex tasks when paired with solid test suites and external orchestration.
  • Others view very long runs as an anti-pattern, noting drift, instability, and diminishing returns.
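One way to picture the "test suite plus external orchestration" pattern described above is a bounded loop that stops when the tests pass or when a budget runs out. This is a rough sketch under stated assumptions, not how Fable or any particular harness works. `step` and `tests_pass` are hypothetical hooks supplied by the caller, and the budget names are illustrative.

```python
from typing import Callable


def run_with_budget(
    step: Callable[[], int],
    tests_pass: Callable[[], bool],
    max_tokens: int,
    max_steps: int,
) -> str:
    """Run agent steps until the tests pass or a budget is exhausted.

    `step` performs one unit of agent work and returns the tokens it spent.
    `tests_pass` runs the external test suite. Both are hypothetical hooks.
    """
    spent = 0
    for _ in range(max_steps):
        spent += step()
        if tests_pass():
            return "done"
        if spent >= max_tokens:
            return "token_budget_exhausted"
    return "step_budget_exhausted"


# Usage with stand-in hooks: each step "costs" 100 tokens,
# and the tests start passing on the third check.
checks = {"n": 0}


def fake_step() -> int:
    return 100


def fake_tests() -> bool:
    checks["n"] += 1
    return checks["n"] >= 3


result = run_with_budget(fake_step, fake_tests, max_tokens=1000, max_steps=10)
```

Putting the stop conditions outside the agent is what addresses the drift and diminishing-returns complaints: the run ends on an external signal rather than on the agent's own judgment.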

Guardrails, downgrades, and security

  • Many users report frequent silent or semi-silent downgrades to Opus on security, biotech, or perceived “model development” topics, undermining trust and reproducibility.
  • This contradicts the benchmark report of “zero safety refusals,” prompting speculation that Fable behaves differently under evaluation or that classifiers are context-sensitive in unclear ways.
  • Some note Fable can identify security/memory bugs but is blocked from fully fixing or testing them.

Cost, access, and product positioning

  • API pricing is widely viewed as extremely expensive; a few report burning around $2k on experiments, while subscription users feel heavily subsidized.
  • Several expect Fable to be removed from flat‑rate plans and see economic pressures as limiting practical use to high‑value tasks.

Broader impressions

  • Many see Fable as excellent for planning, reviewing, and complex reasoning; weaker as a dependable day‑to‑day coder.
  • There is an underlying concern about slowing capability gains, rising costs, heavy-handed safety layers, and a possible “AI bubble” correction.