Claude Code daily benchmarks for degradation tracking

Benchmark design & statistical concerns

  • The tracker shows ~4% lower SWE-bench-Pro accuracy over a month, but many argue the daily results are too noisy to interpret.
  • Only 50 tasks are run once per day; a SWE-bench co-author suggests running ≥300 tasks, multiple times per day, and averaging the results to reduce variance.
  • Several commenters say the “±14% significance threshold” and the current confidence-interval logic are not statistically sound for claiming “significant” changes (see the sketch after this list for where a figure like ±14% can come from).
  • Others note that the baseline choice (starting at Jan 8) feels arbitrary and could look cherry-picked without clearer justification.
  • Some suggest testing against an open-source model as a control to detect drift in the benchmark itself.
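
For intuition on the sample-size and threshold points, here is a rough back-of-the-envelope sketch, assuming a simple binomial model and a pass rate near 50% (an illustration, not the tracker’s actual methodology): a 95% confidence interval on 50 pass/fail tasks is about ±14 percentage points, which lines up with the quoted threshold, while 300 tasks bring it down to roughly ±6 points, and multiple runs per day tighten it further.

    import math

    def ci_half_width(p: float, n: int, z: float = 1.96) -> float:
        """Half-width of a normal-approximation 95% CI for a pass rate p measured over n trials."""
        return z * math.sqrt(p * (1 - p) / n)

    # One daily run on 50 tasks vs. the suggested >=300 tasks (hypothetical pass rate of 50%).
    print(f"n=50 tasks:    +/-{ci_half_width(0.5, 50):.1%}")   # ~13.9%, close to the quoted +/-14%
    print(f"n=300 tasks:   +/-{ci_half_width(0.5, 300):.1%}")  # ~5.7%

    # Averaging k runs per day narrows it further, if run-to-run noise is roughly independent
    # (optimistic, since results on the same task are correlated across runs).
    print(f"n=300, 3 runs: +/-{ci_half_width(0.5, 300 * 3):.1%}")  # ~3.3%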

Is there real degradation? Mixed experiences

  • Some heavy users report that Claude Code / Opus 4.5 “feels” clearly worse in recent weeks: more confusion and looping, missed simple fixes, ignored SOPs or instructions, and worse prompt adherence on non-coding tasks.
  • Others report no regression on their own stable coding benchmarks, or even steady improvement as they refine workflows and prompts.
  • Several note the “honeymoon–hangover” effect: as you learn a tool’s limits and your tasks become harder, it can feel like degradation even if the model is unchanged.

Potential causes of variation (in-thread speculation)

  • Changes to Claude Code’s harness, tools, or system prompts—rather than the base model—are widely suspected; the tracker itself uses the frequently updated CLI.
  • Infrastructure bugs and non-deterministic inference under different batch/load conditions are raised as non-malicious explanations.
  • Many speculate about deliberate cost optimizations: quantization, serving smaller models under load, reduced “thinking time,” or activating fewer experts in a mixture-of-experts (MoE) model, despite Anthropic’s public claim that it never reduces model quality in response to demand.
  • Others argue the drift could be from A/B testing of prompts/tools or normal stochastic variation.

Time-of-day effects and load

  • Multiple anecdotes claim worse quality during US peak hours and better results in the early morning or on holidays; some attribute this to load-based changes, others to human factors (fatigue, expectation bias).

Value of third-party tracking & calls for rigor

  • Commenters see independent, longitudinal evals as essential to detect silent quality changes, especially as providers face cost pressure.
  • Many urge making this tracker more statistically rigorous (a larger task sample, intraday runs, clearer CI methodology) and extending it to more models and providers; one conventional way to frame such a significance test is sketched below.
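
As an illustration of what a clearer significance test could look like, here is a two-proportion z-test on made-up numbers (not the tracker’s current or recommended method): comparing a baseline window against a recent window, a 4-point drop measured over 500 tasks per window does not reach the conventional 5% significance level.

    import math

    def two_proportion_z(p1: float, n1: int, p2: float, n2: int) -> float:
        """z statistic for the difference between two pass rates, using a pooled standard error."""
        pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
        se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
        return (p1 - p2) / se

    # Hypothetical: a 10-day baseline window (10 x 50 tasks) at 46% vs. a recent window at 42%.
    z = two_proportion_z(0.46, 500, 0.42, 500)
    print(f"z = {z:.2f}")  # ~1.27, below the ~1.96 cutoff for significance at the 5% level

Under this framing, day-over-day swings on a single 50-task run would rarely clear the bar, which is consistent with the criticism that the current daily readings are mostly noise.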

Anthropic’s response

  • A Claude Code team member confirms that a “harness issue” was introduced on Jan 26 and rolled back on Jan 28; it affected the app’s tooling/agent loop, not the base model. They say new evals were added to catch this class of bug.