Claude Code daily benchmarks for degradation tracking
Benchmark design & statistical concerns
- The tracker shows ~4% lower SWE-bench-Pro accuracy over a month, but many argue the daily results are too noisy to interpret.
- Only 50 tasks are run once per day; a SWE-bench co-author suggests ≥300 tasks and multiple runs per day, averaged, to reduce variance (a quick variance sketch follows this list).
- Several commenters say the “±14% significance threshold” and current confidence-interval logic are not statistically sound for claiming “significant” changes.
- Others note the baseline choice (start at Jan 8) feels arbitrary and could look cherry‑picked without clearer justification.
- Some suggest testing against an open-source model as a control to detect drift in the benchmark itself.
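To make the variance point concrete, here is a minimal sketch of the normal-approximation confidence interval for a pass rate over n pass/fail tasks; this is not the tracker's actual methodology, and the ~60% pass rate is purely illustrative. At n = 50 the half-width comes out around ±13–14 percentage points, in the same range as the "±14%" threshold commenters question, while n = 300 tightens it to roughly ±5–6 points.

```python
import math

def accuracy_ci(passes: int, total: int, z: float = 1.96) -> tuple[float, float, float]:
    """Normal-approximation 95% CI for a pass rate over `total` pass/fail tasks."""
    p = passes / total
    se = math.sqrt(p * (1 - p) / total)          # binomial standard error
    return p, p - z * se, p + z * se

# Illustrative only: a ~60% pass rate measured on 50 vs. 300 tasks.
for n in (50, 300):
    p, lo, hi = accuracy_ci(round(0.6 * n), n)
    print(f"n={n:>3}: {p:.0%}  95% CI ≈ [{lo:.1%}, {hi:.1%}]  (±{(hi - lo) / 2:.1%})")
# n= 50: ~±13.6 points -> a 4-point swing sits inside single-day noise
# n=300: ~±5.5 points  -> averaging several runs per day tightens this further
```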
Is there real degradation? Mixed experiences
- Some heavy users report that Claude Code / Opus 4.5 “feels” clearly worse in recent weeks: more confusion, looping, missed simple fixes, ignored SOPs or instructions, and worse prompt adherence on non‑coding tasks.
- Others report no regression on their own stable coding benchmarks, or even steady improvement as they refine workflows and prompts.
- Several note the “honeymoon–hangover” effect: as you learn a tool’s limits and your tasks become harder, it can feel like degradation even if the model is unchanged.
Potential causes of variation (speculative in-thread)
- Changes to Claude Code’s harness, tools, or system prompts—rather than the base model—are widely suspected; the tracker itself uses the frequently updated CLI.
- Infrastructure bugs and non-deterministic inference under different batch/load conditions are raised as non-malicious explanations.
- Many speculate about deliberate cost optimizations: quantization, smaller models under load, reduced “thinking time,” or activating fewer MoE experts, despite Anthropic’s public claim that it never reduces model quality due to demand.
- Others argue the drift could be from A/B testing of prompts/tools or normal stochastic variation.
Time-of-day effects and load
- Multiple anecdotes claim worse quality during US peak hours and better results in the early morning or on holidays; some attribute this to load-based changes, others to human factors (fatigue, expectation bias).
Value of third-party tracking & calls for rigor
- Commenters see independent, longitudinal evals as essential to detect silent quality changes, especially as providers face cost pressure.
- Many urge making the tracker more statistically rigorous (larger samples, intraday runs, clearer CI methodology) and extending it to more models and providers; a sketch of one such pooled comparison follows this list.
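As a sketch of what a more rigorous comparison could look like (an assumed methodology, not what the tracker does): pool per-task pass/fail results across several runs per day, then compare a baseline window against the current window with a two-proportion z-test. The run counts and pass rates below are illustrative.

```python
import math

def two_proportion_z(passes_a: int, n_a: int, passes_b: int, n_b: int) -> tuple[float, float]:
    """Pooled two-proportion z-test: did the pass rate change between period A and period B?"""
    p_a, p_b = passes_a / n_a, passes_b / n_b
    p_pool = (passes_a + passes_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))   # two-sided, normal approximation
    return z, p_value

# Illustrative: 3 runs/day x 14 days x 50 tasks = 2,100 task attempts per window.
z, p = two_proportion_z(passes_a=1260, n_a=2100,   # baseline window, 60% pass rate
                        passes_b=1176, n_b=2100)   # current window, 56% pass rate
print(f"z = {z:.2f}, p = {p:.3f}")  # roughly z ≈ -2.6, p ≈ 0.009 with these numbers
```

Since the same task set repeats each day, a paired per-task comparison (e.g., McNemar's test on which tasks flipped from pass to fail) would be more powerful than this unpaired test, and running an open-weights model as a fixed control, as suggested above, would help separate benchmark drift from model drift.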
Anthropic’s response
- A Claude Code team member confirms a “harness issue” introduced on Jan 26 and rolled back on Jan 28, affecting the app’s tooling/agent loop, not the base model. They say new evals were added to catch this class of bug.