Claude Code daily benchmarks for degradation tracking
Benchmark design & statistical concerns
- The tracker shows ~4% lower SWE-bench-Pro accuracy over a month, but many argue the daily results are too noisy to interpret.
- Only 50 tasks are run once per day; a SWE-bench co-author suggests ≥300 tasks and multiple runs per day, averaged, to reduce variance (a quick variance sketch follows this list).
- Several commenters say the “±14% significance threshold” and current confidence-interval logic are not statistically sound for claiming “significant” changes.
- Others note the baseline choice (start at Jan 8) feels arbitrary and could look cherry‑picked without clearer justification.
- Some suggest testing against an open-source model as a control to detect drift in the benchmark itself.
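To make the variance point concrete, here is a minimal sketch of the normal-approximation confidence interval for a pass rate over n pass/fail tasks; this is not the tracker's actual methodology, and the ~60% pass rate is purely illustrative. At n = 50 the half-width comes out around ±13–14 percentage points, in the same range as the "±14%" threshold commenters question, while n = 300 tightens it to roughly ±5–6 points.

```python
import math

def accuracy_ci(passes: int, total: int, z: float = 1.96) -> tuple[float, float, float]:
    """Normal-approximation 95% CI for a pass rate over `total` pass/fail tasks."""
    p = passes / total
    se = math.sqrt(p * (1 - p) / total)          # binomial standard error
    return p, p - z * se, p + z * se

# Illustrative only: a ~60% pass rate measured on 50 vs. 300 tasks.
for n in (50, 300):
    p, lo, hi = accuracy_ci(round(0.6 * n), n)
    print(f"n={n:>3}: {p:.0%}  95% CI ≈ [{lo:.1%}, {hi:.1%}]  (±{(hi - lo) / 2:.1%})")
# n= 50: ~±13.6 points -> a 4-point swing sits inside single-day noise
# n=300: ~±5.5 points  -> averaging several runs per day tightens this further
```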
Is there real degradation? Mixed experiences
- Some heavy users report that Claude Code / Opus 4.5 “feels” clearly worse in recent weeks: more confusion, looping, missed simple fixes, ignored SOPs or instructions, and worse prompt adherence on non‑coding tasks.
- Others report no regression on their own stable coding benchmarks, or even steady improvement as they refine workflows and prompts.
- Several note the “honeymoon–hangover” effect: as you learn a tool’s limits and your tasks become harder, it can feel like degradation even if the model is unchanged.
Potential causes of variation (speculative in-thread)
- Changes to Claude Code’s harness, tools, or system prompts—rather than the base model—are widely suspected; the tracker itself uses the frequently updated CLI.
- Infrastructure bugs and non-deterministic inference under different batch/load conditions are raised as non-malicious explanations.
- Many speculate about deliberate cost optimizations: quantization, smaller models under load, reduced “thinking time,” or activating fewer MoE experts, despite Anthropic’s public claim that it never reduces model quality due to demand.
- Others argue the drift could be from A/B testing of prompts/tools or normal stochastic variation.
Time-of-day effects and load
- Multiple anecdotes claim worse quality during US peak hours and better results in the early morning or on holidays; some attribute this to load-based changes, others to human factors (fatigue, expectation bias).
Value of third-party tracking & calls for rigor
- Commenters see independent, longitudinal evals as essential to detect silent quality changes, especially as providers face cost pressure.
- Many urge making the tracker more statistically rigorous (larger samples, intraday runs, clearer CI methodology) and extending it to more models and providers; a sketch of one such pooled comparison follows this list.
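As a sketch of what a more rigorous comparison could look like (an assumed methodology, not what the tracker does): pool per-task pass/fail results across several runs per day, then compare a baseline window against the current window with a two-proportion z-test. The run counts and pass rates below are illustrative.

```python
import math

def two_proportion_z(passes_a: int, n_a: int, passes_b: int, n_b: int) -> tuple[float, float]:
    """Pooled two-proportion z-test: did the pass rate change between period A and period B?"""
    p_a, p_b = passes_a / n_a, passes_b / n_b
    p_pool = (passes_a + passes_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))   # two-sided, normal approximation
    return z, p_value

# Illustrative: 3 runs/day x 14 days x 50 tasks = 2,100 task attempts per window.
z, p = two_proportion_z(passes_a=1260, n_a=2100,   # baseline window, 60% pass rate
                        passes_b=1176, n_b=2100)   # current window, 56% pass rate
print(f"z = {z:.2f}, p = {p:.3f}")  # roughly z ≈ -2.6, p ≈ 0.009 with these numbers
```

Since the same task set repeats each day, a paired per-task comparison (e.g., McNemar's test on which tasks flipped from pass to fail) would be more powerful than this unpaired test, and running an open-weights model as a fixed control, as suggested above, would help separate benchmark drift from model drift.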
Anthropic’s response
- A Claude Code team member confirms a “harness issue” introduced on Jan 26 and rolled back on Jan 28, affecting the app’s tooling/agent loop, not the base model. They say new evals were added to catch this class of bug.