Claude Opus 4.1
Initial impressions & performance
- Early testers report Opus 4.1 feels similar to Opus 4 in casual use: sometimes slightly better at coding and planning, but often slower and not obviously improved.
- Some users see noticeably better adherence to instructions and multi-step plans, especially in Claude Code and long troubleshooting sessions.
- Others say it performs worse than Opus 4.0 in Claude Code, with more mistakes and a “Sonnet-like” feel.
Benchmarks, versioning, and expectations
- Many note Anthropic’s own charts show only modest gains; some argue improvements look small enough to be noise or “one more training run.”
- Others point to specific coding benchmarks (e.g., “agentic coding,” junior dev evals) where 4.1’s jump is described as a full standard deviation and “a big improvement.”
- The minor version bump (4 → 4.1) is seen as signaling incremental, not transformative, progress; some lament a perceived slowdown in frontier-model leaps.
Opus vs Sonnet for coding
- Strong disagreement:
- One camp: Opus is clearly superior for complex reasoning, debugging, architecture, long unsupervised tasks, and big-picture analysis.
- Another camp: Sonnet is faster, cheaper, more predictable, and often “good enough” for interactive coding; some even call Sonnet “much better overall.”
- Common hybrid strategies (sketched in code after this list):
- Opus for design, analysis, planning, or “plan mode”; Sonnet for implementation and routine edits.
- Use Sonnet by default, switch to Opus when Sonnet gets “stuck” or hallucinates.
- Several note Opus is “ridiculously overpriced” via API and only attractive under subscription plans.
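A minimal sketch of the hybrid routing idea, assuming the Anthropic Python SDK and an ANTHROPIC_API_KEY in the environment; the model aliases and the plan/implement split are illustrative choices, not recommendations from the thread:

```python
# Minimal "Opus plans, Sonnet implements" routing sketch.
# Assumes the Anthropic Python SDK; the model aliases are illustrative and
# may need updating against Anthropic's current model list.
from anthropic import Anthropic

client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment

MODELS = {
    "plan": "claude-opus-4-1",         # assumed alias for Opus 4.1
    "implement": "claude-sonnet-4-0",  # assumed alias for Sonnet 4
}

def ask(phase: str, prompt: str) -> str:
    """Send the prompt to the model chosen for this phase and return its text."""
    response = client.messages.create(
        model=MODELS[phase],
        max_tokens=2048,
        messages=[{"role": "user", "content": prompt}],
    )
    # Concatenate the text blocks from the response.
    return "".join(b.text for b in response.content if b.type == "text")

plan = ask("plan", "Outline a refactor of the payments module into smaller services.")
code = ask("implement", f"Implement step 1 of this plan:\n\n{plan}")
```

The same shape covers the “switch to Opus when stuck” strategy: re-issue a failing “implement” prompt under the “plan” model.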
Pricing, economics, and limits
- Heavy users complain Opus API costs and Claude Max usage caps make serious work difficult; some hit Opus limits within minutes.
- Others report excellent economics on Max plans when combined with caching and disciplined model selection; tools like ccusage are used to estimate “real” API-equivalent spend (a rough cost sketch follows this list).
- Debate over whether Opus’s marginal quality gain justifies ~10x Sonnet’s price, especially when the differences feel small in practice.
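A rough, back-of-the-envelope version of what such tools report, with all prices and token counts as labeled assumptions (list prices per million tokens from memory; prompt-caching discounts ignored):

```python
# Back-of-the-envelope API-equivalent spend from token counts, in the spirit
# of tools like ccusage (this is NOT ccusage's code). Prices are assumed USD
# list prices per million tokens; prompt-caching discounts are ignored.
ASSUMED_PRICES_PER_MTOK = {
    "opus":   {"input": 15.00, "output": 75.00},
    "sonnet": {"input": 3.00,  "output": 15.00},
}

def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimated USD cost for one model's usage at the assumed list prices."""
    p = ASSUMED_PRICES_PER_MTOK[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

# Hypothetical month of heavy agentic use on a flat-rate Max plan.
tokens_in, tokens_out = 120_000_000, 8_000_000
print(f"Opus API-equivalent:   ${estimate_cost('opus', tokens_in, tokens_out):,.2f}")
print(f"Sonnet API-equivalent: ${estimate_cost('sonnet', tokens_in, tokens_out):,.2f}")
```

Comparing an estimate like this against a flat Max subscription fee is how commenters arrive at the “excellent economics” claim; the token counts above are invented for illustration.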
Release timing & competition
- Many notice multiple labs (Anthropic, OpenAI, others) releasing models within hours of one another and read it as PR “counterprogramming,” not pure coincidence.
- Some speculate Anthropic’s teaser about “substantially larger improvements in the coming weeks” is partly defensive against an anticipated GPT-5 launch.
- Others with industry experience push back on the idea of reactive, same-day counterprogramming: real launches take weeks of preparation and are often queued, then timed for attention.
Claude Code, tools, and onboarding confusion
- There is extensive discussion that the ecosystem (Claude web, Claude Code CLI, API, third-party IDEs like Cursor/Cline/Copilot, multiple models/tiers) feels overwhelming to newcomers.
- Suggested “simple starts”:
- Pay for Claude Pro or Max and use Claude Code in a terminal, with your usual editor.
- Or install Cursor (VS Code-based) and switch between Sonnet/Opus there.
- Clarifications:
- Claude Code can be used via subscription or per-token billing; it essentially wraps the API with an agentic, project-wide editing loop (sketched below).
- Sub-agents in Claude Code are highlighted as powerful for isolating context, delegating sub-tasks, and combining models.
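To ground the “agentic, project-wide editing loop” description, here is a stripped-down sketch of such a loop over the Anthropic Messages API. It is not Claude Code’s implementation: the single edit_file tool, its schema, and the absence of file reading, sandboxing, or diff review are simplifications for illustration.

```python
# Stripped-down agentic editing loop: the model requests tool calls, the
# harness executes them and feeds results back, until no more tools are
# requested. NOT Claude Code's actual implementation; the edit_file tool is
# invented for this sketch (a real harness would also expose read/search
# tools and review changes before writing them).
from pathlib import Path
from anthropic import Anthropic

client = Anthropic()

TOOLS = [{
    "name": "edit_file",
    "description": "Overwrite a file in the project with new contents.",
    "input_schema": {
        "type": "object",
        "properties": {
            "path": {"type": "string"},
            "contents": {"type": "string"},
        },
        "required": ["path", "contents"],
    },
}]

def run_agent(task: str, model: str = "claude-opus-4-1") -> None:
    messages = [{"role": "user", "content": task}]
    while True:
        response = client.messages.create(
            model=model, max_tokens=4096, tools=TOOLS, messages=messages,
        )
        messages.append({"role": "assistant", "content": response.content})
        if response.stop_reason != "tool_use":
            break  # model finished without requesting more edits
        results = []
        for block in response.content:
            if block.type == "tool_use" and block.name == "edit_file":
                Path(block.input["path"]).write_text(block.input["contents"])
                results.append({"type": "tool_result",
                                "tool_use_id": block.id,
                                "content": "ok"})
        messages.append({"role": "user", "content": results})

run_agent("Rename the helper fetch_rows to load_rows across this project.")
```

Sub-agents fit the same shape conceptually: a parent loop delegates a sub-task to a child loop with its own prompt, tool set, and context window, then folds only the child’s summary back into the parent conversation.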
Quality regressions, slowness, and behavior
- Several users complain that:
- Opus 4.1 and Sonnet 4 feel slower at times.
- Sonnet’s style has drifted toward more filler, lists, and “sycophancy,” undermining its earlier appeal.
- On some days, overall output quality feels degraded, with more flailing and less crisp reasoning.
- Others counter that expectations rise quickly, projects grow in size, and “context rot” or long sessions might explain perceived decline.
Benchmarks, reliability, and skepticism
- Some external benchmarks (e.g., LLM-to-SQL) reportedly do not show Opus 4.1 topping Opus 4.0, raising questions about Anthropic’s highlighted metrics.
- Users call for more rigorous, repeated, statistically sound benchmarking instead of single-run numbers and glossy charts (see the sketch after this list).
- There is skepticism that frontier models may be overfitted to benchmark suites, reducing their value as indicators of real-world performance.
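As a sketch of what “repeated, statistically sound” reporting could look like, the snippet below scores each model several times and reports a mean with a rough 95% confidence interval instead of a single number; the scoring function merely simulates noisy results so the example runs, and the model names are illustrative aliases.

```python
# Repeated-run benchmarking with a simple confidence interval, instead of a
# single-run number. score_model() simulates noisy scores so the sketch runs;
# replace it with a real eval harness (e.g., an LLM-to-SQL suite).
import random
import statistics

def score_model(model: str) -> float:
    """Stand-in for one full eval run returning an accuracy in [0, 1]."""
    base = 0.78 if model.endswith("4-1") else 0.76  # invented numbers
    return min(1.0, max(0.0, random.gauss(base, 0.02)))

def summarize(model: str, runs: int = 10) -> tuple[float, float]:
    scores = [score_model(model) for _ in range(runs)]
    mean = statistics.mean(scores)
    # Normal-approximation 95% CI on the mean; a t-interval or bootstrap is
    # more appropriate at small run counts, but this shows the idea.
    half_width = 1.96 * statistics.stdev(scores) / len(scores) ** 0.5
    return mean, half_width

for model in ("claude-opus-4-1", "claude-opus-4-0"):
    mean, hw = summarize(model)
    print(f"{model}: {mean:.3f} ± {hw:.3f} (95% CI over 10 runs)")
```

If the two intervals overlap heavily, a single-run delta on a glossy chart says little about which model is actually better.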
Openness and model strategy
- One thread criticizes Anthropic for never open-sourcing its models, branding the company “less open” than some competitors.
- Others note positives on “openness of behavior”: visible chain-of-thought in some settings, an explicit “thinking budget” (see the sketch after this list), and relatively low-friction API access compared to KYC-heavy rivals.
- No consensus emerges on whether this constitutes meaningful openness versus just better product ergonomics.
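For reference on the “thinking budget” point: extended thinking is requested per call in the Messages API, roughly as sketched below (the budget, model alias, and prompt are illustrative, and max_tokens must exceed the thinking budget; check current API docs for exact limits).

```python
# Minimal sketch of requesting an explicit "thinking budget" via the
# Anthropic Messages API's extended-thinking option. Values are illustrative.
from anthropic import Anthropic

client = Anthropic()

response = client.messages.create(
    model="claude-opus-4-1",   # assumed alias
    max_tokens=16000,          # must exceed the thinking budget
    thinking={"type": "enabled", "budget_tokens": 8000},
    messages=[{"role": "user", "content": "Plan a migration from REST to gRPC."}],
)

# Thinking blocks (when returned) are separate from the final text blocks.
for block in response.content:
    if block.type == "thinking":
        print("[thinking]", block.thinking[:200], "...")
    elif block.type == "text":
        print(block.text)
```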
Productivity claims and limits
- Some report dramatic productivity boosts (2–10x) using Claude Code for refactors, test coverage, CI pipelines, and tech-debt cleanup; others argue such gains are overstated.
- A recurring theme: the new bottleneck is code review and trust. Reviewing AI-generated code (which you didn’t author) can be slower and cognitively heavier, capping real-world speedups.
- A few emphasize that large wins often come from using LLMs to tackle tasks previously too tedious to attempt at all, not just speeding up existing workflows.