Claude Sonnet 4.5

Model positioning and Opus vs Sonnet

  • Confusion over whether Opus is still Anthropic’s “best” model: benchmarks show Sonnet 4.5 surpassing Opus 4.1 on code and math, but many users still prefer Opus 4.1 in practice, especially for planning and strictness.
  • Several expect an Opus 4.5 to re-establish the tiering; some Max subscribers feel they’re now “overpaying for Opus” if Sonnet is better.

Real‑world coding performance (Claude vs GPT‑5‑Codex vs others)

  • Experiences are sharply split:
    • Some report Sonnet 4.5 + Claude Code as the best they’ve used: faster, more focused, strong at refactors, infra debugging, tests, and math-heavy tasks; code interpreter + tools can handle non‑trivial projects.
    • Others find it clearly worse than GPT‑5‑Codex: superficial changes, poor decision‑making, failure to reuse existing auth or harnesses, flaky tool use, and more “giving up” or overengineering.
  • Gemini 2.5 Pro is frequently praised for its very large context window and planning ability; a common workflow is to plan with Gemini while Claude or Codex handles implementation.
  • Several note that all models degrade as the context window fills and require active context management (/context, /compact, /new, logs, design docs); a minimal sketch of the idea follows this list.
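
A minimal Python sketch of the idea behind /compact-style context management: once a chat history’s estimated token count nears a budget, older turns are collapsed into a single summary stub so recent turns keep full fidelity. The 4-characters-per-token estimate, the budget, and the stub summary are illustrative assumptions, not any vendor’s actual implementation.

    def estimate_tokens(text: str) -> int:
        # Rough heuristic: ~4 characters per token for English text.
        return len(text) // 4

    def compact_history(messages: list[dict], budget: int = 6000,
                        keep_recent: int = 6) -> list[dict]:
        # Collapse older messages into one stub once the budget is exceeded.
        total = sum(estimate_tokens(m["content"]) for m in messages)
        if total <= budget or len(messages) <= keep_recent:
            return messages
        old, recent = messages[:-keep_recent], messages[-keep_recent:]
        # A real tool would summarize `old` with a model call; a stub stands in here.
        stub = {"role": "system",
                "content": f"[Compacted {len(old)} earlier messages; see design doc/logs.]"}
        return [stub] + recent

    # Usage: ten ~1,000-token turns compact to a stub plus the last six.
    history = [{"role": "user", "content": "x" * 4000} for _ in range(10)]
    assert len(compact_history(history)) == 7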

Benchmarks, “nerfs,” and trust

  • Users point out that Anthropic’s published SWE-bench numbers don’t match the public leaderboard and worry about overfitting and “benchmaxxing.”
  • Persistent suspicion that models launch in a “buffed” state and are quietly optimized/nerfed weeks later; calls for week-by-week evals and time-to-completion metrics, not just accuracy (a toy harness along those lines is sketched below).
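
The longitudinal evals commenters ask for are straightforward to run yourself. A minimal sketch, assuming a hypothetical run_task hook wired to your own harness: execute a fixed task suite weekly against the same model alias, record wall-clock time-to-completion alongside pass/fail, and append to JSONL so any quiet “nerf” shows up as a week-over-week trend.

    import datetime
    import json
    import time

    def run_task(model: str, task: dict) -> bool:
        # Hypothetical hook: run one task against `model`, return pass/fail.
        raise NotImplementedError("wire this to your harness or API client")

    def weekly_eval(model: str, tasks: list[dict],
                    out_path: str = "evals.jsonl") -> None:
        year, week, _ = datetime.date.today().isocalendar()
        with open(out_path, "a") as f:
            for task in tasks:
                start = time.monotonic()
                try:
                    passed = run_task(model, task)
                except Exception:
                    passed = False  # crashes and refusals count as failures
                f.write(json.dumps({
                    "year": year, "week": week, "model": model,
                    "task": task["id"], "passed": passed,
                    "seconds": round(time.monotonic() - start, 2),
                }) + "\n")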

Agents, long tasks, and computer use

  • The “30‑hour Slack clone” claim draws both excitement and skepticism: the result likely depends heavily on custom tools, guardrails, and context management not available to typical users.
  • Many report agents in Claude Code, Codex, etc. can do real damage (e.g., git reset --hard, reverting user changes) or get stuck in loops unless carefully constrained and supervised.
  • Desire for better logging, reproducibility, and determinism for agentic runs; current workflows are perceived as brittle “black boxes.” A sketch combining a destructive-command deny-list with JSONL run logging follows this list.
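
A sketch of both mitigations in one place, assuming a Python agent that shells out through a single chokepoint: a deny-list blocks obviously destructive commands before they run, and every tool call is appended to a JSONL transcript for auditing and replay. The deny-list entries and log schema are illustrative; real frameworks (e.g., Claude Code’s permission rules) have their own mechanisms.

    import json
    import shlex
    import subprocess
    import time

    DENY_PREFIXES = [
        ["git", "reset", "--hard"],
        ["git", "checkout", "--"],  # can silently revert user edits
        ["rm", "-rf"],
    ]

    def guarded_run(command: str, log_path: str = "agent_run.jsonl") -> str:
        # Refuse destructive commands, run the rest, and log the outcome.
        tokens = shlex.split(command)
        for prefix in DENY_PREFIXES:
            if tokens[:len(prefix)] == prefix:
                raise PermissionError(f"blocked destructive command: {command}")
        result = subprocess.run(tokens, capture_output=True, text=True, timeout=120)
        with open(log_path, "a") as f:
            f.write(json.dumps({"ts": time.time(), "cmd": command,
                                "rc": result.returncode,
                                "stdout": result.stdout[-2000:]}) + "\n")
        return result.stdout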

Tooling, APIs, and cost

  • Common practice is to abstract over multiple providers via OpenRouter, LiteLLM, AI SDK, Bedrock, or bespoke wrappers (e.g., alias‑based registries); a sketch of such a registry follows this list.
  • Anthropic is seen as noticeably more expensive than OpenAI/xAI/Qwen; some say that makes it unusable in tools like Cursor, pushing them to Grok Code Fast or Qwen coder models.
  • Regional pricing and subscription quirks (rate limits, pause bugs, missing ZIP upload, strict safety filters) are recurring pain points.
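
A minimal sketch of the alias-based registry pattern over LiteLLM: application code refers to stable role aliases, and a single table maps them to provider-specific model IDs, so switching providers on cost grounds is a one-line change. The aliases and model ID strings below are assumptions for illustration; check your provider’s documentation for current names.

    from litellm import completion

    MODEL_REGISTRY = {
        "planner": "gemini/gemini-2.5-pro",        # large context for planning
        "coder":   "anthropic/claude-sonnet-4-5",  # implementation work
        "cheap":   "openrouter/qwen/qwen3-coder",  # cost-sensitive bulk tasks
    }

    def ask(alias: str, prompt: str) -> str:
        model = MODEL_REGISTRY[alias]  # KeyError on an unknown alias is deliberate
        resp = completion(model=model, messages=[{"role": "user", "content": prompt}])
        return resp.choices[0].message.content

    # Usage mirrors the plan-then-implement split described above:
    # plan = ask("planner", "Outline a migration plan for our auth module.")
    # code = ask("coder", f"Implement step 1 of this plan:\n{plan}")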

Culture, alignment, and career impacts

  • Claude’s “You’re absolutely right!” tic and general sycophancy are widely discussed; some see it as an alignment tactic, others as harmful to user thinking.
  • Guardrails: Claude and ChatGPT often refuse sensitive or sexual/violent topics; Grok, Gemini, and open‑weight models are described as looser but riskier.
  • Emotional responses range from enthusiasm (“3x output, many new projects shipped”) to disillusionment (“AI coding doesn’t really work,” loss of craftsmanship, juniors being displaced).
  • Many argue that architecture, taste, debugging, and supervising agents remain central, even if rote coding is increasingly automated.