Claude Sonnet 4.5
Model positioning and Opus vs Sonnet
- Confusion over whether Opus is still Anthropic’s “best” model: benchmarks show Sonnet 4.5 surpassing Opus 4.1 on code and math, but many users still prefer Opus 4.1 in practice, especially for planning and strictness.
- Several expect an Opus 4.5 to re-establish the tiering; some Max subscribers feel they’re now “overpaying for Opus” if Sonnet is better.
Real‑world coding performance (Claude vs GPT‑5‑Codex vs others)
- Experiences are sharply split:
  - Some report Sonnet 4.5 + Claude Code as the best they’ve used: faster, more focused, strong at refactors, infra debugging, tests, and math-heavy tasks; code interpreter + tools can handle non‑trivial projects.
  - Others find it clearly worse than GPT‑5‑Codex: superficial changes, poor decision‑making, failure to reuse existing auth or harnesses, flaky tool use, and more “giving up” or overengineering.
- Gemini 2.5 Pro frequently praised for very large context and planning; often used to plan while Claude or Codex does implementation.
- Several note that all models degrade as the context window fills and require active context management (/context, /compact, /new, logs, design docs).
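
A minimal sketch of what budget-based context management amounts to, assuming a crude chars/4 token estimate and a drop-oldest policy; everything here is illustrative, and tools like /compact summarize older turns rather than discard them:

```python
def estimate_tokens(text: str) -> int:
    return max(1, len(text) // 4)  # crude heuristic, not a real tokenizer

def compact_history(messages: list[dict], budget: int = 8000) -> list[dict]:
    """Keep the system prompt; drop the oldest turns until under budget."""
    system, rest = messages[:1], messages[1:]
    while rest and sum(estimate_tokens(m["content"]) for m in system + rest) > budget:
        rest.pop(0)  # a real /compact would summarize here instead of dropping
    return system + rest
```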
Benchmarks, “nerfs,” and trust
- Users point out that Anthropic’s SWE-bench numbers don’t match the public leaderboard and worry about overfitting and “benchmaxxing.”
- Persistent suspicion that models launch in a “buffed” state and are quietly optimized/nerfed weeks later; calls for week‑by‑week evals and time‑to‑completion metrics, not just accuracy.
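
A hedged sketch of the harness commenters ask for: one dated record per run over a fixed task suite, capturing both pass rate and wall-clock time-to-completion so a quiet regression shows up as drift in either curve. The task shape (id, prompt, check) and the run_task callable are hypothetical:

```python
import time
from datetime import date
from statistics import median
from typing import Callable

def run_suite(model: str, tasks: list[dict],
              run_task: Callable[[str, str], str]) -> dict:
    """One dated record per run: pass rate plus wall-clock time per task."""
    results = []
    for task in tasks:
        start = time.monotonic()
        output = run_task(model, task["prompt"])    # caller supplies the harness
        results.append({
            "id": task["id"],
            "passed": bool(task["check"](output)),  # each task ships its own checker
            "seconds": round(time.monotonic() - start, 1),
        })
    return {
        "date": date.today().isoformat(),
        "model": model,
        "accuracy": sum(r["passed"] for r in results) / len(results),
        "median_seconds": median(r["seconds"] for r in results),
    }
```

Appending one such record per week and diffing them over time would surface both accuracy and latency regressions, not just headline scores.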
Agents, long tasks, and computer use
- The “30‑hour Slack clone” claim draws both excitement and skepticism: likely depends heavily on custom tools, guardrails, and context management not available to typical users.
- Many report agents in Claude Code, Codex, etc. can do real damage (e.g., git reset --hard, reverting user changes) or get stuck in loops unless carefully constrained and supervised.
- Desire for better logging, reproducibility, and determinism in agentic runs; current workflows are perceived as brittle “black boxes.”
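
As one illustration of the constraints users describe, a sketch of a denylist-plus-audit-log wrapper around agent shell calls; the patterns and log format are assumptions, not any tool’s actual configuration:

```python
import json, re, subprocess, time

# Illustrative denylist; real agent harnesses expose richer permission systems.
DESTRUCTIVE = [
    r"\bgit\s+reset\s+--hard\b",
    r"\brm\s+-rf\b",
    r"\bgit\s+checkout\s+--\s",
]

def guarded_run(cmd: str, log_path: str = "agent_audit.jsonl") -> str:
    """Refuse denylisted commands; log everything else for later replay."""
    if any(re.search(p, cmd) for p in DESTRUCTIVE):
        raise PermissionError(f"blocked destructive command: {cmd}")
    proc = subprocess.run(cmd, shell=True, capture_output=True, text=True)
    with open(log_path, "a") as log:  # append-only trail aids reproducibility
        log.write(json.dumps({"ts": time.time(), "cmd": cmd,
                              "returncode": proc.returncode}) + "\n")
    return proc.stdout
```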
Tooling, APIs, and cost
- Common practice is to abstract over multiple providers via OpenRouter, LiteLLM, AI SDK, Bedrock, or bespoke wrappers (e.g., alias‑based registries; see the sketch after this list).
- Anthropic is seen as noticeably more expensive than OpenAI/xAI/Qwen; some say that makes it unusable in tools like Cursor, pushing them to Grok Code Fast or Qwen coder models.
- Regional pricing and subscription quirks (rate limits, pause bugs, missing ZIP upload, strict safety filters) are recurring pain points.
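
A minimal sketch of the alias‑based registry idea from the first bullet above, written against LiteLLM’s OpenAI‑compatible completion call; the alias names and model identifier strings are illustrative assumptions, not anyone’s published config:

```python
from litellm import completion  # pip install litellm; mirrors the OpenAI schema

# Alias-based registry: callers name a role, not a vendor. Model identifier
# strings are illustrative and may not match current provider catalogs.
MODELS = {
    "planner": "gemini/gemini-2.5-pro",
    "coder": "anthropic/claude-sonnet-4-5",
    "cheap": "openrouter/qwen/qwen-2.5-coder-32b-instruct",
}

def ask(alias: str, prompt: str) -> str:
    resp = completion(model=MODELS[alias],
                      messages=[{"role": "user", "content": prompt}])
    return resp.choices[0].message.content
```

Swapping the "coder" alias to a cheaper provider is then a one‑line change in MODELS; call sites like ask("coder", ...) stay untouched, which is the point of the pattern when pricing differs this much across vendors.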
Culture, alignment, and career impacts
- Claude’s “You’re absolutely right!” tic and general sycophancy are widely discussed; some see it as an alignment tactic, others as harmful to user thinking.
- Guardrails: Claude and ChatGPT often refuse sensitive or sexual/violent topics; Grok, Gemini, and open‑weight models are described as looser but riskier.
- Emotional responses range from enthusiasm (“3x output, many new projects shipped”) to disillusionment (“AI coding doesn’t really work,” loss of craftsmanship, juniors being displaced).
- Many argue that architecture, taste, debugging, and supervising agents remain central, even if rote coding is increasingly automated.