Claude Sonnet 4.5

Model positioning and Opus vs Sonnet

  • Confusion over whether Opus is still Anthropic’s “best” model: benchmarks show Sonnet 4.5 surpassing Opus 4.1 on code and math, but many users still prefer Opus 4.1 in practice, especially for planning and strictness.
  • Several expect an Opus 4.5 to re-establish the tiering; some Max subscribers feel they’re now “overpaying for Opus” if Sonnet is better.

Real‑world coding performance (Claude vs GPT‑5‑Codex vs others)

  • Experiences are sharply split:
    • Some report Sonnet 4.5 + Claude Code as the best they’ve used: faster, more focused, strong at refactors, infra debugging, tests, and math-heavy tasks; code interpreter + tools can handle non‑trivial projects.
    • Others find it clearly worse than GPT‑5‑Codex: superficial changes, poor decision‑making, failure to reuse existing auth or harnesses, flaky tool use, and more “giving up” or overengineering.
  • Gemini 2.5 Pro is frequently praised for its very large context window and planning ability; a common workflow is to plan with Gemini while Claude or Codex handles implementation.
  • Several note that all models degrade as the context window fills and require active context management (/context, /compact, /new, logs, design docs); a minimal sketch of the idea follows this list.
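
A minimal Python sketch of the idea behind /compact-style context management: once a chat history’s estimated token count nears a budget, older turns are collapsed into a single summary stub so recent turns keep full fidelity. The 4-characters-per-token estimate, the budget, and the stub summary are illustrative assumptions, not any vendor’s actual implementation.

    def estimate_tokens(text: str) -> int:
        # Rough heuristic: ~4 characters per token for English text.
        return len(text) // 4

    def compact_history(messages: list[dict], budget: int = 6000,
                        keep_recent: int = 6) -> list[dict]:
        # Collapse older messages into one stub once the budget is exceeded.
        total = sum(estimate_tokens(m["content"]) for m in messages)
        if total <= budget or len(messages) <= keep_recent:
            return messages
        old, recent = messages[:-keep_recent], messages[-keep_recent:]
        # A real tool would summarize `old` with a model call; a stub stands in here.
        stub = {"role": "system",
                "content": f"[Compacted {len(old)} earlier messages; see design doc/logs.]"}
        return [stub] + recent

    # Usage: ten ~1,000-token turns compact to a stub plus the last six.
    history = [{"role": "user", "content": "x" * 4000} for _ in range(10)]
    assert len(compact_history(history)) == 7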

Benchmarks, “nerfs,” and trust

  • Users point out that Anthropic’s published SWE-bench numbers don’t match the public leaderboard and worry about overfitting and “benchmaxxing.”
  • Persistent suspicion that models launch in a “buffed” state and are quietly optimized/nerfed weeks later; calls for week-by-week evals and time-to-completion metrics, not just accuracy (a toy harness along those lines is sketched below).
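
The longitudinal evals commenters ask for are straightforward to run yourself. A minimal sketch, assuming a hypothetical run_task hook wired to your own harness: execute a fixed task suite weekly against the same model alias, record wall-clock time-to-completion alongside pass/fail, and append to JSONL so any quiet “nerf” shows up as a week-over-week trend.

    import datetime
    import json
    import time

    def run_task(model: str, task: dict) -> bool:
        # Hypothetical hook: run one task against `model`, return pass/fail.
        raise NotImplementedError("wire this to your harness or API client")

    def weekly_eval(model: str, tasks: list[dict],
                    out_path: str = "evals.jsonl") -> None:
        year, week, _ = datetime.date.today().isocalendar()
        with open(out_path, "a") as f:
            for task in tasks:
                start = time.monotonic()
                try:
                    passed = run_task(model, task)
                except Exception:
                    passed = False  # crashes and refusals count as failures
                f.write(json.dumps({
                    "year": year, "week": week, "model": model,
                    "task": task["id"], "passed": passed,
                    "seconds": round(time.monotonic() - start, 2),
                }) + "\n")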

Agents, long tasks, and computer use

  • The “30‑hour Slack clone” claim draws both excitement and skepticism: the result likely depends heavily on custom tools, guardrails, and context management not available to typical users.
  • Many report agents in Claude Code, Codex, etc. can do real damage (e.g., git reset --hard, reverting user changes) or get stuck in loops unless carefully constrained and supervised.
  • Desire for better logging, reproducibility, and determinism for agentic runs; current workflows are perceived as brittle “black boxes.” A sketch combining a destructive-command deny-list with JSONL run logging follows this list.
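
A sketch of both mitigations in one place, assuming a Python agent that shells out through a single chokepoint: a deny-list blocks obviously destructive commands before they run, and every tool call is appended to a JSONL transcript for auditing and replay. The deny-list entries and log schema are illustrative; real frameworks (e.g., Claude Code’s permission rules) have their own mechanisms.

    import json
    import shlex
    import subprocess
    import time

    DENY_PREFIXES = [
        ["git", "reset", "--hard"],
        ["git", "checkout", "--"],  # can silently revert user edits
        ["rm", "-rf"],
    ]

    def guarded_run(command: str, log_path: str = "agent_run.jsonl") -> str:
        # Refuse destructive commands, run the rest, and log the outcome.
        tokens = shlex.split(command)
        for prefix in DENY_PREFIXES:
            if tokens[:len(prefix)] == prefix:
                raise PermissionError(f"blocked destructive command: {command}")
        result = subprocess.run(tokens, capture_output=True, text=True, timeout=120)
        with open(log_path, "a") as f:
            f.write(json.dumps({"ts": time.time(), "cmd": command,
                                "rc": result.returncode,
                                "stdout": result.stdout[-2000:]}) + "\n")
        return result.stdout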

Tooling, APIs, and cost

  • Common practice is to abstract over multiple providers via OpenRouter, LiteLLM, AI SDK, Bedrock, or bespoke wrappers (e.g., alias‑based registries); a sketch of such a registry follows this list.
  • Anthropic is seen as noticeably more expensive than OpenAI/xAI/Qwen; some say that makes it unusable in tools like Cursor, pushing them to Grok Code Fast or Qwen coder models.
  • Regional pricing and subscription quirks (rate limits, pause bugs, missing ZIP upload, strict safety filters) are recurring pain points.
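
A minimal sketch of the alias-based registry pattern over LiteLLM: application code refers to stable role aliases, and a single table maps them to provider-specific model IDs, so switching providers on cost grounds is a one-line change. The aliases and model ID strings below are assumptions for illustration; check your provider’s documentation for current names.

    from litellm import completion

    MODEL_REGISTRY = {
        "planner": "gemini/gemini-2.5-pro",        # large context for planning
        "coder":   "anthropic/claude-sonnet-4-5",  # implementation work
        "cheap":   "openrouter/qwen/qwen3-coder",  # cost-sensitive bulk tasks
    }

    def ask(alias: str, prompt: str) -> str:
        model = MODEL_REGISTRY[alias]  # KeyError on an unknown alias is deliberate
        resp = completion(model=model, messages=[{"role": "user", "content": prompt}])
        return resp.choices[0].message.content

    # Usage mirrors the plan-then-implement split described above:
    # plan = ask("planner", "Outline a migration plan for our auth module.")
    # code = ask("coder", f"Implement step 1 of this plan:\n{plan}")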

Culture, alignment, and career impacts

  • Claude’s “You’re absolutely right!” tic and general sycophancy are widely discussed; some see it as an alignment tactic, others as harmful to user thinking.
  • Guardrails: Claude and ChatGPT often refuse sensitive or sexual/violent topics; Grok, Gemini, and open‑weight models are described as looser but riskier.
  • Emotional responses range from enthusiasm (“3x output, many new projects shipped”) to disillusionment (“AI coding doesn’t really work,” loss of craftsmanship, juniors being displaced).
  • Many argue that architecture, taste, debugging, and supervising agents remain central, even if rote coding is increasingly automated.