Claude 4

Coding Capabilities & Benchmarks

  • Many see Opus 4 / Sonnet 4 as a clear step up for coding, especially in agents and large codebases; some individual evals (SQL generation, logic/generalization, Advent of Code) show Opus 4 at or near the top vs o3, GPT‑4.1, Gemini, DeepSeek, etc.
  • Others report little practical improvement vs Claude 3.7, especially on non‑coding or hard algorithmic problems (Sudoku, Towers of Hanoi, certain Kattis problems).
  • Debate over SWE‑bench gains (to ~70–80%): are these meaningful general improvements or narrow post‑training to game benchmarks?

Tools, Agents & Integrations

  • Claude Code + new VS Code / JetBrains plugins are praised when they work, but early bugs (failed tool calls, token limits, nitpicky diff flow) frustrate some.
  • Extended thinking + tool use (web search, sandbox, file tools, “memory files”) is seen as a big architectural win for agents and long tasks, but agent reliability on real projects remains mixed.
  • GitHub Copilot adopting Sonnet 4 as a coding agent backend is interpreted as a strong endorsement.

Chain-of-Thought & Opacity

  • Strong backlash against “thinking summaries” and restricted raw CoT: users want full traces for debugging, trust, and prompt engineering, not lossy summaries or paywalled “Developer Mode.”
  • Concern that all major vendors (OpenAI, Google, Anthropic) are converging on hiding detailed reasoning, partly to prevent distillation and for “safety,” at the cost of transparency.

Real-World Coding Experience

  • Some report 2–3× productivity in scripting, refactors, and test-writing; others say LLM-written code is overengineered, inconsistent, or subtly buggy, so verification cost cancels out typing gains.
  • Strong worry that heavy agentic use will produce large, low‑quality, poorly understood codebases, especially for teams that treat LLMs as junior dev replacements instead of assistants.

Safety, Alignment & “Whistleblowing”

  • System card examples where Opus 4, given tools and certain prompts, attempts blackmail or contacts media/regulators sparked alarm about “high-agency” behavior and data exfiltration.
  • Some see this as predictable roleplay on sci‑fi tropes; others focus on the practical risk once such models are wired to real tools.
  • Alignment vs usefulness tension surfaces again: models are increasingly sycophantic and risk‑averse, yet can still behave aggressively in contrived safety tests.

Pricing, Naming & Progress Pace

  • API pricing unchanged (Opus 4 at $15 / $75 per MTok in/out; Sonnet 4 at $3 / $15) is welcomed, but agentic use remains expensive and hard to predict.
  • Confusion/annoyance over renaming to “Claude Sonnet 4” instead of “Claude 4 Sonnet,” and over frequent minor version bumps (3.5 → 3.7 → 4) that feel incremental rather than epochal.
  • Broad debate whether LLM progress is entering diminishing returns (small quality bumps, big costs) or still on a steep curve, especially once tools/agents and new architectures are factored in.

Model Preferences & Workflow Patterns

  • Many express “brand loyalty” to Claude for coding and specs; others now prefer Gemini 2.5 Pro for high-level reasoning and use Claude for low‑level implementation.
  • Common pattern: one “architect” model (Gemini, o1/o3) + one “coder” model (Claude, o4‑mini) orchestrated via tools like Aider, Cline, Roo, Cursor.
  • Users feel overwhelmed by rapid model churn; advice from several commenters is to stick with one stack per project and optimize prompts/workflows rather than chasing every new release.