GPT-5 vs. Sonnet: Complex Agentic Coding

Scope of Comparison: GPT‑5 vs Sonnet vs Opus

  • Several commenters wanted GPT‑5 compared to Claude Opus rather than Sonnet, arguing that in principle a “best vs. best” comparison matters more than price.
  • Others countered that Opus is effectively unusable for most engineers due to 10×+ API cost and huge token usage in agentic coding, making Sonnet the more realistic comparison point.
  • Some report Opus underperforming in GitHub Copilot while doing very well in Claude Code, suggesting the environment and prompts matter more than the raw model.

Pricing, Subsidies, and “Best Value”

  • Multiple people note that Anthropic and others are likely subsidizing usage; fixed‑price plans are described as “sweetheart deals.”
  • GitHub Copilot’s 1× / 10× multipliers were clarified as quota cost factors: a 10× model such as Opus burns ten times the premium‑request quota per call (see the worked sketch after this list).
  • Opinions differ on “best value”:
    • Copilot is seen as a good deal when an employer pays for it, and especially valuable for effectively “unlimited” GPT‑4.1 usage via VS Code’s LM API.
    • Others prefer Claude’s $20–$100 plans or pay‑per‑use via OpenRouter to mix models.
    • Many emphasize company budgets vs. individual ones: Opus is expensive out of pocket but cheap relative to an engineer’s salary.
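
To make the multiplier arithmetic concrete, here is a minimal sketch. The quota size and most multipliers below are illustrative assumptions, not official Copilot pricing; the thread only confirms Opus at 10× and GPT‑4.1 as effectively unlimited:

```python
# Illustrative quota math; the numbers are assumptions except Opus at 10x.
MONTHLY_QUOTA = 300  # hypothetical premium requests included in a plan
MULTIPLIERS = {"gpt-4.1": 0, "sonnet": 1, "opus": 10}  # quota cost per call

def requests_affordable(model: str, quota: int = MONTHLY_QUOTA) -> float:
    """How many calls to `model` fit in one month's quota."""
    cost = MULTIPLIERS[model]
    return float("inf") if cost == 0 else quota / cost

print(requests_affordable("sonnet"))  # 300.0
print(requests_affordable("opus"))    # 30.0 -- a 10x model drains quota 10x faster
```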

Tooling, Harnesses, and Native Environments

  • Strong consensus that “agenticness” is dominated by the harness: Claude Code, Cursor, Copilot, Codex, Roo Code, and various Neovim plugins give very different results with the same model.
  • Claude Code is widely praised for hooks, CLAUDE.md, layered “memory” files, and plan‑then‑build workflows; but it often ignores instructions and needs deterministic wrappers plus external linters/formatters (see the hook sketch after this list).
  • Copilot is polarizing: some call it “garbage,” others find it the best option across IDEs; several say its prompts and context management make the same models perform worse than they do in Claude Code or Cursor.
  • Many stress that models perform best in their native stacks (Claude ↔ Claude Code, GPT‑5 ↔ Codex/Copilot).
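
As a concrete example of the “deterministic wrapper” idea, below is a minimal PostToolUse hook sketch for Claude Code that re‑formats any Python file the agent touches. The stdin payload fields (tool_name, tool_input, file_path) follow Claude Code’s documented hook protocol but should be verified against current docs, and the choice of ruff as the formatter is an arbitrary assumption; any linter/formatter slots in the same way:

```python
#!/usr/bin/env python3
"""PostToolUse hook for Claude Code: re-format files the agent edits.

Register it in .claude/settings.json under hooks -> PostToolUse with a
matcher such as "Edit|Write". Field names below are assumptions based on
Claude Code's hook docs; verify before relying on them.
"""
import json
import subprocess
import sys

payload = json.load(sys.stdin)
file_path = payload.get("tool_input", {}).get("file_path", "")

if file_path.endswith(".py"):
    # Deterministic post-processing: the model cannot skip this step.
    result = subprocess.run(["ruff", "format", file_path],
                            capture_output=True, text=True)
    if result.returncode != 0:
        # Exit code 2 with stderr feeds the failure back to the agent.
        print(result.stderr, file=sys.stderr)
        sys.exit(2)
```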

Model Behavior & UX

  • GPT‑5 is described as a stronger planner, good at “thinking, then acting once,” but sometimes less creative, and in practice often slow, over‑thinking, and “doing the wrong thing.”
  • Sonnet / Opus “muddle through” with many small attempts and recoveries; they handle large real codebases better but are chattier and more token‑hungry, and Opus is seen as needing more babysitting.
  • Some users say GPT‑5 solved issues Claude couldn’t; others found Claude Code with Sonnet/Opus still more effective, and less prone to getting stuck, than GPT‑5 in Cursor or Codex.
  • Claude’s “You’re absolutely right” sycophancy annoys people; users work around it with custom instructions and memory, though adherence is imperfect.

Workflows, Hooks, and Guardrails

  • Advanced users enforce TDD and style via Claude Code hooks, pre/post commands, and custom guards; there’s excitement, but also surprise that this space remains under‑explored.
  • Suggested hybrid workflows: use a “smart planner” model (e.g., GPT‑5, Gemini) to create specs/plans, then a cheaper or more reliable agent (e.g., Sonnet via Claude Code, GPT‑4.1) to implement stepwise (sketched after this list).
  • Deterministic wrappers (no‑emoji filters, mandatory format/lint hooks) are considered essential for reliability.
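
A minimal sketch of that planner‑then‑implementer split, assuming the public openai and anthropic Python SDKs; the model names are placeholders, and a real harness would add file I/O, tests, and retries:

```python
"""Hybrid workflow sketch: a planner model writes numbered steps, a
second model implements them one at a time. Not a production harness."""
from openai import OpenAI
from anthropic import Anthropic

planner = OpenAI()         # reads OPENAI_API_KEY from the environment
implementer = Anthropic()  # reads ANTHROPIC_API_KEY

def plan(task: str) -> list[str]:
    """Ask the planner for small, ordered implementation steps."""
    resp = planner.chat.completions.create(
        model="gpt-5",  # placeholder model name
        messages=[{"role": "user", "content":
                   f"Break this task into numbered implementation steps, "
                   f"one per line:\n{task}"}],
    )
    return [ln for ln in resp.choices[0].message.content.splitlines() if ln.strip()]

def implement(step: str, context: str) -> str:
    """Hand one step to the implementer with the work done so far."""
    msg = implementer.messages.create(
        model="claude-sonnet-4-20250514",  # placeholder model name
        max_tokens=4096,
        messages=[{"role": "user", "content":
                   f"Context so far:\n{context}\n\nImplement this step:\n{step}"}],
    )
    return msg.content[0].text

def run(task: str) -> str:
    context = ""
    for step in plan(task):
        context += "\n" + implement(step, context)
    return context
```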

Evaluation Skepticism and Subjectivity

  • Many note that anecdotal “model X is better” claims vary wildly by task, language, and tool; non‑determinism and prompt differences make comparisons noisy.
  • Commenters criticize the article’s methodology as “pure vibe,” noting the results are highly sensitive to transient latency and to Copilot‑specific tuning.
  • There’s concern about blurred lines between technical reviews and marketing, and recognition that no robust, trusted benchmarks yet capture frontier‑model differences for agentic coding.