GPT-5 vs. Sonnet: Complex Agentic Coding

Scope of Comparison: GPT‑5 vs Sonnet vs Opus

  • Several commenters wanted GPT‑5 compared to Claude Opus rather than Sonnet, arguing that in principle a “best vs. best” comparison matters more than price.
  • Others countered that Opus is effectively unusable for most engineers due to 10×+ API cost and huge token usage in agentic coding, making Sonnet the more realistic comparison point.
  • Some report Opus underperforming in GitHub Copilot while doing very well in Claude Code, suggesting the environment and prompts matter more than the raw model.

Pricing, Subsidies, and “Best Value”

  • Multiple people note that Anthropic and others are likely subsidizing usage; fixed‑price plans are described as “sweetheart deals.”
  • GitHub Copilot’s 1× / 10× multipliers were clarified as quota cost factors: a 10× model such as Opus burns ten times the premium‑request quota per call (see the worked sketch after this list).
  • Opinions differ on “best value”:
    • Copilot is seen as a good deal when an employer pays for it, and especially valuable for effectively “unlimited” GPT‑4.1 usage via VS Code’s LM API.
    • Others prefer Claude’s $20–$100 plans or pay‑per‑use via OpenRouter to mix models.
    • Many emphasize company budgets vs. individual ones: Opus is expensive out of pocket but cheap relative to an engineer’s salary.
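
To make the multiplier arithmetic concrete, here is a minimal sketch. The quota size and most multipliers below are illustrative assumptions, not official Copilot pricing; the thread only confirms Opus at 10× and GPT‑4.1 as effectively unlimited:

```python
# Illustrative quota math; the numbers are assumptions except Opus at 10x.
MONTHLY_QUOTA = 300  # hypothetical premium requests included in a plan
MULTIPLIERS = {"gpt-4.1": 0, "sonnet": 1, "opus": 10}  # quota cost per call

def requests_affordable(model: str, quota: int = MONTHLY_QUOTA) -> float:
    """How many calls to `model` fit in one month's quota."""
    cost = MULTIPLIERS[model]
    return float("inf") if cost == 0 else quota / cost

print(requests_affordable("sonnet"))  # 300.0
print(requests_affordable("opus"))    # 30.0 -- a 10x model drains quota 10x faster
```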

Tooling, Harnesses, and Native Environments

  • Strong consensus that “agenticness” is dominated by the harness: Claude Code, Cursor, Copilot, Codex, Roo Code, and various Neovim plugins give very different results with the same model.
  • Claude Code is widely praised for hooks, CLAUDE.md, layered “memory” files, and plan‑then‑build workflows; but it often ignores instructions and needs deterministic wrappers plus external linters/formatters (see the hook sketch after this list).
  • Copilot is polarizing: some call it “garbage,” others find it the best option across IDEs; several say its prompts and context management make the same models perform worse than they do in Claude Code or Cursor.
  • Many stress that models perform best in their native stacks (Claude ↔ Claude Code, GPT‑5 ↔ Codex/Copilot).
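
As a concrete example of the “deterministic wrapper” idea, below is a minimal PostToolUse hook sketch for Claude Code that re‑formats any Python file the agent touches. The stdin payload fields (tool_name, tool_input, file_path) follow Claude Code’s documented hook protocol but should be verified against current docs, and the choice of ruff as the formatter is an arbitrary assumption; any linter/formatter slots in the same way:

```python
#!/usr/bin/env python3
"""PostToolUse hook for Claude Code: re-format files the agent edits.

Register it in .claude/settings.json under hooks -> PostToolUse with a
matcher such as "Edit|Write". Field names below are assumptions based on
Claude Code's hook docs; verify before relying on them.
"""
import json
import subprocess
import sys

payload = json.load(sys.stdin)
file_path = payload.get("tool_input", {}).get("file_path", "")

if file_path.endswith(".py"):
    # Deterministic post-processing: the model cannot skip this step.
    result = subprocess.run(["ruff", "format", file_path],
                            capture_output=True, text=True)
    if result.returncode != 0:
        # Exit code 2 with stderr feeds the failure back to the agent.
        print(result.stderr, file=sys.stderr)
        sys.exit(2)
```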

Model Behavior & UX

  • GPT‑5 is described as a stronger planner, good at “thinking, then acting once,” but sometimes less creative, and in practice often slow, over‑thinking, and “doing the wrong thing.”
  • Sonnet / Opus “muddle through” with many small attempts and recoveries; they handle large real codebases better but are chattier and more token‑hungry, and Opus is seen as needing more babysitting.
  • Some users say GPT‑5 solved issues Claude couldn’t; others found Claude Code with Sonnet/Opus still more effective, and less prone to getting stuck, than GPT‑5 in Cursor or Codex.
  • Claude’s “You’re absolutely right” sycophancy annoys people; users work around it with custom instructions and memory, though adherence is imperfect.

Workflows, Hooks, and Guardrails

  • Advanced users enforce TDD and style via Claude Code hooks, pre/post commands, and custom guards; there’s excitement, but also surprise that this space remains under‑explored.
  • Suggested hybrid workflows: use a “smart planner” model (e.g., GPT‑5, Gemini) to create specs/plans, then a cheaper or more reliable agent (e.g., Sonnet via Claude Code, GPT‑4.1) to implement stepwise (sketched after this list).
  • Deterministic wrappers (no‑emoji filters, mandatory format/lint hooks) are considered essential for reliability.
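
A minimal sketch of that planner‑then‑implementer split, assuming the public openai and anthropic Python SDKs; the model names are placeholders, and a real harness would add file I/O, tests, and retries:

```python
"""Hybrid workflow sketch: a planner model writes numbered steps, a
second model implements them one at a time. Not a production harness."""
from openai import OpenAI
from anthropic import Anthropic

planner = OpenAI()         # reads OPENAI_API_KEY from the environment
implementer = Anthropic()  # reads ANTHROPIC_API_KEY

def plan(task: str) -> list[str]:
    """Ask the planner for small, ordered implementation steps."""
    resp = planner.chat.completions.create(
        model="gpt-5",  # placeholder model name
        messages=[{"role": "user", "content":
                   f"Break this task into numbered implementation steps, "
                   f"one per line:\n{task}"}],
    )
    return [ln for ln in resp.choices[0].message.content.splitlines() if ln.strip()]

def implement(step: str, context: str) -> str:
    """Hand one step to the implementer with the work done so far."""
    msg = implementer.messages.create(
        model="claude-sonnet-4-20250514",  # placeholder model name
        max_tokens=4096,
        messages=[{"role": "user", "content":
                   f"Context so far:\n{context}\n\nImplement this step:\n{step}"}],
    )
    return msg.content[0].text

def run(task: str) -> str:
    context = ""
    for step in plan(task):
        context += "\n" + implement(step, context)
    return context
```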

Evaluation Skepticism and Subjectivity

  • Many note that anecdotal “model X is better” claims vary wildly by task, language, and tool; non‑determinism and prompt differences make comparisons noisy.
  • Commenters criticize the article’s methodology as “pure vibe,” noting the results are highly sensitive to transient latency and to Copilot‑specific tuning.
  • There’s concern about blurred lines between technical reviews and marketing, and recognition that no robust, trusted benchmarks yet capture frontier‑model differences for agentic coding.