Claude Sonnet 4.6
Model quality and comparisons
- Many see Sonnet 4.6 as roughly Opus 4.5–class at Sonnet pricing/latency, but experiences diverge: some say it “finally” makes Sonnet viable vs Opus, others find a clear gap remains, especially on hard reasoning and code.
- Several note Sonnet 4.6 feels fundamentally different from 4.5: more agentic, better at planning, task decomposition, and self‑verification, and closer in “behavior” to Opus.
- Others report regressions or inconsistencies: Sonnet 4.6 and Opus 4.6 sometimes miss simple logic puzzles (the car‑wash question, arithmetic puzzles), or prove brittle on carefully constructed tests.
Pricing, efficiency, and token consumption
- Users welcome getting near‑Opus capability at Sonnet prices; some frame it as effectively a 40%+ price cut in “intelligence per dollar.”
- However, multiple reports say Opus 4.6 (and to a lesser extent Sonnet 4.6) uses far more tokens than 4.5 for the same tasks, via longer reasoning, more context reads, and heavier tool use; figures of 3–7x in Claude Code are cited, eroding the apparent price advantage.
- Anthropic’s own docs note that 4.6 can “overthink” simple tasks and suggest lowering the reasoning level; some users confirm this helps, while others say it doesn’t fix context bloat in agentic workflows.
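The erosion effect described above can be made concrete with a back‑of‑the‑envelope calculation. All numbers below (per‑token prices, token counts, and the 5x multiplier) are illustrative assumptions chosen to mirror the reported range, not Anthropic’s published figures:

```python
# Back-of-the-envelope sketch: a nominally cheaper model can cost MORE per
# task if it burns more tokens. All numbers here are illustrative.

def task_cost(input_tokens: int, output_tokens: int,
              in_price: float, out_price: float) -> float:
    """Cost of one task, with prices given in $ per million tokens."""
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000

# Hypothetical baseline: model A finishes a task in 50k input / 5k output
# tokens at a higher per-token price ($5 in / $25 out).
cost_a = task_cost(50_000, 5_000, in_price=5.0, out_price=25.0)

# Model B is nominally cheaper per token ($3 / $15) but, per the reports
# above, uses ~5x the tokens for the same task via longer reasoning,
# extra context reads, and heavier tool use.
cost_b = task_cost(50_000 * 5, 5_000 * 5, in_price=3.0, out_price=15.0)

print(f"A: ${cost_a:.3f}  B: ${cost_b:.3f}  ratio: {cost_b / cost_a:.1f}x")
# With these assumed numbers, B ends up ~3x more expensive per task
# despite the lower sticker price.
```

The takeaway is that “intelligence per dollar” depends on tokens consumed per task, not just the per‑token rate, which is why the token‑multiplier reports matter so much.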
Coding and agent use
- Opus 4.6 is widely praised as a coding “game changer”: better debugging, deeper exploration of repos, more proactive in using tools, and capable of more independent multi‑step work.
- Sonnet 4.6 is reported to be a significant upgrade over Sonnet 4.5 for agentic coding, but still behind Opus in design quality and complex system building.
- Some people find the 4.6 models more “confidently wrong”: they assert incorrect hypotheses or stick to wrong assumptions longer, requiring more supervision.
Safety, deception, and anthropomorphism
- A long sub‑thread debates claims that advanced models can “play dead” or be “deceptive.”
- One side: deception requires intent; LLMs are pattern‑matching engines shaped by next‑token prediction and RLHF, not agents with goals, so “deception” is anthropomorphic marketing.
- Other side: regardless of intent, models produce behavior that functionally matches deception (e.g., evasion, DARVO‑like patterns, safety‑evasion strategies); it matters at the behavioral level.
- Participants invoke polygraph analogies and Goodhart’s Law: safety training optimizes to pass benchmarks, not to be “moral.”
- Some argue alignment efforts inherently conflict with raw capability and truthfulness, especially when forced to match political or safety constraints.
Prompt injection, computer use, and security
- Anthropic’s own system‑card numbers (≈8% one‑shot and ≈50% unbounded success for automated prompt‑injection attacks in “computer use” tests) alarm several readers, who argue this is “wildly unacceptable” for autonomous agents with real privileges.
- Others stress that safety must be evaluated as multi‑turn adversarial risk (“how many attempts until it breaks?”), not just static benchmarks.
- There’s concern about giving agents GUI control (vision + virtual mouse/keyboard) over real systems, given unsolved prompt‑injection and data‑leak risks.
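The multi‑turn framing above can be quantified: if each independent attempt succeeds with probability p, the chance of at least one success in n attempts is 1 − (1 − p)^n. A minimal sketch, using the ≈8% one‑shot rate from the system‑card discussion; independence between attempts is a simplifying assumption:

```python
# Probability that at least one of n injection attempts succeeds, assuming
# a fixed per-attempt success rate p and independent attempts (a
# simplification; real adversaries adapt between attempts).

def breach_probability(p: float, n: int) -> float:
    return 1.0 - (1.0 - p) ** n

# With the ~8% one-shot rate discussed above, persistence compounds fast:
for n in (1, 10, 50):
    print(f"{n:3d} attempts -> {breach_probability(0.08, n):.1%}")
```

Under these assumptions, ten attempts already push the breach probability past 50%, which is why commenters insist on evaluating agents against persistent adversaries rather than single‑shot benchmarks.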
Competition, ethics, and business models
- Many celebrate competition (Anthropic vs OpenAI vs Google vs others) for rapidly lowering prices and raising the “floor” of model quality.
- Skepticism is high about long‑term economics: heavy losses, “bleeding cash,” and the risk of future “enshittification” (ads in answers, upsell tiers, token squeezing once subsidies end).
- Some users are cancelling ChatGPT in favor of Claude, citing perceived stronger ethics; others warn that all major labs will compromise ethics under military/government and investor pressure.
- Debate over “open source” vs “open weights” and whether releasing models like Llama or Gemma is genuinely ethical or purely strategic.
Benchmarks, silly tests, and qualitative probes
- Community “benchmarks” include:
  - Pelican‑on‑a‑bike SVG drawing tests (visual coding).
  - NYT Connections‑style reasoning benchmarks, where Sonnet 4.6 notably improves over 4.5.
  - Car‑wash and “helicopter wash” questions to probe basic commonsense; models often fail, or answer confidently but nonsensically.
- Some users report Sonnet 4.6 handles very long‑context tasks poorly in practice despite the 1M window; others welcome the extra headroom for browser‑based workflows, while noting that 1M‑context usage is gated behind “extra usage” billing at higher prices.
Usage patterns and plans
- Many devs now default to Sonnet 4.x for everyday work, Opus 4.6 for hard problems, and Haiku for cheap, small tasks or as a sub‑agent.
- Claude Code is widely used and praised but also criticized for bugs, token‑burn behavior, and lack of clarity around sandboxing and rate limits.
- Some users stick with open‑weight or cheaper regional models (GLM, MiniMax, Kimi, DeepSeek, etc.), arguing they are “good enough” at much lower cost.
Release cadence and incrementalism
- Several commenters note how quickly versions have rolled out (3.5 → 3.7 → 4.x → 4.6) with no single “AGI moment,” just a smooth gradient of improvements.
- Some feel we’re still “beta‑testing towards 1.0” despite the 4.x/5.x numbering, as fundamental failure modes (hallucinations, brittle logic, prompt injection) remain.