2024-06-27

Claude 3.5 Sonnet

Overall impressions & model comparisons

Many commenters find Claude 3.5 Sonnet extremely strong, often preferring it over GPT‑4/4o for coding, data-heavy tasks, and “human-like” language.
Others report the opposite: GPT‑4o feels more capable, especially for assistant-style reasoning and calculus/physics; experiences are clearly mixed.
Some see Sonnet as slightly ahead of GPT‑4o on coding and extraction from long documents; Gemini is mentioned for much larger context windows.
Benchmarks are viewed skeptically: several note that leaderboard scores don’t match their day‑to‑day experience.

Coding ability & tools

Strong praise for Claude 3.5 Sonnet as a coding assistant: “junior engineer or better,” and very fast at prototyping, refactoring, infrastructure planning, Dockerization, tests, and docs.
Works especially well on greenfield tasks or small to medium codebases; less reliable when the task is deeply entangled with a large existing system or requires modern, idiomatic framework patterns.
Users mention workflows with IDE integrations and agents (Cursor, Cody, Aider, Sweep, custom bots) and note that semi‑autonomous PR agents are still mediocre (~25% success on SWE‑bench).

Reasoning, math, and consistency

Some say Claude is better at careful, step‑by‑step reasoning and ambiguity handling; others show math/physics prompts where Claude fails and GPT is correct.
A recurring theme is Claude 3.5’s improved consistency: fewer wild swings in quality once a good prompt style is found.

UX, pricing, and limits

Claude Pro’s opaque usage limits frustrate users; message caps are token‑dependent and capacity‑dependent, which feels unpredictable.
OpenAI’s consumer products also have caps and dynamic throttling; both sides are criticized for lack of transparency.
Projects (persistent context with files/instructions) and Artifacts are seen as major productivity features; some wish for repo integration and voice interfaces.
Account creation has friction: the phone-number requirement and the blocking of Google Voice numbers turn some users away.

Safety, bans, and reliability

Some accounts are auto‑banned with little explanation; appeal flows exist but are slow or inconsistent.
Claude’s safety filters are stricter than GPT’s in some areas (e.g., code obfuscation), which some see as overreach.
Occasional dangerous suggestions (e.g., rm -rf on keyring data) show that safety and caution are still imperfect.

Broader impacts

Strong sense that modern LLMs dramatically accelerate experienced developers, especially on side projects.
Debate over whether this threatens software jobs or mainly raises the bar for developers who can direct and verify AI‑generated code.
