Gemini 2.5 Pro vs. Claude 3.7 Sonnet: Coding Comparison

Access, Pricing, and Context Window

  • Gemini 2.5 Pro is now available free via Google’s web interface and AI Studio; some users note region or app limitations and mention Google One “AI Premium” as another route.
  • The 1M-token context window draws attention; people distinguish between the theoretical maximum and effective recall.
  • Several report Gemini models doing very well on “needle in a haystack” retrieval (a minimal sketch of such a test follows this list), but others question whether vendor benchmarks have been independently replicated and point to more challenging long-context tests than single-fact retrieval.
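For concreteness, a “needle in a haystack” test hides one fact inside a long filler document and checks whether the model can quote it back. Below is a minimal sketch in Python; `call_model` is a hypothetical stand-in for whatever chat API is being evaluated.

```python
import random

def build_haystack(needle: str, filler_paragraph: str, total_paragraphs: int = 2000) -> str:
    """Hide a single 'needle' sentence at a random position inside repeated filler text."""
    paragraphs = [filler_paragraph] * total_paragraphs
    paragraphs.insert(random.randrange(total_paragraphs), needle)
    return "\n\n".join(paragraphs)

def run_needle_test(call_model, needle_answer: str = "7319") -> bool:
    """call_model(prompt: str) -> str is a hypothetical wrapper around any chat API."""
    needle = f"The secret access code for the archive is {needle_answer}."
    filler = "The committee met again and postponed the decision to the following quarter."
    prompt = (
        "The document below contains a secret access code. "
        "Reply with the code and nothing else.\n\n"
        + build_haystack(needle, filler)
    )
    return needle_answer in call_model(prompt)
```

Tests in this style are easy to reproduce in spirit, which is partly why skeptics ask for harder long-context evaluations than single-fact lookup.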

Coding Quality: Gemini vs Claude vs Others

  • Experiences are highly split:
    • Some find Gemini 2.5 Pro strongest at writing new code from scratch and at complex reasoning over code, including finding subtle threading/logic issues (see the race-condition sketch after this list).
    • Many others say it’s worse than Claude 3.5/3.7 (and far behind o1 Pro) for everyday coding and large refactors, especially on 50k+ token prompts.
  • Common Gemini complaints: it touches unrelated code, refactors aggressively, obsesses over “advanced” changes (e.g., removing the GIL, OpenMP, optimizations that hurt common cases), omits subroutines or replaces them with stubs/comments, or writes “same as before” instead of full code.
  • Claude 3.7 is often described as more agentic but less obedient than 3.5, prone to overediting, chasing linter issues, or rewriting whole modules when only a move/rename was requested.
  • o1 Pro is widely regarded as best for hard debugging, but too expensive for many.
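As a concrete (hypothetical, not taken from the thread) example of the “subtle threading issue” category mentioned above, the snippet below shows a classic lost-update race that a reviewing model would ideally catch, alongside the lock-based fix:

```python
import threading
import time

counter = 0
lock = threading.Lock()

def unsafe_increment(iterations: int) -> None:
    """Unlocked read-modify-write: increments are lost when threads interleave."""
    global counter
    for _ in range(iterations):
        current = counter       # read
        time.sleep(0)           # explicit yield, standing in for any GIL release point
        counter = current + 1   # write back a possibly stale value

def safe_increment(iterations: int) -> None:
    """Same logic, but the lock makes the read-modify-write atomic."""
    global counter
    for _ in range(iterations):
        with lock:
            counter += 1

threads = [threading.Thread(target=unsafe_increment, args=(10_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)  # usually well below 40_000 with the unsafe version
```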

Greenfield Demos vs Real Projects

  • Many criticize the article’s tests as “toy” greenfield tasks (games, small apps) that any strong model can handle.
  • Multiple commenters say the real challenge is modifying large, messy existing codebases, respecting constraints, and not exploding tech debt.
  • Several demand benchmarks that involve adding features to, or porting, real OSS projects (e.g., porting a GTK3 UI layer to GTK4), with one maintainer explicitly offering such a task as a “can LLMs really code?” benchmark.

Tooling, Prompts, and Temperature

  • Results vary strongly with tooling:
    • Claude Code, Cursor, Windsurf, Aider, Cline, Roo, and MCP-based setups all get mentioned; some tools seem tuned for Claude and underutilize Gemini.
    • Users suggest diff-only edit formats and tailored system prompts, low temperature (~0–0.4) for reliable edits, and using AI mainly as an “intern + reviewer” rather than for full rewrites (a sketch follows this list).
  • Feeding in up-to-date docs and using repo-“flattening” scripts are reported to dramatically improve behavior, especially for non-mainstream APIs and libraries (a flattening sketch also follows).
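On the diff-only/low-temperature point, here is a minimal sketch using the Anthropic Python SDK; the model ID and system prompt are illustrative, and the same idea (constrain the output format, lower the temperature) applies to Gemini or any other API that exposes a temperature parameter.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Illustrative system prompt: ask for minimal, diff-only edits.
SYSTEM = (
    "You are a code-editing assistant. Return ONLY a unified diff for the requested "
    "change. Do not touch unrelated code, do not refactor, do not add commentary."
)

def request_patch(task: str, file_contents: str) -> str:
    """Ask for a patch at low temperature so edits stay small and repeatable."""
    response = client.messages.create(
        model="claude-3-7-sonnet-20250219",  # illustrative model ID
        max_tokens=4096,
        temperature=0.2,                     # within the ~0-0.4 range suggested above
        system=SYSTEM,
        messages=[{"role": "user", "content": f"{task}\n\nFILE:\n{file_contents}"}],
    )
    return response.content[0].text
```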
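And a minimal sketch of the repo-“flattening” idea: concatenate a project’s source files into one document with path headers so it can be pasted into a long-context prompt. The file extensions, skip list, and size cap below are arbitrary illustrative choices.

```python
from pathlib import Path

EXTENSIONS = {".py", ".ts", ".md", ".toml"}  # illustrative: which files to include
SKIP_DIRS = {".git", "node_modules", "__pycache__", ".venv"}
MAX_BYTES = 100_000                          # skip generated or very large files

def flatten_repo(root: str) -> str:
    """Concatenate source files under `root` into one prompt-friendly document."""
    chunks = []
    for path in sorted(Path(root).rglob("*")):
        if any(part in SKIP_DIRS for part in path.parts):
            continue
        if path.is_file() and path.suffix in EXTENSIONS and path.stat().st_size <= MAX_BYTES:
            text = path.read_text(encoding="utf-8", errors="replace")
            chunks.append(f"===== {path.relative_to(root)} =====\n{text}")
    return "\n\n".join(chunks)

if __name__ == "__main__":
    print(flatten_repo("."))  # pipe into a file or straight into a prompt
```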

Safety, Refusals, and Model Personality

  • Gemini sometimes refuses risky or “sloppy” solutions (SQL DELETEs, insecure networking, routing hacks), even ending a session with firm disclaimers.
  • Some appreciate this pushback as more honest than models that “yes-man” bad ideas; others see it as overbearing and want an override.

Hype, Benchmarks, and Overall View

  • Several call the blog post biased marketing, note overblown language, and warn against extrapolating broad claims from a few hand-picked examples.
  • Benchmarks like SWE-Bench, aider’s coding leaderboard, LM Arena, etc. are referenced, but differences between top models are seen as incremental, not decisive.
  • A recurring theme: for most developers, any major provider’s top model is “good enough”; intelligence feels commoditized, and the real moat is tooling and integration.
  • Many remain skeptical of claims that LLMs will soon replace most software engineers; they see them as powerful assistants for well-scoped tasks, but poor at sustained, large-scale, real-world coding without heavy human guidance.