Gemini 2.5 Pro vs. Claude 3.7 Sonnet: Coding Comparison
Access, Pricing, and Context Window
- Gemini 2.5 Pro is now available free via Google’s web interface and AI Studio; some users note region or app limitations and mention Google One “AI Premium” as another route.
- The 1M-token context window draws attention; people distinguish between the theoretical maximum and the effective recall the model actually achieves at that length.
- Several report Gemini models doing very well on “needle in a haystack” retrieval, but others question whether vendor benchmarks have been independently replicated and point to harder long-context tests that go beyond simple retrieval (a minimal version of such a test is sketched below).
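The threads don’t spell out how these retrieval tests are built, but the basic recipe is simple enough to sketch. Below is a minimal, provider-agnostic version in Python; `ask_model` is a placeholder for whatever API wrapper you use, and the filler sentence, needle, and question are illustrative assumptions, not anyone’s published benchmark.

```python
import random

def build_haystack(needle: str, filler_sentence: str,
                   total_sentences: int = 5000, seed: int = 0) -> str:
    """Bury a single 'needle' fact at a random position inside repetitive filler text."""
    rng = random.Random(seed)
    sentences = [filler_sentence] * total_sentences
    sentences.insert(rng.randrange(total_sentences), needle)
    return " ".join(sentences)

def recalled(ask_model, needle_answer: str, haystack: str, question: str) -> bool:
    """ask_model is any callable (prompt -> str); returns True if the needle was recalled."""
    prompt = f"{haystack}\n\nQuestion: {question}\nAnswer with the exact value only."
    return needle_answer.lower() in ask_model(prompt).lower()

# Example usage with a hypothetical client wrapper (not a real SDK call):
# ok = recalled(my_gemini_call,
#               needle_answer="7421",
#               haystack=build_haystack(
#                   needle="The secret deployment code is 7421.",
#                   filler_sentence="The quick brown fox jumps over the lazy dog."),
#               question="What is the secret deployment code?")
```

Effective recall is then the fraction of runs (varying the needle position and total length) where `recalled` returns True, which is a much weaker claim than “supports 1M tokens.”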
Coding Quality: Gemini vs Claude vs Others
- Experiences are highly split:
  - Some find Gemini 2.5 Pro strongest at writing new code from scratch and complex reasoning over code, including finding subtle threading/logic issues.
  - Many others say it’s worse than Claude 3.5/3.7 (and far behind o1 Pro) for everyday coding and large refactors, especially on 50k+ token prompts.
- Common Gemini complaints: it touches unrelated code, refactors aggressively, obsesses over “advanced” changes (e.g., removing the GIL, OpenMP, optimizations that hurt common cases), omits subroutines or replaces them with stubs/comments, or writes “same as before” instead of full code.
- Claude 3.7 is often described as more agentic but less obedient than 3.5, prone to overediting, chasing linter issues, or rewriting whole modules when only a move/rename was requested.
- o1 Pro is widely regarded as best for hard debugging, but too expensive for many.
Greenfield Demos vs Real Projects
- Many criticize the article’s tests as “toy” greenfield tasks (games, small apps) that any strong model can handle.
- Multiple commenters say the real challenge is modifying large, messy existing codebases, respecting constraints, and not exploding tech debt.
- Several demand benchmarks that involve adding features or ports in real OSS projects (e.g., porting a GTK3 UI layer to GTK4), with one maintainer explicitly offering such a task as a “can LLMs really code?” benchmark.
Tooling, Prompts, and Temperature
- Results vary strongly with tooling:
  - Claude Code, Cursor, Windsurf, Aider, Cline, Roo, and MCP-based setups all get mentioned; some tools seem tuned for Claude and underutilize Gemini.
- Users suggest diff-only edit formats and strict system prompts, low temperature (~0–0.4) for reliable edits, and using AI mainly as an “intern + reviewer” rather than for full rewrites (see the sketch after this list).
- Feeding the model up-to-date docs and using repo-“flattening” scripts is reported to dramatically improve behavior, especially for non-mainstream APIs and libraries (a rough flattening sketch is also shown below).
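As a concrete illustration of the low-temperature, diff-only setup suggested above, here is a minimal sketch using the Anthropic Python SDK; the system-prompt wording is illustrative and the model ID should be checked against current docs. The same pattern (strict system prompt plus temperature around 0–0.4) applies to Gemini through its own SDK.

```python
import anthropic

client = anthropic.Anthropic()  # expects ANTHROPIC_API_KEY in the environment

SYSTEM = (
    "You are a careful code editor. Return ONLY a unified diff against the files "
    "provided. Do not touch unrelated code, do not refactor, do not add features."
)

def request_edit(task: str, file_contents: str) -> str:
    """Ask for a minimal diff for one well-scoped task; returns the raw model output."""
    resp = client.messages.create(
        model="claude-3-7-sonnet-20250219",  # verify against the provider's current model list
        max_tokens=4096,
        temperature=0.2,                     # low temperature favors minimal, repeatable edits
        system=SYSTEM,
        messages=[{"role": "user", "content": f"{task}\n\n---\n{file_contents}"}],
    )
    return resp.content[0].text
```

The point is less the exact wording than constraining the output to a diff, so the model cannot silently rewrite unrelated code.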
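The “repo flattening” scripts mentioned in the last bullet are typically just concatenation with per-file path headers, so the model sees the whole codebase as one document. A rough sketch, with arbitrary choices for which files to keep or skip:

```python
from pathlib import Path

SKIP_DIRS = {".git", "node_modules", "__pycache__", ".venv"}
KEEP_SUFFIXES = {".py", ".md", ".toml", ".ts", ".js"}  # adjust to your project

def flatten_repo(root: str, out_file: str = "repo_flat.txt") -> None:
    """Concatenate selected source files into one file, each prefixed by its relative path."""
    chunks = []
    for path in sorted(Path(root).rglob("*")):
        if any(part in SKIP_DIRS for part in path.parts):
            continue
        if path.is_file() and path.suffix in KEEP_SUFFIXES:
            text = path.read_text(encoding="utf-8", errors="replace")
            chunks.append(f"===== {path.relative_to(root)} =====\n{text}")
    Path(out_file).write_text("\n\n".join(chunks), encoding="utf-8")

# flatten_repo(".")  # then attach repo_flat.txt (plus current docs) as context for the model
```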
Safety, Refusals, and Model Personality
- Gemini sometimes refuses risky or “sloppy” solutions (SQL DELETEs, insecure networking, routing hacks), even ending a session with firm disclaimers.
- Some appreciate this pushback as more honest than models that “yes-man” bad ideas; others see it as overbearing and want an override.
Hype, Benchmarks, and Overall View
- Several call the blog post biased marketing, note overblown language, and warn against extrapolating broad claims from a few hand-picked examples.
- Benchmarks such as SWE-Bench, aider’s coding leaderboard, and LM Arena are referenced, but differences between the top models are seen as incremental, not decisive.
- A recurring theme: for most developers, any major provider’s top model is “good enough”; intelligence feels commoditized, and the real moat is tooling and integration.
- Many remain skeptical of claims that LLMs will soon replace most software engineers; they see them as powerful assistants for well-scoped tasks, but poor at sustained, large-scale, real-world coding without heavy human guidance.