Gemini 2.5 Pro vs. Claude 3.7 Sonnet: Coding Comparison

Access, Pricing, and Context Window

  • Gemini 2.5 Pro is now available free via Google’s web interface and AI Studio; some users note region or app limitations and mention Google One “AI Premium” as another route.
  • The 1M-token context window draws attention; people distinguish between the theoretical maximum and effective recall.
  • Several report Gemini models doing very well on “needle in a haystack” retrieval (a minimal sketch of such a test follows this list), but others question whether vendor benchmarks have been independently replicated and point to more challenging long-context tests than single-fact retrieval.
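For concreteness, a “needle in a haystack” test hides one fact inside a long filler document and checks whether the model can quote it back. Below is a minimal sketch in Python; `call_model` is a hypothetical stand-in for whatever chat API is being evaluated.

```python
import random

def build_haystack(needle: str, filler_paragraph: str, total_paragraphs: int = 2000) -> str:
    """Hide a single 'needle' sentence at a random position inside repeated filler text."""
    paragraphs = [filler_paragraph] * total_paragraphs
    paragraphs.insert(random.randrange(total_paragraphs), needle)
    return "\n\n".join(paragraphs)

def run_needle_test(call_model, needle_answer: str = "7319") -> bool:
    """call_model(prompt: str) -> str is a hypothetical wrapper around any chat API."""
    needle = f"The secret access code for the archive is {needle_answer}."
    filler = "The committee met again and postponed the decision to the following quarter."
    prompt = (
        "The document below contains a secret access code. "
        "Reply with the code and nothing else.\n\n"
        + build_haystack(needle, filler)
    )
    return needle_answer in call_model(prompt)
```

Tests in this style are easy to reproduce in spirit, which is partly why skeptics ask for harder long-context evaluations than single-fact lookup.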

Coding Quality: Gemini vs Claude vs Others

  • Experiences are highly split:
    • Some find Gemini 2.5 Pro strongest at writing new code from scratch and at complex reasoning over code, including finding subtle threading/logic issues (see the race-condition sketch after this list).
    • Many others say it’s worse than Claude 3.5/3.7 (and far behind o1 Pro) for everyday coding and large refactors, especially on 50k+ token prompts.
  • Common Gemini complaints: it touches unrelated code, refactors aggressively, obsesses over “advanced” changes (e.g., removing the GIL, OpenMP, optimizations that hurt common cases), omits subroutines or replaces them with stubs/comments, or writes “same as before” instead of full code.
  • Claude 3.7 is often described as more agentic but less obedient than 3.5, prone to overediting, chasing linter issues, or rewriting whole modules when only a move/rename was requested.
  • o1 Pro is widely regarded as best for hard debugging, but too expensive for many.
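As a concrete (hypothetical, not taken from the thread) example of the “subtle threading issue” category mentioned above, the snippet below shows a classic lost-update race that a reviewing model would ideally catch, alongside the lock-based fix:

```python
import threading
import time

counter = 0
lock = threading.Lock()

def unsafe_increment(iterations: int) -> None:
    """Unlocked read-modify-write: increments are lost when threads interleave."""
    global counter
    for _ in range(iterations):
        current = counter       # read
        time.sleep(0)           # explicit yield, standing in for any GIL release point
        counter = current + 1   # write back a possibly stale value

def safe_increment(iterations: int) -> None:
    """Same logic, but the lock makes the read-modify-write atomic."""
    global counter
    for _ in range(iterations):
        with lock:
            counter += 1

threads = [threading.Thread(target=unsafe_increment, args=(10_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)  # usually well below 40_000 with the unsafe version
```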

Greenfield Demos vs Real Projects

  • Many criticize the article’s tests as “toy” greenfield tasks (games, small apps) that any strong model can handle.
  • Multiple commenters say the real challenge is modifying large, messy existing codebases, respecting constraints, and not exploding tech debt.
  • Several demand benchmarks that involve adding features to, or porting, real OSS projects (e.g., porting a GTK3 UI layer to GTK4), with one maintainer explicitly offering such a task as a “can LLMs really code?” benchmark.

Tooling, Prompts, and Temperature

  • Results vary strongly with tooling:
    • Claude Code, Cursor, Windsurf, Aider, Cline, Roo, and MCP-based setups all get mentioned; some tools seem tuned for Claude and underutilize Gemini.
    • Users suggest diff-only edit formats and tailored system prompts, low temperature (~0–0.4) for reliable edits, and using AI mainly as an “intern + reviewer” rather than for full rewrites (a sketch follows this list).
  • Feeding in up-to-date docs and using repo-“flattening” scripts are reported to dramatically improve behavior, especially for non-mainstream APIs and libraries (a flattening sketch also follows).
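On the diff-only/low-temperature point, here is a minimal sketch using the Anthropic Python SDK; the model ID and system prompt are illustrative, and the same idea (constrain the output format, lower the temperature) applies to Gemini or any other API that exposes a temperature parameter.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Illustrative system prompt: ask for minimal, diff-only edits.
SYSTEM = (
    "You are a code-editing assistant. Return ONLY a unified diff for the requested "
    "change. Do not touch unrelated code, do not refactor, do not add commentary."
)

def request_patch(task: str, file_contents: str) -> str:
    """Ask for a patch at low temperature so edits stay small and repeatable."""
    response = client.messages.create(
        model="claude-3-7-sonnet-20250219",  # illustrative model ID
        max_tokens=4096,
        temperature=0.2,                     # within the ~0-0.4 range suggested above
        system=SYSTEM,
        messages=[{"role": "user", "content": f"{task}\n\nFILE:\n{file_contents}"}],
    )
    return response.content[0].text
```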
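And a minimal sketch of the repo-“flattening” idea: concatenate a project’s source files into one document with path headers so it can be pasted into a long-context prompt. The file extensions, skip list, and size cap below are arbitrary illustrative choices.

```python
from pathlib import Path

EXTENSIONS = {".py", ".ts", ".md", ".toml"}  # illustrative: which files to include
SKIP_DIRS = {".git", "node_modules", "__pycache__", ".venv"}
MAX_BYTES = 100_000                          # skip generated or very large files

def flatten_repo(root: str) -> str:
    """Concatenate source files under `root` into one prompt-friendly document."""
    chunks = []
    for path in sorted(Path(root).rglob("*")):
        if any(part in SKIP_DIRS for part in path.parts):
            continue
        if path.is_file() and path.suffix in EXTENSIONS and path.stat().st_size <= MAX_BYTES:
            text = path.read_text(encoding="utf-8", errors="replace")
            chunks.append(f"===== {path.relative_to(root)} =====\n{text}")
    return "\n\n".join(chunks)

if __name__ == "__main__":
    print(flatten_repo("."))  # pipe into a file or straight into a prompt
```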

Safety, Refusals, and Model Personality

  • Gemini sometimes refuses risky or “sloppy” solutions (SQL DELETEs, insecure networking, routing hacks), even ending a session with firm disclaimers.
  • Some appreciate this pushback as more honest than models that “yes-man” bad ideas; others see it as overbearing and want an override.

Hype, Benchmarks, and Overall View

  • Several call the blog post biased marketing, note overblown language, and warn against extrapolating broad claims from a few hand-picked examples.
  • Benchmarks like SWE-Bench, aider’s coding leaderboard, LM Arena, etc. are referenced, but differences between top models are seen as incremental, not decisive.
  • A recurring theme: for most developers, any major provider’s top model is “good enough”; intelligence feels commoditized, and the real moat is tooling and integration.
  • Many remain skeptical of claims that LLMs will soon replace most software engineers; they see them as powerful assistants for well-scoped tasks, but poor at sustained, large-scale, real-world coding without heavy human guidance.