Are LLM merge rates not getting better?

Overall improvement vs. “no progress” claim

  • Many commenters say coding LLMs have clearly improved over the last 1–3 years: fewer edits, better refactoring, less duplicated code, more idiomatic output, and the ability to complete substantial features or apps.
  • Others feel core coding ability has been roughly flat over the last year: they still have to review closely, debug, and iterate, and the gains they perceive are modest and incremental.
  • Several note that a plateau over the last few months is plausible, but “no improvement for a year” is not, especially given anecdotal step-changes between older and newer frontier models.

Missing models and data issues

  • A common criticism of the article is that it omits widely reported strong models (e.g., recent Claude/Opus versions, Gemini) and compresses all OpenAI systems into a single point.
  • With only a handful of heterogeneous data points (different labs, sizes, harnesses), many argue it’s invalid to fit simple linear or step functions and then generalize.
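
A rough sketch of that concern (all numbers below are made up, and the breakpoint is arbitrary): with half a dozen noisy, heterogeneous points, a straight line and a step function fit about equally well, and leave-one-out error cannot reliably separate the two trend hypotheses.

```python
# Hypothetical data: ~6 noisy "merge rate" points spread over 20 months.
import numpy as np

months = np.array([0, 4, 8, 12, 16, 20], dtype=float)        # release dates (made up)
merge_rate = np.array([0.18, 0.22, 0.20, 0.31, 0.29, 0.33])  # merge rates (made up)

def loo_error(fit):
    """Leave-one-out mean absolute error for a fit(x, y) -> predict(t) factory."""
    errors = []
    for i in range(len(months)):
        keep = np.arange(len(months)) != i
        predict = fit(months[keep], merge_rate[keep])
        errors.append(abs(predict(months[i]) - merge_rate[i]))
    return float(np.mean(errors))

def linear_fit(x, y):
    slope, intercept = np.polyfit(x, y, 1)
    return lambda t: slope * t + intercept

def step_fit(x, y, breakpoint=10.0):                          # breakpoint chosen arbitrarily
    low, high = y[x < breakpoint].mean(), y[x >= breakpoint].mean()
    return lambda t: low if t < breakpoint else high

print("LOO error, linear trend:", loo_error(linear_fit))
print("LOO error, step change: ", loo_error(step_fit))
# With this few points the two errors come out close, so the data cannot
# distinguish "steady improvement" from "one step change", let alone forecast.
```

The point is not which fit wins on toy data, but that any winner is fragile when the points come from different labs, model sizes, and harnesses.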

Benchmarks, merge rates, and eval design

  • Commenters question whether “one-shot PR merge rate” is a good proxy for usefulness: it’s a very strict metric even for humans and sensitive to small errors or style mismatches.
  • Some highlight the “emergent ability mirage”: staircase-looking curves can arise when a single failure kills an entire task; granular sub-metrics would be more informative (see the sketch after this list).
  • Others note that as models improve, users attempt more ambitious tasks, which can keep raw success rates flat.
  • There is debate over the paper’s statistical treatment (cross-validation, Brier scores, ANOVA) and whether removing categories or combining models is methodologically sound.
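
One way to see the mirage point: if a PR merges only when every one of its sub-steps succeeds, a smoothly improving per-step success rate still produces an all-or-nothing merge rate that looks flat and then jumps. The step count and rates below are illustrative, not taken from the article or the thread.

```python
# All-or-nothing aggregation: a merge requires all n independent steps to pass.
n_steps = 20                      # hypothetical number of must-pass steps per PR
for p in [0.80, 0.85, 0.90, 0.95, 0.99]:
    merge_rate = p ** n_steps
    print(f"per-step success {p:.2f} -> one-shot merge rate {merge_rate:.3f}")
# per-step success 0.80 -> one-shot merge rate 0.012
# per-step success 0.85 -> one-shot merge rate 0.039
# per-step success 0.90 -> one-shot merge rate 0.122
# per-step success 0.95 -> one-shot merge rate 0.358
# per-step success 0.99 -> one-shot merge rate 0.818
```

The granular metric (per-step success) rises steadily while the headline merge rate sits near zero before climbing steeply, which is why commenters ask for sub-metrics rather than a single pass/fail number.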

Tooling, agents, and harness vs. intrinsic model gains

  • Many say the biggest recent gains come from agentic tooling (Claude Code, Codex-like systems, IDE/CLI integration, tool use, planning loops, sub-agents), not just raw model IQ.
  • Improved context management, self-checks, and auto-testing make coding “feel” dramatically better even if one-shot benchmarks move slowly.
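
A back-of-the-envelope sketch of the harness effect (assuming roughly independent retries, which real agent loops only approximate): the same model, allowed several attempts with automatic test feedback, turns a modest one-shot success rate into a much higher end-to-end rate.

```python
# Same model, different harness: k retries with auto-test feedback.
one_shot = 0.30                   # hypothetical one-shot task success rate
for attempts in [1, 2, 4, 8]:
    end_to_end = 1 - (1 - one_shot) ** attempts
    print(f"{attempts} attempt(s): end-to-end success {end_to_end:.2f}")
# 1 attempt(s): end-to-end success 0.30
# 2 attempt(s): end-to-end success 0.51
# 4 attempt(s): end-to-end success 0.76
# 8 attempt(s): end-to-end success 0.94
```

This is consistent with the thread’s framing: one-shot benchmarks can move slowly while day-to-day experience improves substantially.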

Trust, reliability, and limits

  • LLMs still hallucinate, need supervision, and can make catastrophic mistakes (e.g., mis-editing infra configs).
  • Trust and accountability are seen as major blockers in regulated or high-stakes domains; “LLM + human” is viewed as the realistic pattern for now.
  • Several expect diminishing returns from scaling and see future gains mainly from better harnesses, workflows, and cost optimization.