Are LLM merge rates not getting better?

Overall improvement vs. “no progress” claim

  • Many commenters say coding LLMs have clearly improved over the last 1–3 years: fewer edits, better refactoring, less duplicated code, more idiomatic output, and the ability to complete substantial features or apps.
  • Others feel core coding ability has been roughly flat over the last year: they still have to review closely, debug, and iterate, and the gains they perceive are modest and incremental.
  • Several note that a plateau over the last few months is plausible, but “no improvement for a year” is not, especially given anecdotal step-changes between older and newer frontier models.

Missing models and data issues

  • A common criticism of the article is that it omits widely reported strong models (e.g., recent Claude/Opus versions, Gemini) and compresses all OpenAI systems into a single point.
  • With only a handful of heterogeneous data points (different labs, sizes, harnesses), many argue it’s invalid to fit simple linear or step functions and then generalize.
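
A rough sketch of that concern (all numbers below are made up, and the breakpoint is arbitrary): with half a dozen noisy, heterogeneous points, a straight line and a step function fit about equally well, and leave-one-out error cannot reliably separate the two trend hypotheses.

```python
# Hypothetical data: ~6 noisy "merge rate" points spread over 20 months.
import numpy as np

months = np.array([0, 4, 8, 12, 16, 20], dtype=float)        # release dates (made up)
merge_rate = np.array([0.18, 0.22, 0.20, 0.31, 0.29, 0.33])  # merge rates (made up)

def loo_error(fit):
    """Leave-one-out mean absolute error for a fit(x, y) -> predict(t) factory."""
    errors = []
    for i in range(len(months)):
        keep = np.arange(len(months)) != i
        predict = fit(months[keep], merge_rate[keep])
        errors.append(abs(predict(months[i]) - merge_rate[i]))
    return float(np.mean(errors))

def linear_fit(x, y):
    slope, intercept = np.polyfit(x, y, 1)
    return lambda t: slope * t + intercept

def step_fit(x, y, breakpoint=10.0):                          # breakpoint chosen arbitrarily
    low, high = y[x < breakpoint].mean(), y[x >= breakpoint].mean()
    return lambda t: low if t < breakpoint else high

print("LOO error, linear trend:", loo_error(linear_fit))
print("LOO error, step change: ", loo_error(step_fit))
# With this few points the two errors come out close, so the data cannot
# distinguish "steady improvement" from "one step change", let alone forecast.
```

The point is not which fit wins on toy data, but that any winner is fragile when the points come from different labs, model sizes, and harnesses.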

Benchmarks, merge rates, and eval design

  • Commenters question whether “one-shot PR merge rate” is a good proxy for usefulness: it’s a very strict metric even for humans and sensitive to small errors or style mismatches.
  • Some highlight the “emergent ability mirage”: staircase-looking curves can arise when a single failure kills an entire task; granular sub-metrics would be more informative (see the sketch after this list).
  • Others note that as models improve, users attempt more ambitious tasks, which can keep raw success rates flat.
  • There is debate over the paper’s statistical treatment (cross-validation, Brier scores, ANOVA) and whether removing categories or combining models is methodologically sound.
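
One way to see the mirage point: if a PR merges only when every one of its sub-steps succeeds, a smoothly improving per-step success rate still produces an all-or-nothing merge rate that looks flat and then jumps. The step count and rates below are illustrative, not taken from the article or the thread.

```python
# All-or-nothing aggregation: a merge requires all n independent steps to pass.
n_steps = 20                      # hypothetical number of must-pass steps per PR
for p in [0.80, 0.85, 0.90, 0.95, 0.99]:
    merge_rate = p ** n_steps
    print(f"per-step success {p:.2f} -> one-shot merge rate {merge_rate:.3f}")
# per-step success 0.80 -> one-shot merge rate 0.012
# per-step success 0.85 -> one-shot merge rate 0.039
# per-step success 0.90 -> one-shot merge rate 0.122
# per-step success 0.95 -> one-shot merge rate 0.358
# per-step success 0.99 -> one-shot merge rate 0.818
```

The granular metric (per-step success) rises steadily while the headline merge rate sits near zero before climbing steeply, which is why commenters ask for sub-metrics rather than a single pass/fail number.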

Tooling, agents, and harness vs. intrinsic model gains

  • Many say the biggest recent gains come from agentic tooling (Claude Code, Codex-like systems, IDE/CLI integration, tool use, planning loops, sub-agents), not just raw model IQ.
  • Improved context management, self-checks, and auto-testing make coding “feel” dramatically better even if one-shot benchmarks move slowly.
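
A back-of-the-envelope sketch of the harness effect (assuming roughly independent retries, which real agent loops only approximate): the same model, allowed several attempts with automatic test feedback, turns a modest one-shot success rate into a much higher end-to-end rate.

```python
# Same model, different harness: k retries with auto-test feedback.
one_shot = 0.30                   # hypothetical one-shot task success rate
for attempts in [1, 2, 4, 8]:
    end_to_end = 1 - (1 - one_shot) ** attempts
    print(f"{attempts} attempt(s): end-to-end success {end_to_end:.2f}")
# 1 attempt(s): end-to-end success 0.30
# 2 attempt(s): end-to-end success 0.51
# 4 attempt(s): end-to-end success 0.76
# 8 attempt(s): end-to-end success 0.94
```

This is consistent with the thread’s framing: one-shot benchmarks can move slowly while day-to-day experience improves substantially.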

Trust, reliability, and limits

  • LLMs still hallucinate, need supervision, and can make catastrophic mistakes (e.g., mis-editing infra configs).
  • Trust and accountability are seen as major blockers in regulated or high-stakes domains; “LLM + human” is viewed as the realistic pattern for now.
  • Several expect diminishing returns from scaling and see future gains mainly from better harnesses, workflows, and cost optimization.