Are LLM merge rates not getting better?
Overall improvement vs. “no progress” claim
- Many commenters say coding LLMs have clearly improved over the last 1–3 years: fewer edits, better refactoring, less duplicated code, more idiomatic output, and the ability to complete substantial features or apps.
- Others feel core coding ability is roughly flat over the last year: they still must deeply review, debug, and iterate, and gains feel incremental rather than transformative.
- Several concede that a plateau over the last few months is plausible, but reject "no improvement for a year," citing anecdotal step-changes between older and newer frontier models.
Missing models and data issues
- A common criticism of the article is that it omits widely reported strong models (e.g., recent Claude/Opus versions, Gemini) and compresses all OpenAI systems into a single point.
- With only a handful of heterogeneous data points (different labs, sizes, harnesses), many argue it’s invalid to fit simple linear or step functions and then generalize.
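The small-n objection can be made concrete with a toy regression. The sketch below fits an ordinary-least-squares line to five synthetic "merge rate" points (hypothetical data, purely for illustration, not from the article): with so few observations, the slope's 95% confidence interval spans zero, so "improving," "flat," and even "declining" trends are all consistent with the data.

```python
import math

# Five synthetic merge-rate-style points (illustrative, not real data).
xs = [0, 1, 2, 3, 4]
ys = [0.2, 0.5, 0.1, 0.6, 0.4]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n
sxx = sum((x - x_bar) ** 2 for x in xs)

# OLS slope and its standard error.
slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
sse = sum((y - (y_bar + slope * (x - x_bar))) ** 2 for x, y in zip(xs, ys))
se_slope = math.sqrt(sse / (n - 2) / sxx)

t_crit = 3.182  # t distribution critical value, df = 3, 95% two-sided
ci = (slope - t_crit * se_slope, slope + t_crit * se_slope)
print(f"slope = {slope:.2f}, 95% CI = ({ci[0]:.2f}, {ci[1]:.2f})")
# slope = 0.05, 95% CI = (-0.17, 0.27)  -> interval spans zero
```

Mixing labs, model sizes, and harnesses in those five points would widen the interval further, which is the commenters' point about generalizing from heterogeneous data.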
Benchmarks, merge rates, and eval design
- Commenters question whether “one-shot PR merge rate” is a good proxy for usefulness: it’s a very strict metric even for humans and sensitive to small errors or style mismatches.
- Some highlight the “emergent ability mirage”: staircase-looking curves can arise when a single failure kills an entire task; granular sub-metrics would be more informative.
- Others note that as models improve, users attempt more ambitious tasks, which can keep raw success rates flat.
- There is debate over the paper’s statistical treatment (cross-validation, Brier scores, ANOVA) and whether removing categories or combining models is methodologically sound.
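The "emergent ability mirage" point can be illustrated with a one-line model (a simplifying assumption: steps are independent with equal accuracy). If a task only counts as a success when every one of its steps succeeds, smooth per-step gains translate into abrupt-looking task-level jumps, which is why granular sub-metrics are more informative than one-shot pass rates.

```python
def task_success(step_accuracy: float, n_steps: int) -> float:
    """All-or-nothing success: one failed step kills the whole task."""
    return step_accuracy ** n_steps

# Smooth per-step improvement, staircase-looking task-level curve.
for p in (0.90, 0.95, 0.99):
    print(f"per-step {p:.2f} -> 20-step task {task_success(p, 20):.2f}")
# per-step 0.90 -> 20-step task 0.12
# per-step 0.95 -> 20-step task 0.36
# per-step 0.99 -> 20-step task 0.82
```

The same arithmetic explains why one-shot PR merge rates are strict even for humans: a single style mismatch or small error among many otherwise-correct edits zeroes out the whole attempt.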
Tooling, agents, and harness vs. intrinsic model gains
- Many say the biggest recent gains come from agentic tooling (Claude Code, Codex-like systems, IDE/CLI integration, tool use, planning loops, sub-agents), not just raw model IQ.
- Improved context management, self-checks, and auto-testing make coding “feel” dramatically better even if one-shot benchmarks move slowly.
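The harness effect commenters describe can be sketched as a generate-test-retry loop (a minimal toy, not any specific product's implementation; `generate` and `run_tests` here are hypothetical stand-ins): even a model with a mediocre one-shot rate "feels" much better when failures are caught by self-checks and fed back.

```python
def run_tests(patch: str) -> bool:
    # Stand-in for a real test runner; here a patch "passes" when it
    # contains no TODO markers (purely illustrative).
    return "TODO" not in patch

def agent_loop(generate, max_attempts: int = 3):
    """Generate a patch, run self-checks, feed failures back, retry."""
    feedback = ""
    for attempt in range(1, max_attempts + 1):
        patch = generate(feedback)
        if run_tests(patch):
            return patch, attempt
        feedback = f"attempt {attempt} failed tests; revise"
    return None, max_attempts

# Toy "model" that only succeeds after receiving failure feedback once.
drafts = iter(["TODO: fix edge case", "final patch"])
patch, attempts = agent_loop(lambda fb: next(drafts))
# patch == "final patch", attempts == 2
```

A one-shot benchmark scores this toy model 0/1, while the harnessed loop succeeds on its second attempt, which is the gap between "raw model IQ" and perceived capability.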
Trust, reliability, and limits
- LLMs still hallucinate, need supervision, and can make catastrophic mistakes (e.g., mis-editing infra configs).
- Trust and accountability are seen as major blockers in regulated or high-stakes domains; “LLM + human” is viewed as the realistic pattern for now.
- Several expect diminishing returns from scaling and see future gains mainly from better harnesses, workflows, and cost optimization.