Many SWE-bench-passing PRs would not be merged

Limits of SWE-bench and test-based evals

  • Many argue SWE-bench mainly measures “does it make tests pass,” not “would a maintainer merge this.”
  • Tests miss spec/intent alignment, scope creep, architectural fit, style, and team risk tolerance.
  • Some note that even on SWE-bench, a significant fraction of test-passing changes still don’t actually solve the intended issue.
  • Passing benchmarks is seen as a weak, directional signal at best, not a proxy for real-world usefulness.

Quality of LLM-generated code vs correctness

  • Common experience: models produce code that works but is verbose, convoluted, and hard to maintain.
  • Behavior resembles an over-eager junior engineer: will do anything to satisfy tests, often adding complexity instead of refactoring.
  • Users often shrink LLM-produced code to a fraction of its size while improving clarity.
  • Good results typically require strong human steering, planning, and review; without that, “green CI” hides landmines.

Benchmarks, progress, and gaming

  • Some see a clear capability trend; others think the article suggests a plateau, or that improvements are an artifact of overfitting to benchmarks.
  • Concern that public benchmarks get baked into training data and optimized for, inflating scores.
  • Suggestions: aggregate scores across many evals over time; alternative metrics like diff size, abstraction depth, new dependencies.
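Metrics like diff size and new dependencies can be pulled straight from a patch. A minimal sketch of that idea, assuming unified-diff input; the import-detection regex and the returned fields are illustrative choices, not from any published eval:

```python
import re

def diff_metrics(unified_diff: str) -> dict:
    """Crude patch-level signals: lines added/removed and new imports.

    A hypothetical complement to pass/fail scores; the import heuristic
    only catches top-level Python `import`/`from` lines added by the patch.
    """
    added = removed = 0
    new_imports = set()
    for line in unified_diff.splitlines():
        if line.startswith("+") and not line.startswith("+++"):
            added += 1
            m = re.match(r"\+\s*(?:import|from)\s+([\w.]+)", line)
            if m:
                new_imports.add(m.group(1).split(".")[0])
        elif line.startswith("-") and not line.startswith("---"):
            removed += 1
    return {"added": added, "removed": removed,
            "new_imports": sorted(new_imports)}
```

Aggregated over many patches, such numbers give a rough "is the agent bloating the codebase" signal that test results alone miss.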

Repo-specific and structural evaluations

  • Several participants are building “evals for your repo” that compare agent output to original PRs, check code quality, and enforce local patterns.
  • Ideas for structural signals: cyclomatic complexity, codebase “entropy,” diff size, AST-based complexity, and custom lints enforcing architecture.
  • Tests are still needed, but seen as only one dimension alongside style and design constraints.
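An AST-based complexity signal of the kind mentioned above can be computed with Python's standard `ast` module. A minimal sketch; the set of node types counted as branches is an assumption (a rough McCabe-style proxy), not a standard definition:

```python
import ast

# Node types treated as decision points; this set is illustrative.
_BRANCHES = (ast.If, ast.For, ast.While, ast.ExceptHandler,
             ast.BoolOp, ast.IfExp)

def complexity(source: str) -> int:
    """Approximate cyclomatic complexity of a module:
    1 plus one per branch-like AST node."""
    tree = ast.parse(source)
    return 1 + sum(isinstance(node, _BRANCHES) for node in ast.walk(tree))
```

Running this before and after an agent's change, and failing the eval when complexity jumps disproportionately to the diff size, is one way to encode "don't add complexity instead of refactoring" as a check.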

Human workflows, psychology, and tooling

  • Maintainers face rising noise from AI-generated PRs; they are not obligated to review everything and may silently ignore low-quality submissions.
  • Some see prejudice against any AI-assisted code; others emphasize that weak PR descriptions and obvious “slop” justify rejection.
  • Pragmatic workflows include: multi-model pipelines (generate then simplify), custom linters, explicit architecture docs, and even tests that fail on disallowed patterns.
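The "tests that fail on disallowed patterns" idea can be a short pytest-style check. A sketch assuming a hypothetical deny-list of banned builtins and a `src/` layout; both are placeholders to adapt per repo:

```python
import ast
import pathlib

BANNED_CALLS = {"eval", "exec"}  # illustrative deny-list; adjust per repo

def find_banned_calls(source: str) -> list:
    """Return names of banned builtins called anywhere in the source."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if (isinstance(node, ast.Call)
                and isinstance(node.func, ast.Name)
                and node.func.id in BANNED_CALLS):
            hits.append(node.func.id)
    return hits

def test_no_banned_calls():
    # Scan every Python file under src/; fail CI on any banned call.
    for path in pathlib.Path("src").rglob("*.py"):
        hits = find_banned_calls(path.read_text())
        assert not hits, f"banned call(s) {hits} in {path}"
```

Because it runs in the ordinary test suite, this turns architectural rules into the same red/green signal agents already optimize for.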

Long-term concerns

  • Fears of a future filled with inscrutable, agent-maintained codebases are voiced; others counter that humans can still understand very messy code, and that current capability gains may be plateauing anyway.