Many SWE-bench-Passing PRs would not be merged
Limits of SWE-bench and test-based evals
- Many argue SWE-bench mainly measures “does it make tests pass,” not “would a maintainer merge this.”
- Tests miss spec/intent alignment, scope creep, architectural fit, style, and team risk tolerance.
- Some note that even on SWE-bench, a significant fraction of test-passing changes do not actually resolve the intended issue.
- Passing benchmarks is seen as a weak, directional signal at best, not a proxy for real-world usefulness.
Quality of LLM-generated code vs correctness
- Common experience: models produce code that works but is verbose, convoluted, and hard to maintain.
- Behavior resembles an over-eager junior engineer: will do anything to satisfy tests, often adding complexity instead of refactoring.
- Users often shrink LLM-produced code to a fraction of its size while improving clarity.
- Good results typically require strong human steering, planning, and review; without that, “green CI” hides landmines.
Benchmarks, progress, and gaming
- Some see a clear capability trend; others think the article suggests a plateau, or that improvements are an artifact of overfitting to benchmarks.
- Concern that public benchmarks get baked into training data and optimized for, inflating scores.
- Suggestions: aggregate scores across many evals over time; alternative metrics like diff size, abstraction depth, new dependencies.
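One of the suggested alternative metrics, diff size, is straightforward to compute. A minimal sketch (the function name and the hunk-header handling are illustrative choices, not anything from the thread):

```python
def diff_size(unified_diff: str) -> int:
    """Count added plus removed lines in a unified diff.

    File headers ("+++"/"---") are excluded; hunk headers ("@@") never
    start with "+" or "-", so they are skipped automatically.
    """
    return sum(
        1
        for line in unified_diff.splitlines()
        if line.startswith(("+", "-"))
        and not line.startswith(("+++", "---"))
    )
```

Tracked over time, a metric like this can flag agents that "solve" issues by rewriting far more code than a human reviewer would accept.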
Repo-specific and structural evaluations
- Several participants are building “evals for your repo” that compare agent output to original PRs, check code quality, and enforce local patterns.
- Ideas for structural signals: cyclomatic complexity, codebase “entropy,” diff size, AST-based complexity, and custom lints enforcing architecture.
- Tests are still needed, but seen as only one dimension alongside style and design constraints.
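Several of these structural signals are cheap to approximate. For example, a rough cyclomatic-complexity measure for Python can be built on the standard-library `ast` module (a sketch; the exact set of nodes counted as branch points is a judgment call, and real tools like `radon` are more careful):

```python
import ast

# Node types treated as branch points: control flow, boolean operators,
# conditional expressions, and comprehension clauses.
BRANCH_NODES = (
    ast.If, ast.For, ast.While, ast.ExceptHandler,
    ast.And, ast.Or, ast.IfExp, ast.comprehension,
)

def cyclomatic_complexity(source: str) -> int:
    """Approximate cyclomatic complexity: 1 + number of branch points."""
    tree = ast.parse(source)
    return 1 + sum(isinstance(node, BRANCH_NODES) for node in ast.walk(tree))
```

Running this on a file before and after an agent's patch gives a crude but automatable signal for "did this change add complexity instead of refactoring."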
Human workflows, psychology, and tooling
- Maintainers face rising noise from AI-generated PRs; they are not obligated to review everything and may silently ignore low-quality submissions.
- Some see prejudice against any AI-assisted code; others emphasize that weak PR descriptions and obvious “slop” justify rejection.
- Pragmatic workflows include: multi-model pipelines (generate then simplify), custom linters, explicit architecture docs, and even tests that fail on disallowed patterns.
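The last idea, tests that fail on disallowed patterns, can be as simple as an AST scan wired into the test suite. A minimal sketch (the banned-module set and the `src/` layout are hypothetical examples of a per-repo policy):

```python
import ast
import pathlib

# Hypothetical per-repo policy: modules agents must not introduce.
BANNED_IMPORTS = {"pickle", "telnetlib"}

def find_banned_imports(source: str, banned: set[str]) -> list[str]:
    """Return the names of banned modules imported in `source`."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            hits += [a.name for a in node.names
                     if a.name.split(".")[0] in banned]
        elif isinstance(node, ast.ImportFrom) and node.module:
            if node.module.split(".")[0] in banned:
                hits.append(node.module)
    return hits

def test_no_banned_imports():
    # Fails CI if any source file pulls in a disallowed module.
    for path in pathlib.Path("src").rglob("*.py"):
        assert not find_banned_imports(path.read_text(), BANNED_IMPORTS), path
```

Because it runs as an ordinary test, an agent optimizing for green CI is forced to respect the constraint rather than route around a human reviewer.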
Long-term concerns
- Fears of a future filled with inscrutable, agent-maintained codebases are voiced, but others argue humans can still understand very messy code and that current gains may be plateauing.