Many SWE-bench-passing PRs would not be merged

Limits of SWE-bench and test-based evals

  • Many argue SWE-bench mainly measures “does it make tests pass,” not “would a maintainer merge this.”
  • Tests miss spec/intent alignment, scope creep, architectural fit, style, and team risk tolerance.
  • Some note that even on SWE-bench, a significant fraction of test-passing changes still don’t actually solve the intended issue.
  • Passing benchmarks is seen as a weak, directional signal at best, not a proxy for real-world usefulness.

Quality of LLM-generated code vs correctness

  • Common experience: models produce code that works but is verbose, convoluted, and hard to maintain.
  • Behavior resembles an over-eager junior engineer: will do anything to satisfy tests, often adding complexity instead of refactoring.
  • Users often shrink LLM-produced code to a fraction of its size while improving clarity.
  • Good results typically require strong human steering, planning, and review; without that, “green CI” hides landmines.

Benchmarks, progress, and gaming

  • Some see a clear capability trend; others think the article suggests a plateau, or that improvements are an artifact of overfitting to benchmarks.
  • Concern that public benchmarks get baked into training data and optimized for, inflating scores.
  • Suggestions: aggregate scores across many evals over time; alternative metrics like diff size, abstraction depth, new dependencies.
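Metrics like diff size and new dependencies can be pulled straight from a patch. A minimal sketch of that idea, assuming unified-diff input; the import-detection regex and the returned fields are illustrative choices, not from any published eval:

```python
import re

def diff_metrics(unified_diff: str) -> dict:
    """Crude patch-level signals: lines added/removed and new imports.

    A hypothetical complement to pass/fail scores; the import heuristic
    only catches top-level Python `import`/`from` lines added by the patch.
    """
    added = removed = 0
    new_imports = set()
    for line in unified_diff.splitlines():
        if line.startswith("+") and not line.startswith("+++"):
            added += 1
            m = re.match(r"\+\s*(?:import|from)\s+([\w.]+)", line)
            if m:
                new_imports.add(m.group(1).split(".")[0])
        elif line.startswith("-") and not line.startswith("---"):
            removed += 1
    return {"added": added, "removed": removed,
            "new_imports": sorted(new_imports)}
```

Aggregated over many patches, such numbers give a rough "is the agent bloating the codebase" signal that test results alone miss.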

Repo-specific and structural evaluations

  • Several participants are building “evals for your repo” that compare agent output to original PRs, check code quality, and enforce local patterns.
  • Ideas for structural signals: cyclomatic complexity, codebase “entropy,” diff size, AST-based complexity, and custom lints enforcing architecture.
  • Tests are still needed, but seen as only one dimension alongside style and design constraints.
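An AST-based complexity signal of the kind mentioned above can be computed with Python's standard `ast` module. A minimal sketch; the set of node types counted as branches is an assumption (a rough McCabe-style proxy), not a standard definition:

```python
import ast

# Node types treated as decision points; this set is illustrative.
_BRANCHES = (ast.If, ast.For, ast.While, ast.ExceptHandler,
             ast.BoolOp, ast.IfExp)

def complexity(source: str) -> int:
    """Approximate cyclomatic complexity of a module:
    1 plus one per branch-like AST node."""
    tree = ast.parse(source)
    return 1 + sum(isinstance(node, _BRANCHES) for node in ast.walk(tree))
```

Running this before and after an agent's change, and failing the eval when complexity jumps disproportionately to the diff size, is one way to encode "don't add complexity instead of refactoring" as a check.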

Human workflows, psychology, and tooling

  • Maintainers face rising noise from AI-generated PRs; they are not obligated to review everything and may silently ignore low-quality submissions.
  • Some see prejudice against any AI-assisted code; others emphasize that weak PR descriptions and obvious “slop” justify rejection.
  • Pragmatic workflows include: multi-model pipelines (generate then simplify), custom linters, explicit architecture docs, and even tests that fail on disallowed patterns.
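The "tests that fail on disallowed patterns" idea can be a short pytest-style check. A sketch assuming a hypothetical deny-list of banned builtins and a `src/` layout; both are placeholders to adapt per repo:

```python
import ast
import pathlib

BANNED_CALLS = {"eval", "exec"}  # illustrative deny-list; adjust per repo

def find_banned_calls(source: str) -> list:
    """Return names of banned builtins called anywhere in the source."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if (isinstance(node, ast.Call)
                and isinstance(node.func, ast.Name)
                and node.func.id in BANNED_CALLS):
            hits.append(node.func.id)
    return hits

def test_no_banned_calls():
    # Scan every Python file under src/; fail CI on any banned call.
    for path in pathlib.Path("src").rglob("*.py"):
        hits = find_banned_calls(path.read_text())
        assert not hits, f"banned call(s) {hits} in {path}"
```

Because it runs in the ordinary test suite, this turns architectural rules into the same red/green signal agents already optimize for.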

Long-term concerns

  • Fears of a future filled with inscrutable, agent-maintained codebases are voiced; others counter that humans can still understand very messy code, and that current capability gains may be plateauing anyway.