Some critical issues with the SWE-bench dataset
Benchmark Leakage and Reliability
- Discussion centers on the claim that ~33% of SWE-Bench “successes” are actually cases where the solution is stated or heavily hinted in the issue/comments, inflating pass rates (e.g., a reported 12% drops to ~4% once suspect cases are filtered out).
- Many see this as confirming that public benchmarks are being gamed or at least unintentionally measuring “training set lookup” rather than genuine problem solving.
- Others push back:
  - The official SWE-Bench authors say the `hints_text` field is not used for leaderboard runs, so some leakage claims may depend on non-standard usage.
  - Some example “bad” evaluations in the new paper look wrong: AI patches appear functionally equivalent to human ones, or differences are purely stylistic.
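The deflation arithmetic behind the first bullet can be made explicit. A minimal sketch (function name and figures are illustrative, not taken from the paper):

```python
def corrected_pass_rate(reported_rate: float, invalid_fraction: float) -> float:
    """Deflate a benchmark pass rate by discarding invalid 'successes'.

    invalid_fraction is the share of reported passes judged invalid,
    e.g. because the fix was stated or hinted in the issue text.
    (Both inputs here are hypothetical placeholders.)
    """
    return reported_rate * (1.0 - invalid_fraction)

# Removing only the ~33% of passes with leaked solutions:
print(round(corrected_pass_rate(0.12, 0.33), 3))  # 0.08
```

Note that excluding only the leaked third of successes leaves ~8%, so the ~4% figure cited in the discussion presumably reflects additional exclusions beyond leakage alone; worth keeping in mind when comparing headline numbers.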
Real-World Coding vs Benchmark Claims
- Many commenters say the reduced pass rates match their lived experience: models are decent helpers but poor autonomous coders on non-trivial, unique codebases.
- LLMs are seen as strong on: boilerplate, training loops, CRUD/front-end, small tasks, and working with popular stacks.
- They perform poorly on: niche/novel problems, complex refactors, integration into large legacy systems, and tasks where no online pattern exists.
Agentic Coders vs Autocomplete Tools
- Strong preference expressed for “AI Intellisense” (e.g., Copilot-style inline completion) over fully agentic tools (Cursor, Devin-like systems).
- Reason: tight feedback loop and minimal prompting vs agents that wander off generating large, error-prone diffs.
- Several claim 50–80% of their typed code is AI-completed in certain domains; others report <10% useful output, highlighting variability by domain, skill, and workflow.
Designing Better Benchmarks
- Suggestions:
  - Periodic, versioned, post-training datasets built from fresh GitHub issues, with strong tests.
  - Private / team-specific eval suites, more like interview question sets, manually judged.
- Concerns raised about volunteer labor, data being absorbed into training corpora, and vendor cheating if benchmarks are public.
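The “fresh issues” suggestion above can be sketched with GitHub’s public search API. This is a minimal sketch only: the repo name and cutoff date are placeholders, and a real benchmark would also need to locate each issue’s linked fix PR and its regression tests.

```python
import json
import urllib.parse
import urllib.request


def issue_query(repo: str, since: str) -> str:
    """Build a GitHub search query for issues closed after a training
    cutoff date (ISO format, e.g. '2025-01-01')."""
    return f"repo:{repo} is:issue is:closed closed:>{since}"


def fresh_closed_issues(repo: str, since: str, per_page: int = 20) -> list[dict]:
    """Fetch candidate post-cutoff issues for manual triage.

    Returns issue number, title, and body; filtering for issues with
    a linked fix PR and strong tests would require further lookups.
    """
    url = ("https://api.github.com/search/issues?"
           + urllib.parse.urlencode({"q": issue_query(repo, since),
                                     "per_page": per_page}))
    req = urllib.request.Request(
        url, headers={"Accept": "application/vnd.github+json"})
    with urllib.request.urlopen(req) as resp:
        items = json.load(resp)["items"]
    return [{"number": i["number"], "title": i["title"], "body": i["body"]}
            for i in items]
```

Versioning each pull (e.g., by cutoff date) would address the “absorbed into training corpora” concern: once a snapshot’s issues predate a new model’s training data, that snapshot is retired and a fresh one is cut.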
Critiques of the Paper Itself
- Some commenters argue the new paper mischaracterizes SWE-Bench (e.g., claiming `hints_text` is used) and mislabels correct AI patches as incorrect.
- A few conclude the paper’s own errors are severe enough that its conclusions should be treated cautiously, even though the general problem of benchmark contamination is acknowledged as real.
Broader Reflections
- Multiple references to Goodhart’s Law: once benchmarks become marketing targets, they stop being reliable measures.
- Widespread skepticism toward headline claims like “PhD-level reasoning,” given mediocre performance on everyday coding and reasoning tasks.