Some critical issues with the SWE-bench dataset
Benchmark Leakage and Reliability
- Discussion centers on the claim that ~33% of SWE-Bench “successes” are actually cases where the solution is stated or heavily hinted in the issue/comments, inflating pass rates (e.g., a reported 12% drops to ~4% once suspect cases are filtered out).
- Many see this as confirming that public benchmarks are being gamed or at least unintentionally measuring “training set lookup” rather than genuine problem solving.
- Others push back:
  - The official SWE-Bench authors say the `hints_text` field is not used for leaderboard runs, so some leakage claims may depend on non-standard usage.
  - Some example “bad” evaluations in the new paper look wrong: AI patches appear functionally equivalent to human ones, or differences are purely stylistic.
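The deflation arithmetic behind the first bullet can be made explicit. A minimal sketch (function name and figures are illustrative, not taken from the paper):

```python
def corrected_pass_rate(reported_rate: float, invalid_fraction: float) -> float:
    """Deflate a benchmark pass rate by discarding invalid 'successes'.

    invalid_fraction is the share of reported passes judged invalid,
    e.g. because the fix was stated or hinted in the issue text.
    (Both inputs here are hypothetical placeholders.)
    """
    return reported_rate * (1.0 - invalid_fraction)

# Removing only the ~33% of passes with leaked solutions:
print(round(corrected_pass_rate(0.12, 0.33), 3))  # 0.08
```

Note that excluding only the leaked third of successes leaves ~8%, so the ~4% figure cited in the discussion presumably reflects additional exclusions beyond leakage alone; worth keeping in mind when comparing headline numbers.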
Real-World Coding vs Benchmark Claims
- Many commenters say the reduced pass rates match their lived experience: models are decent helpers but poor autonomous coders on non-trivial, unique codebases.
- LLMs are seen as strong on: boilerplate, training loops, CRUD/front-end, small tasks, and working with popular stacks.
- They perform poorly on: niche/novel problems, complex refactors, integration into large legacy systems, and tasks where no online pattern exists.
Agentic Coders vs Autocomplete Tools
- Strong preference expressed for “AI Intellisense” (e.g., Copilot-style inline completion) over fully agentic tools (Cursor, Devin-like systems).
- Reason: tight feedback loop and minimal prompting vs agents that wander off generating large, error-prone diffs.
- Several claim 50–80% of their typed code is AI-completed in certain domains; others report <10% useful output, highlighting variability by domain, skill, and workflow.
Designing Better Benchmarks
- Suggestions:
  - Periodic, versioned, post-training datasets built from fresh GitHub issues, with strong tests.
  - Private / team-specific eval suites, more like interview question sets, manually judged.
- Concerns raised about volunteer labor, data being absorbed into training corpora, and vendor cheating if benchmarks are public.
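The “fresh issues” suggestion above can be sketched with GitHub’s public search API. This is a minimal sketch only: the repo name and cutoff date are placeholders, and a real benchmark would also need to locate each issue’s linked fix PR and its regression tests.

```python
import json
import urllib.parse
import urllib.request


def issue_query(repo: str, since: str) -> str:
    """Build a GitHub search query for issues closed after a training
    cutoff date (ISO format, e.g. '2025-01-01')."""
    return f"repo:{repo} is:issue is:closed closed:>{since}"


def fresh_closed_issues(repo: str, since: str, per_page: int = 20) -> list[dict]:
    """Fetch candidate post-cutoff issues for manual triage.

    Returns issue number, title, and body; filtering for issues with
    a linked fix PR and strong tests would require further lookups.
    """
    url = ("https://api.github.com/search/issues?"
           + urllib.parse.urlencode({"q": issue_query(repo, since),
                                     "per_page": per_page}))
    req = urllib.request.Request(
        url, headers={"Accept": "application/vnd.github+json"})
    with urllib.request.urlopen(req) as resp:
        items = json.load(resp)["items"]
    return [{"number": i["number"], "title": i["title"], "body": i["body"]}
            for i in items]
```

Versioning each pull (e.g., by cutoff date) would address the “absorbed into training corpora” concern: once a snapshot’s issues predate a new model’s training data, that snapshot is retired and a fresh one is cut.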
Critiques of the Paper Itself
- Some commenters argue the new paper mischaracterizes SWE-Bench (e.g., claiming `hints_text` is used) and mislabels correct AI patches as incorrect.
- A few conclude the paper’s own errors are severe enough that its conclusions should be treated cautiously, even though the general problem of benchmark contamination is acknowledged as real.
Broader Reflections
- Multiple references to Goodhart’s Law: once benchmarks become marketing targets, they stop being reliable measures.
- Widespread skepticism toward headline claims like “PhD-level reasoning,” given mediocre performance on everyday coding and reasoning tasks.