SWE-bench Verified no longer measures frontier coding capabilities
Localization / UX Tangent
- Several comments complain about forced automatic translation of the OpenAI page and other apps, with no obvious way to switch back to English.
- People argue users should always be able to override the interface language, regardless of Accept-Language headers or IP-based geolocation.
Flaws in SWE-bench Verified
- OpenAI’s audit claims a large fraction of frequently failed tasks have flawed or overly strict tests (e.g., requiring specific function names or implementations not specified in the task).
- Commenters calculate this implies roughly one in six tasks is problematic, which many see as “extraordinarily” high.
- Others note this is still acceptable if the benchmark’s main job is ranking models relative to each other, not providing an absolute quality measure.
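To make the ranking argument concrete, here is a minimal simulation sketch. It assumes two hypothetical models with illustrative true pass rates (0.75 and 0.65) and treats the flawed ~1/6 of tasks as unwinnable for both; all numbers are assumptions for illustration, not figures from the thread. Under those assumptions the broken tests depress absolute scores but rarely change which model comes out ahead.

```python
import random

def rank_preserved(skill_a: float = 0.75, skill_b: float = 0.65, n_tasks: int = 500,
                   flawed_frac: float = 1 / 6, trials: int = 1000, seed: int = 0) -> float:
    """Fraction of trials in which the stronger model (A) still outscores the
    weaker model (B) when roughly one in six tasks is broken (modeled here as
    impossible to pass for either model)."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(trials):
        # Draw which tasks are flawed, shared by both models in this trial.
        flawed = [rng.random() < flawed_frac for _ in range(n_tasks)]
        # Non-flawed tasks pass with probability equal to each model's true skill.
        score_a = sum((not f) and rng.random() < skill_a for f in flawed)
        score_b = sum((not f) and rng.random() < skill_b for f in flawed)
        wins += score_a > score_b
    return wins / trials

if __name__ == "__main__":
    print(f"A ranked above B in {rank_preserved():.1%} of simulated runs")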
Contamination and Goalpost Moving
- Many highlight contamination: models can reproduce exact bug fixes or problem statements, proving benchmark items are in training data (a sketch of such a probe follows this list).
- Some see OpenAI’s move away from SWE-bench Verified as “moving the goalposts”; others say that’s inevitable and healthy once a benchmark is saturated or compromised.
- There is skepticism about all major labs avoiding benchmark leakage, given strong marketing incentives.
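A rough sketch of the kind of contamination check commenters describe: prompt the model with nothing but the original issue text, then test whether its proposed patch is a near-verbatim reproduction of the gold patch shipped with the benchmark. The helper below is hypothetical (not any lab's actual methodology), and the 0.9 similarity threshold is an arbitrary assumption.

```python
from difflib import SequenceMatcher

def looks_memorized(candidate_patch: str, gold_patch: str, threshold: float = 0.9) -> bool:
    """Flag a task as likely contaminated when the model's patch is a
    near-verbatim copy of the benchmark's gold patch."""
    similarity = SequenceMatcher(None, candidate_patch, gold_patch).ratio()
    return similarity >= threshold

# Usage: produce candidate_patch by prompting the model with only the issue text
# (no hints about the fix), then compare it against the dataset's gold patch.
# High similarity across many tasks suggests the fixes were seen during training.
```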
Interpreting High Scores (Opus, Mythos, etc.)
- Tension arises because some models report ~90%+ scores on a benchmark OpenAI now calls unreliable.
- Possible explanations discussed: contamination, benchmark-specific optimization (“benchmaxxing”), better “guessing” of repo-specific styles, or issues with OpenAI’s own audit.
- Several note that at very high scores you can't distinguish recall from reasoning; a model scoring 93% rather than 90% may simply have memorized more of the benchmark.
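A back-of-the-envelope check of how little separates those scores: on a 500-task suite, 93% vs 90% is a difference of just 15 tasks, and the standard error of a single ~90% score is already about 1.3 points. The numbers below are illustrative arithmetic, not figures quoted in the discussion.

```python
from math import sqrt

n_tasks = 500
p_low, p_high = 0.90, 0.93                        # hypothetical scores being compared

se = sqrt(p_low * (1 - p_low) / n_tasks)          # standard error of a single ~90% pass rate
extra_tasks = round((p_high - p_low) * n_tasks)   # tasks separating the two scores

print(f"standard error ≈ {se * 100:.1f} points")
print(f"93% vs 90% is {extra_tasks} tasks, about {(p_high - p_low) / se:.1f} standard errors")
```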
Broader Benchmark Critiques
- Benchmarks are seen as narrow, easily gamed, and quickly outdated once public; commenters draw comparisons to ImageNet's label errors, database "benchmarketing," and SPEC.
- Many advocate private or “blind” benchmarks, rotating or Olympiad-style test sets, or domain-specific hidden suites.
- Others argue static pass/fail coding tests miss what matters: integration into real workflows, long-context robustness, and agentic behavior.
Alternatives and New Directions
- Mentioned alternatives include SWE-bench Pro and follow-ons, ARC-AGI, game-based tests (Zork, StarCraft/Go), agent benchmarks, and third-party suites (e.g., coding/problem-solving evals).
- Some report stagnation in real-world coding quality despite large benchmark gains, suggesting overfitting to tests rather than genuine capability growth.