SWE-bench Verified no longer measures frontier coding capabilities

Localization / UX Tangent

  • Several comments complain about forced automatic translation of the OpenAI page and other apps, with no obvious way to switch back to English.
  • People argue users should always be able to override language, independent of headers or IP-based geolocation.

Flaws in SWE-bench Verified

  • OpenAI’s audit claims a large fraction of frequently-failed tasks have flawed or overly strict tests (e.g., requiring specific function names or implementations not specified in the task).
  • Commenters calculate this implies roughly one in six tasks is problematic, which many see as “extraordinarily” high.
  • Others note this is still acceptable if the benchmark’s main job is ranking models relative to each other, not providing an absolute quality measure.

Contamination and Goalpost Moving

  • Many highlight contamination: models can reproduce exact bug fixes or problem statements, proving benchmark items are in training data.
  • Some see OpenAI’s move away from SWE-bench Verified as “moving the goalposts”; others say that’s inevitable and healthy once a benchmark is saturated or compromised.
  • There is skepticism about all major labs avoiding benchmark leakage, given strong marketing incentives.

Interpreting High Scores (Opus, Mythos, etc.)

  • Tension arises because some models report scores above 90% on a benchmark OpenAI now calls unreliable.
  • Possible explanations discussed: contamination, benchmark-specific optimization (“benchmaxxing”), better “guessing” of repo-specific styles, or issues with OpenAI’s own audit.
  • Several note that at very high scores you can’t distinguish recall from reasoning; a model scoring 93% versus one at 90% may simply have memorized more.

Broader Benchmark Critiques

  • Benchmarks are seen as narrow, easily gamed, and quickly outdated once public; commenters draw comparisons to ImageNet’s label errors, database “benchmarketing,” and SPEC gaming.
  • Many advocate private or “blind” benchmarks, rotating or Olympiad-style test sets, or domain-specific hidden suites.
  • Others argue static pass/fail coding tests miss what matters: integration into real workflows, long-context robustness, and agentic behavior.

Alternatives and New Directions

  • Mentioned alternatives include SWE-bench Pro and follow-ons, ARC-AGI, game-based tests (Zork, StarCraft/Go), agent benchmarks, and third-party suites (e.g., coding/problem-solving evals).
  • Some report stagnation in real-world coding quality despite large benchmark gains, suggesting overfitting to tests rather than genuine capability growth.