SWE-bench Verified no longer measures frontier coding capabilities

Localization / UX Tangent

  • Several comments complain about forced automatic translation of the OpenAI page and other apps, with no obvious way to switch back to English.
  • People argue users should always be able to override language, independent of headers or IP-based geolocation.

Flaws in SWE-bench Verified

  • OpenAI’s audit claims a large fraction of frequently-failed tasks have flawed or overly strict tests (e.g., requiring specific function names or implementations not specified in the task).
  • Commenters calculate this implies roughly one in six tasks is problematic, which many see as “extraordinarily” high.
  • Others note this is still acceptable if the benchmark’s main job is ranking models relative to each other, not providing an absolute quality measure.

Contamination and Goalpost Moving

  • Many highlight contamination: models can reproduce exact bug fixes or problem statements, proving benchmark items are in training data.
  • Some see OpenAI’s move away from SWE-bench Verified as “moving the goalposts”; others say that’s inevitable and healthy once a benchmark is saturated or compromised.
  • There is skepticism about all major labs avoiding benchmark leakage, given strong marketing incentives.

Interpreting High Scores (Opus, Mythos, etc.)

  • Tension arises because some models report scores above 90% on a benchmark OpenAI now calls unreliable.
  • Possible explanations discussed: contamination, benchmark-specific optimization (“benchmaxxing”), better “guessing” of repo-specific styles, or issues with OpenAI’s own audit.
  • Several note that at very high scores you can’t distinguish recall from reasoning; a model scoring 93% versus one at 90% may simply have memorized more.

Broader Benchmark Critiques

  • Benchmarks are seen as narrow, easily gamed, and quickly outdated once public; commenters draw comparisons to ImageNet’s label errors, database “benchmarketing,” and SPEC gaming.
  • Many advocate private or “blind” benchmarks, rotating or Olympiad-style test sets, or domain-specific hidden suites.
  • Others argue static pass/fail coding tests miss what matters: integration into real workflows, long-context robustness, and agentic behavior.

Alternatives and New Directions

  • Mentioned alternatives include SWE-bench Pro and follow-ons, ARC-AGI, game-based tests (Zork, StarCraft/Go), agent benchmarks, and third-party suites (e.g., coding/problem-solving evals).
  • Some report stagnation in real-world coding quality despite large benchmark gains, suggesting overfitting to tests rather than genuine capability growth.