OpenAI claims gold-medal performance at IMO 2025
Nature of the achievement
- Thread centers on OpenAI’s claim that an experimental model achieved a gold‑medal–level score on IMO 2025 by solving Problems 1–5 in natural language within contest time limits.
- Many see this as a major capability jump relative to earlier public results, where top models scored well below bronze on the same problem set.
Reasoning vs “smart retrieval”
- One camp argues this is still “just” sophisticated pattern matching over Internet-scale data, not genuine reasoning.
- Others counter that, even if it were only “smart retrieval,” most real-world expert work (medicine, law, finance, software) is already largely protocol/pattern application, so the societal impact is still huge.
- Several note that whether this counts as “real reasoning” is more a philosophical than practical question.
Difficulty and meaning of IMO performance
- Multiple commenters push back on claims that these are “just high-school problems,” stressing that IMO problems are extraordinarily hard and unlike routine coursework; even many professional mathematicians without olympiad backgrounds struggle with them.
- Some warn against goalpost-moving: IMO was widely cited as a “too hard for LLMs” benchmark; now that it’s reached, people downplay its significance.
Methodology, transparency, and trust
- The biggest concern is opacity: no full methodology, compute budget, or ablation details have been published; the model is unreleased and described only via tweets.
- Questions raised:
  - Was the model specialized or heavily fine‑tuned for olympiad math (vs a general model)?
  - Was there any data leakage (training on 2025 problems/solutions or near-duplicates)?
  - How much test-time compute was used, how many parallel samples were drawn, and who did the “cherry-picking” (humans vs the model)? (A best-of-n sketch of this kind of pipeline follows the list.)
- Prior controversies around benchmarks and undisclosed conflicts of interest fuel skepticism about taking OpenAI’s claims at face value, even among people impressed by the raw proofs.
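To make the parallel-sampling question concrete, here is a minimal best-of-n sketch in Python. It is purely illustrative: `sample_proof`, `score_proof`, and the candidate count are hypothetical stand-ins, not anything OpenAI has disclosed about its actual setup.

```python
# Hypothetical sketch of a "many parallel samples, then select" pipeline.
# Nothing here reflects OpenAI's actual method; sample_proof and score_proof
# are placeholders for an undisclosed model and an undisclosed selector.
import random


def sample_proof(problem: str, seed: int) -> str:
    """One independent rollout of the model at some test-time compute budget."""
    return f"candidate proof #{seed} for: {problem}"


def score_proof(proof: str) -> float:
    """Placeholder for the selection step: a verifier model, majority voting,
    or a human grader. Who fills this role is exactly the open question."""
    return random.random()


def best_of_n(problem: str, n: int = 32) -> str:
    """Generate n candidate proofs and keep the highest-scoring one."""
    candidates = [sample_proof(problem, seed=i) for i in range(n)]
    return max(candidates, key=score_proof)


if __name__ == "__main__":
    print(best_of_n("IMO 2025 Problem 1", n=8))
```

The more compute that goes into the number of samples and into the selector, the less the final score reflects a single unaided pass, which is why commenters want those numbers disclosed.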
Predictions, probabilities, and goalposts
- Discussion recalls earlier public bets that an IMO gold by 2025 was relatively unlikely; probabilities in the single digits or low tens of percent are debated.
- Long subthread on the “rationalist” habit of assigning precise percentages to future events, on calibration and Brier scores, and on whether such numerical forecasts are meaningful or misleading when we only ever see one timeline. (A minimal scoring sketch follows this list.)
- Many note a pattern: each time AI clears a bar once thought far off, commentary shifts to why that bar was “actually not that meaningful.”
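For readers unfamiliar with the calibration vocabulary, here is a minimal Brier-score sketch in Python; the forecasts and outcomes below are made-up examples, not numbers taken from the thread.

```python
# Brier score: mean squared error between forecast probabilities and 0/1 outcomes.
# Lower is better; always forecasting 50% scores 0.25. All numbers below are hypothetical.

def brier_score(forecasts: list[float], outcomes: list[int]) -> float:
    """Average of (forecast - outcome)^2 over a set of resolved predictions."""
    assert len(forecasts) == len(outcomes)
    return sum((p - o) ** 2 for p, o in zip(forecasts, outcomes)) / len(forecasts)


# A hypothetical forecaster who gave 8% to "IMO gold by 2025" (which then happened),
# alongside three other made-up predictions.
forecasts = [0.08, 0.70, 0.30, 0.90]
outcomes = [1, 1, 0, 1]
print(round(brier_score(forecasts, outcomes), 3))  # 0.259
```

The “one timeline” objection in the thread amounts to noting that a single resolved event gives a very noisy read on calibration; scoring rules like this only become informative over many predictions.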
Broader impact and limits of current LLMs
- Some highlight that public models still fail on much simpler math and coding tasks; gold at IMO does not mean robust everyday reliability.
- Others see this as evidence that we’re still on a steep improvement curve, especially in test‑time “thinking” and RL-based reasoning, with potential for serious contribution to scientific discovery.
- A sizable group worries more about how such capabilities will be weaponized (economic disruption, surveillance, military uses) than about the technical feat itself.