Gemini with Deep Think achieves gold-medal standard at the IMO

Training focus and data

  • Several commenters note that Gemini used a curated corpus of math solutions and IMO-specific “hints,” reading this as targeted preparation rather than the work of a purely general model.
  • Others point out that humans also train on past IMO-style problems, and that similar problems (not the 2025 statements themselves) are inevitably in pretraining data.
  • There’s speculation, but no clarity, on whether this was a heavily specialized model that might trade off some general-purpose performance.

Compute, cost, and test conditions

  • Many want transparency on compute and dollar cost, both for training and for the 4.5‑hour inference run. In the absence of numbers, several assume it was very expensive.
  • Some discuss energy comparisons (GPUs vs human calories; a rough back-of-envelope sketch follows this list) and joke about “libraries of Congress of text” being generated internally.
  • There’s interest in whether this is the same Deep Think that will ship to users, or a more compute-heavy internal variant.
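
As a rough illustration of the energy comparison raised above, the back-of-envelope below uses assumed figures only (a ~100 W human metabolic rate, a ~700 W datacenter GPU, and an invented GPU count); Google has not disclosed hardware or power numbers.

```python
# Hypothetical back-of-envelope for the "GPUs vs human calories" comparison.
# Every number here is an assumption; none comes from Google.
HUMAN_METABOLIC_WATTS = 100   # approximate whole-body resting metabolic rate
GPU_WATTS = 700               # rough power draw of one modern datacenter GPU
ASSUMED_GPU_COUNT = 64        # purely illustrative; the real count is unknown
CONTEST_HOURS = 4.5           # IMO time limit for one paper

human_kwh = HUMAN_METABOLIC_WATTS * CONTEST_HOURS / 1000
gpu_kwh = GPU_WATTS * ASSUMED_GPU_COUNT * CONTEST_HOURS / 1000
print(f"human contestant: ~{human_kwh:.2f} kWh over one 4.5-hour paper")
print(f"{ASSUMED_GPU_COUNT} assumed GPUs:   ~{gpu_kwh:.0f} kWh over the same window")
```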

Official vs self‑graded results and coordination

  • A major thread contrasts Google’s officially graded IMO participation with OpenAI’s self-graded claim.
  • Commenters cite statements from IMO-affiliated people suggesting that Google coordinated in advance, respected an embargo, and received official certification, whereas OpenAI did not formally participate and announced its result as soon as the contest ended, against the organizers’ preference that labs wait longer.
  • Some see this as normal PR behavior; others criticize it as disrespectful to student contestants and bad scientific practice (self-selected methodology, no preregistration).

Evaluation methodology and what “gold” really means

  • A long reposted essay (by a prominent mathematician) argues that AI results depend hugely on format: time allowed, number of parallel tries, prompting help, and selective reporting.
  • The point: without a standardized, disclosed-in-advance protocol, AI gold scores aren’t directly comparable to human medals or to each other (the toy calculation below illustrates the parallel-tries effect).
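
One way to make the parallel-tries point concrete: the toy calculation below (an illustration, not any lab’s disclosed protocol) shows how reporting the best of n independent attempts inflates the apparent solve rate when a single attempt succeeds with probability p.

```python
# Toy model of the selective-reporting effect described in the essay.
# If one attempt solves a problem with probability p, then reporting the
# best of n independent attempts succeeds with probability 1 - (1 - p)^n.
# The value of p below is purely illustrative.

def best_of_n(p: float, n: int) -> float:
    """Probability that at least one of n independent attempts succeeds."""
    return 1 - (1 - p) ** n

if __name__ == "__main__":
    p = 0.2  # assumed single-attempt solve rate for a hard problem
    for n in (1, 8, 64, 512):
        print(f"best of {n:>3} attempts: apparent solve rate {best_of_n(p, n):.2f}")
```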

Tool use, Lean, and natural-language proofs

  • Google emphasizes “end-to-end natural language” with no external tools or internet access; commenters later find confirmation that no tool use was permitted during inference.
  • Many are surprised they didn’t use formal provers (e.g., Lean) at inference time; some think training likely exploited formal systems for reward signals, but this is unclear.
  • Strong debate:
    • One camp sees natural-language-only proofs as a major leap in “raw reasoning” and as important for generalizing beyond mathematics.
    • Another camp is disappointed by the step away from formal proofs, arguing that large, machine-scale mathematics will ultimately require formalization to be trustworthy and checkable (a minimal Lean example follows this list).
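
For readers unfamiliar with what “formal” means here, the snippet below is a minimal Lean 4 example, unrelated to any IMO problem: the kernel checks the proof term mechanically, so trusting the result does not require reading the argument.

```lean
-- Minimal machine-checkable Lean 4 example (not an IMO problem).
-- The kernel verifies the proof term `Nat.add_comm a b`, so correctness
-- does not depend on a human reading the argument.
theorem add_comm_example (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b
```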

Quality and style of the proofs

  • Side‑by‑side reading of the released solutions suggests:
    • Gemini’s proofs read as structured, readable prose but are very verbose, spelling out every small step.
    • OpenAI’s are more spartan and mechanical, sometimes harder to follow.
  • Several infer that the visible solutions are post‑hoc summaries, with huge hidden reasoning traces produced under “parallel thinking” / tree‑of‑thoughts‑style search.

Problem 6 and current limits

  • Both systems scored 35/42 by solving Problems 1–5 and failing Problem 6 (each of the six problems is worth 7 points, so five full solutions give 35 of a possible 42).
  • Gemini’s PDF omits Problem 6 entirely, confirming no successful solution; OpenAI’s work also did not solve it.
  • Commenters note that the hardest IMO problem often demands genuine originality, which fits the broader pattern of AI systems still struggling at the top end of difficulty.
  • There’s curiosity whether the models knew they were stuck (e.g., exhausted time without a coherent proof) or just failed to converge.

Broader significance and comparison to research math

  • Many see this as a “Deep Blue vs prodigy” moment for contest math: extremely impressive but still far from a true AI mathematician.
  • Mathematicians in the thread stress that contest problems and research mathematics are different skills; correlation exists for humans, but it’s unclear whether it will generalize for machines.
  • Some argue that for humans, the real value of a proof is explanatory insight, not just correctness. A thousand-page, formally correct but unintelligible machine proof would be of limited interest unless it introduces new, understandable ideas.

Fairness, meaning of the benchmark, and Tao-style skepticism

  • A recurring theme is that competitions are designed to differentiate humans under specific constraints; AI systems operate under very different regimes (parallelism, arbitrary compute, selective reporting).
  • Some argue this makes “gold medal” claims only loosely comparable to human achievement; success might not correlate with broader reasoning ability the way it does for students.
  • Others respond that, regardless of fairness, demonstrating that current models can produce rigorous solutions to this class of problems is a genuine capability milestone.

Parallel reasoning and search

  • The article’s mention of “parallel thinking” sparks discussion of tree-of-thoughts / multi-trajectory search as a key ingredient.
  • One commenter notes an open-source framework that does parallel reasoning orchestration and suggests similar ideas are already being explored publicly.
  • People are very curious about the actual search depth, branching factor, and selection mechanisms, but Google hasn’t disclosed them (a hedged sketch of the general idea follows).
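
In the absence of details, the sketch below shows one generic form such multi-trajectory search could take: sample several independent reasoning traces and keep the highest-scoring one. The sampler, scorer, and branching factor are invented placeholders, not anything Google has described.

```python
# Hypothetical sketch of "parallel thinking" as multi-trajectory sampling with
# selection, in the spirit of tree-of-thoughts-style search. sample_trajectory
# and the scorer stand in for undisclosed model and verifier components.
import random
from typing import Callable, List


def sample_trajectory(problem: str, rng: random.Random) -> str:
    """Placeholder for a model producing one complete reasoning trace."""
    return f"candidate solution {rng.randint(0, 999_999)} for: {problem}"


def parallel_think(problem: str, n_branches: int,
                   score: Callable[[str], float], seed: int = 0) -> str:
    """Sample n_branches independent traces and return the highest-scoring one."""
    rng = random.Random(seed)
    candidates: List[str] = [sample_trajectory(problem, rng)
                             for _ in range(n_branches)]
    return max(candidates, key=score)


if __name__ == "__main__":
    # Scoring by length is a stand-in for a real verifier or reward model.
    print(parallel_think("an IMO-style problem statement", n_branches=8, score=len))
```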

Contest design, future of Olympiads, and tools

  • A recent paper is cited showing that an off-the-shelf Gemini 2.5 Pro (with careful prompting and no contamination) can solve IMO 2025 Problems 1–5, suggesting the solution space is small enough for guided search.
  • Some suggest future IMO problem setters should test proposed problems against top LLMs and discard any that are solvable, much as exams evolved after calculators became widespread.
  • Others foresee a future where students are allowed to use AI tools, just as calculators are now permitted in many exams.

Impact on work, identity, and the “joy of math”

  • A philosophical thread worries that as AI outperforms humans in “being smart,” people whose identity rests on intelligence will struggle, and competitive societies may push everyone to subscribe to AI tools to remain employable.
  • A few lament that turning math into something a machine can “grind through” might drain some of its joy; others counter with analogies to chess, where engines changed but did not destroy human play (though some say high-level classical chess did lose some romance).
  • There’s also optimism that these systems will mainly automate tedious steps between ideas and verification, freeing human mathematicians to focus on conceptual advances.

Product quality and branding chatter

  • Multiple commenters joke about Google’s confusing naming (“advanced Gemini,” “Deep Think”) and riff on overused marketing labels (“pro/plus/ultra”).
  • Experiences with Gemini 2.5 Pro are mixed: some find it behind o3 or Claude in everyday use, citing verbosity and difficulty focusing; others report the opposite, calling Gemini 2.5 reliably strong and o3 “lazy” or unhelpfully terse.
  • A separate thread critiques OpenAI’s general approach as prioritizing hype and attention over transparency, drawing analogies to earlier heavily constrained game demos.