AI solves International Math Olympiad problems at silver medal level

System design & methodology

  • Problems were manually translated into Lean 4; the core system (AlphaProof) couples a pre‑trained language model with AlphaZero‑style reinforcement learning and Monte Carlo Tree Search over Lean proof states.
  • A separate geometry system (AlphaGeometry 2) combines an LLM with a symbolic deduction engine and solved the geometry problem in ~19 seconds.
  • During training and even during the contest, the system generated synthetic variants of problems and trained on proofs/disproofs of these, using the proof checker as a reward signal.
  • Formal proofs are verified by Lean, so end results are mechanically checkable and non‑hallucinated.

Performance vs IMO rules

  • System solved 4/6 problems for 28/42 points, described as “silver medal level”.
  • Humans get 2 × 4.5‑hour sessions; the system had up to ~3 days per problem and massive compute. One problem was solved “within minutes”.
  • Many commenters argue that on an apples‑to‑apples timing and tooling basis it would not medal; others say the core result is that these problems are now solvable at all by machines.

Role and difficulty of formalization

  • Input statements were hand‑formalized; answers (e.g., which numbers/sets satisfy conditions) were discovered by the system, then proved.
  • There is disagreement on whether formalization is “much easier” than solving; some say any trained student can do it quickly, others report specific problems (like P5) are quite hard to formalize.
  • An autoformalizer based on Gemini exists and was used to generate large training corpora, but is not yet reliable enough for high‑stakes input; there is no general way to automatically verify that a formalization matches the informal problem.

Search, reasoning, and “intelligence”

  • Many see this as another victory for search + learning (AlphaZero‑style) over huge spaces, not pure brute force.
  • Debate centers on whether this is “just” guided search vs genuine reasoning; some equate this with how human mathematical research feels (groping in the dark, many failed paths).
  • Combinatorics problems remained unsolved; several speculate these are harder for current methods, more akin to general reasoning tasks.

Implications for math and software

  • Strong optimism that such systems will become powerful mathematical assistants: checking arguments, exploring variations, automating tedious inequalities, and eventually tackling some open problems.
  • Similar techniques are seen as promising for verified software (concurrency protocols, compiler correctness, program synthesis from specs).

Concerns, limitations, and skepticism

  • Multiple commenters criticize marketing: “silver medal” framing ignores time/computational asymmetry and heavy manual formalization.
  • Worries about AI hype, media overclaiming (“move over mathematicians”), and DeepMind’s closed, non‑reproducible systems.
  • Energy and CO₂ cost is raised; some want transparent reporting of training and inference footprints.
  • Others counter that inefficiency and long runtimes are typical for first breakthroughs and will likely improve dramatically.