2024-07-25

AI solves International Math Olympiad problems at silver medal level

System design & methodology

Problems were manually translated into Lean 4; the core system (AlphaProof) couples a pre‑trained language model with AlphaZero‑style reinforcement learning and Monte Carlo Tree Search over Lean proof states.
A separate geometry system (AlphaGeometry 2) combines an LLM with a symbolic deduction engine and solved the geometry problem in ~19 seconds.
During training and even during the contest, the system generated synthetic variants of problems and trained on proofs/disproofs of these, using the proof checker as a reward signal.
Formal proofs are verified by Lean, so end results are mechanically checkable and non‑hallucinated.

Performance vs IMO rules

System solved 4/6 problems for 28/42 points, described as “silver medal level”.
Humans get 2 × 4.5‑hour sessions; the system had up to ~3 days per problem and massive compute. One problem was solved “within minutes”.
Many commenters argue that on an apples‑to‑apples timing and tooling basis it would not medal; others say the core result is that these problems are now solvable at all by machines.

Role and difficulty of formalization

Input statements were hand‑formalized; answers (e.g., which numbers/sets satisfy conditions) were discovered by the system, then proved.
There is disagreement on whether formalization is “much easier” than solving; some say any trained student can do it quickly, others report specific problems (like P5) are quite hard to formalize.
An autoformalizer based on Gemini exists and was used to generate large training corpora, but is not yet reliable enough for high‑stakes input; there is no general way to automatically verify that a formalization matches the informal problem.

Search, reasoning, and “intelligence”

Many see this as another victory for search + learning (AlphaZero‑style) over huge spaces, not pure brute force.
Debate centers on whether this is “just” guided search vs genuine reasoning; some equate this with how human mathematical research feels (groping in the dark, many failed paths).
Combinatorics problems remained unsolved; several speculate these are harder for current methods, more akin to general reasoning tasks.

Implications for math and software

Strong optimism that such systems will become powerful mathematical assistants: checking arguments, exploring variations, automating tedious inequalities, and eventually tackling some open problems.
Similar techniques are seen as promising for verified software (concurrency protocols, compiler correctness, program synthesis from specs).

Concerns, limitations, and skepticism

Multiple commenters criticize marketing: “silver medal” framing ignores time/computational asymmetry and heavy manual formalization.
Worries about AI hype, media overclaiming (“move over mathematicians”), and DeepMind’s closed, non‑reproducible systems.
Energy and CO₂ cost is raised; some want transparent reporting of training and inference footprints.
Others counter that inefficiency and long runtimes are typical for first breakthroughs and will likely improve dramatically.

Related topics