Epoch confirms GPT-5.4 Pro solved an open frontier math problem

What the models did and how

  • Multiple frontier models (GPT-5.4 variants, Gemini 3.1 Pro, Opus 4.6) solved the same hypergraph Ramsey open problem once a “scaffold” was built.
  • “Scaffold” here means a harness of agents, tools, prompts, auto-critique, and search over many attempts, rather than a single raw prompt-and-reply.
  • The specific solved problem is categorized by the project as “moderately interesting” among open problems, with an expert-estimated difficulty of 1–3 months of work.

How impressive is this result?

  • Enthusiastic commenters see it as a qualitative step beyond contest math, showing current models can contribute to open research problems and will likely unlock many more results.
  • Skeptics note that the problem is niche, that it has attracted few human researchers, and that the solution resembles a relatively simple combinatorial construction implemented in Python; they compare it to building yet another small program.
  • Some argue that because similar techniques exist in training data, this is more “automation of a known style of proof” than a dramatic conceptual breakthrough.

Novelty vs “remixing”

  • One camp insists LLMs merely remix training data, can’t originate truly new ideas, and that all apparent novelty is recombination or reflection of hidden training examples.
  • Others counter that most human research is also recombination of existing ideas, that this still yields genuinely new results, and that demanding “non-remix” novelty would exclude almost all human work too.
  • Debate emerges over whether “everything is a remix” already, and whether “novel” is being turned into an unfalsifiable standard.

Brute force vs intelligence

  • Some claim models just brute-force search (“try every solution until one works”).
  • Others point out that exhaustive search is infeasible here per the writeup, and that humans also rely heavily on trial-and-error plus heuristics; the distinction is blurry.
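To make "exhaustive search is infeasible" concrete: a Ramsey-type question about 3-uniform hypergraphs concerns 2-colorings of the C(n, 3) triples on n vertices, so the naive search space has 2^C(n,3) colorings. This is standard combinatorics, not specific to the problem in the article:

```python
from math import comb

def coloring_count(n: int) -> int:
    """Number of 2-colorings of all 3-element subsets of an n-vertex set,
    i.e. the naive search space for a 3-uniform hypergraph Ramsey question."""
    return 2 ** comb(n, 3)

# n = 6:  C(6,3)  = 20  triples -> 2**20  (~10**6 colorings)
# n = 10: C(10,3) = 120 triples -> 2**120 (~10**36 colorings)
# n = 15: C(15,3) = 455 triples -> 2**455 (~10**137 colorings)
for n in (6, 10, 15):
    print(n, comb(n, 3), coloring_count(n))
```

Even at n = 15 the space dwarfs anything enumerable, so any successful search must rely on structure and heuristics rather than raw enumeration.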

Reliability, limitations, and goalposts

  • Many highlight that the same systems still fail at basic tasks (multi-digit arithmetic, counting letters in “strawberry”, navigating codebases), questioning whether solving isolated math problems proves broad intelligence.
  • There are concerns about unverifiable training data, potential seeding of solutions, hype, and lack of transparency in RL/benchmark design.
  • Several note persistent “goalpost moving”: from “LLMs can’t do open problems” to “they can, but only if humans pose/evaluate them” to “they can’t pose interesting new problems themselves.”

Broader implications

  • Optimists expect major acceleration in the “yeoman’s work” of math: grinding out bounds, converting conjectures to theorems, and assisting formal proof systems.
  • Others stress that many real-world and political problems lack clean value functions or cheap verification, so math success doesn’t automatically translate to societal problem-solving or AGI.
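As a concrete picture of "assisting formal proof systems": in a proof assistant such as Lean, every step is checked mechanically, so a model's drafted proof can be verified rather than trusted. A toy Lean 4 example (unrelated to the Ramsey result, included only to show what machine-checked output looks like):

```lean
-- Toy Lean 4 theorem: n + 0 = n for any natural number n.
-- `rfl` closes the goal because `n + 0` reduces definitionally to `n`.
-- A checker like this gives an LLM immediate, trustworthy feedback.
theorem add_zero_example (n : Nat) : n + 0 = n := by
  rfl
```

This cheap, reliable verification is exactly the kind of "clean value function" the final bullet notes is missing from most real-world and political problems.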