Epoch confirms GPT-5.4 Pro solved a frontier math open problem
What the models did and how
- Multiple frontier models (GPT-5.4 variants, Gemini 3.1 Pro, Opus 4.6) solved the same hypergraph Ramsey open problem once a “scaffold” was built.
- The “scaffold” is described as a harness of agents, tools, prompts, auto-critique, and search over many attempts, rather than a single raw prompt-and-reply; see the sketch after this list.
- The specific solved problem is categorized by the project as “moderately interesting” among open problems, with an expert-estimated difficulty of 1–3 months of work.
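A minimal sketch of what such a generate–critique–verify harness could look like, assuming hypothetical `propose`, `critique`, and `verify` callables; the writeup does not publish the scaffold's actual internals, so this is an illustration of the pattern, not the real system:

```python
# Purely illustrative sketch of a scaffold loop, not the actual harness.
# propose/critique/verify are hypothetical callables supplied by the caller.
from typing import Callable, Optional

def scaffold_search(
    propose: Callable[[str, Optional[str]], str],  # model call: (problem, feedback) -> candidate
    critique: Callable[[str, str], str],           # model call: (problem, candidate) -> review
    verify: Callable[[str, str], bool],            # mechanical check, e.g. run the construction
    problem: str,
    max_attempts: int = 1000,
) -> Optional[str]:
    """Sample candidates, feed the critique of each failure into the next
    attempt, and stop only when a candidate passes the external verifier."""
    feedback: Optional[str] = None
    for _ in range(max_attempts):
        candidate = propose(problem, feedback)
        if verify(problem, candidate):
            return candidate                        # verified solution found
        feedback = critique(problem, candidate)     # otherwise critique and retry
    return None                                     # budget exhausted, no verified solution
```

Much of the disagreement below is about how much of the credit belongs to this outer loop and the verifier versus any single model completion.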
How impressive is this result?
- Enthusiastic commenters see it as a qualitative step beyond contest math, showing that current models can contribute to open research problems and will likely unlock many more results.
- Skeptics note that the problem is niche, had attracted little prior human effort, and that the solution resembles a relatively simple combinatorial construction implemented in Python; they compare it to writing yet another small program (an illustrative example follows this list).
- Some argue that because similar techniques exist in training data, this is more “automation of a known style of proof” than a dramatic conceptual breakthrough.
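For scale, here is the kind of “small construction plus verification script” skeptics have in mind. This is the textbook coloring showing R(3,3) > 5, chosen purely as a toy illustration; it is not the solved problem or its construction:

```python
# Illustrative toy, not the actual solved problem: the classic K_5 edge
# 2-coloring with no monochromatic triangle, checked exhaustively.
from itertools import combinations

def color(i: int, j: int) -> str:
    """Color edge {i, j} by cyclic distance on Z_5: distance 1 -> red, distance 2 -> blue."""
    return "red" if (i - j) % 5 in (1, 4) else "blue"

def has_monochromatic_triangle(n: int = 5) -> bool:
    """Check every triangle in K_n for three same-colored edges."""
    for a, b, c in combinations(range(n), 3):
        if color(a, b) == color(b, c) == color(a, c):
            return True
    return False

assert not has_monochromatic_triangle()  # the construction works: R(3,3) > 5
```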
Novelty vs “remixing”
- One camp insists LLMs merely remix training data, can’t originate truly new ideas, and that all apparent novelty is recombination or reflection of hidden training examples.
- Others counter that most human research is also recombination of existing ideas, that this still yields genuinely new results, and that demanding “non-remix” novelty would exclude almost all human work too.
- Debate emerges over whether “everything is a remix” already, and whether “novel” is being turned into an unfalsifiable standard.
Brute force vs intelligence
- Some claim models just brute-force search (“try every solution until one works”).
- Others counter that, per the writeup, exhaustive search is infeasible at this scale (see the back-of-the-envelope sketch below), and that humans also rely heavily on trial-and-error plus heuristics, so the distinction is blurry.
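A back-of-the-envelope illustration of why naive exhaustion is off the table; the vertex and edge sizes here are assumptions chosen for illustration, not figures from the writeup:

```python
# Rough arithmetic: even small hypergraph Ramsey instances have search spaces
# far beyond exhaustive enumeration, so "try every solution" cannot be literal.
from math import comb

n, k = 20, 3                    # assumed: 20 vertices, 3-uniform hyperedges
num_edges = comb(n, k)          # C(20, 3) = 1140 hyperedges
num_colorings = 2 ** num_edges  # 2^1140 two-colorings, roughly 10^343
print(f"{num_edges} hyperedges -> 2^{num_edges} ≈ 10^{len(str(num_colorings)) - 1} colorings")
```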
Reliability, limitations, and goalposts
- Many highlight that the same systems still fail at basic tasks (multi-digit arithmetic, counting letters in “strawberry”, navigating codebases), questioning whether solving isolated math problems proves broad intelligence.
- There are concerns about unverifiable training data, potential seeding of solutions, hype, and lack of transparency in RL/benchmark design.
- Several note persistent “goalpost moving”: from “LLMs can’t do open problems” to “they can, but only if humans pose/evaluate them” to “they can’t pose interesting new problems themselves.”
Broader implications
- Optimists expect major acceleration in the “yeoman’s work” of math: grinding out bounds, converting conjectures into theorems, and assisting formal proof systems.
- Others stress that many real-world and political problems lack clean value functions or cheap verification, so math success doesn’t automatically translate to societal problem-solving or AGI.