Epoch confirms GPT-5.4 Pro solved an open frontier math problem

What the models did and how

  • Multiple frontier models (GPT-5.4 variants, Gemini 3.1 Pro, Opus 4.6) solved the same hypergraph Ramsey open problem once a “scaffold” was built.
  • “Scaffold” here means a harness of agents, tools, prompts, auto-critique, and search over many attempts, rather than a single raw prompt-and-reply.
  • The specific solved problem is categorized by the project as “moderately interesting” among open problems, with an expert-estimated difficulty of 1–3 months of work.

How impressive is this result?

  • Enthusiastic commenters see it as a qualitative step beyond contest math, showing current models can contribute to open research problems and will likely unlock many more results.
  • Skeptics note that the problem is niche, that it has attracted few human researchers, and that the solution resembles a relatively simple combinatorial construction implemented in Python; they compare it to building yet another small program.
  • Some argue that because similar techniques exist in training data, this is more “automation of a known style of proof” than a dramatic conceptual breakthrough.

Novelty vs “remixing”

  • One camp insists LLMs merely remix training data, can’t originate truly new ideas, and that all apparent novelty is recombination or reflection of hidden training examples.
  • Others counter that most human research is also recombination of existing ideas, that this still yields genuinely new results, and that demanding “non-remix” novelty would exclude almost all human work too.
  • Debate emerges over whether “everything is a remix” already, and whether “novel” is being turned into an unfalsifiable standard.

Brute force vs intelligence

  • Some claim models just brute-force search (“try every solution until one works”).
  • Others point out that exhaustive search is infeasible here per the writeup, and that humans also rely heavily on trial-and-error plus heuristics; the distinction is blurry.
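To make "exhaustive search is infeasible" concrete: a Ramsey-type question about 3-uniform hypergraphs concerns 2-colorings of the C(n, 3) triples on n vertices, so the naive search space has 2^C(n,3) colorings. This is standard combinatorics, not specific to the problem in the article:

```python
from math import comb

def coloring_count(n: int) -> int:
    """Number of 2-colorings of all 3-element subsets of an n-vertex set,
    i.e. the naive search space for a 3-uniform hypergraph Ramsey question."""
    return 2 ** comb(n, 3)

# n = 6:  C(6,3)  = 20  triples -> 2**20  (~10**6 colorings)
# n = 10: C(10,3) = 120 triples -> 2**120 (~10**36 colorings)
# n = 15: C(15,3) = 455 triples -> 2**455 (~10**137 colorings)
for n in (6, 10, 15):
    print(n, comb(n, 3), coloring_count(n))
```

Even at n = 15 the space dwarfs anything enumerable, so any successful search must rely on structure and heuristics rather than raw enumeration.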

Reliability, limitations, and goalposts

  • Many highlight that the same systems still fail at basic tasks (multi-digit arithmetic, counting letters in “strawberry”, navigating codebases), questioning whether solving isolated math problems proves broad intelligence.
  • There are concerns about unverifiable training data, potential seeding of solutions, hype, and lack of transparency in RL/benchmark design.
  • Several note persistent “goalpost moving”: from “LLMs can’t do open problems” to “they can, but only if humans pose/evaluate them” to “they can’t pose interesting new problems themselves.”

Broader implications

  • Optimists expect major acceleration in the “yeoman’s work” of math: grinding out bounds, converting conjectures to theorems, and assisting formal proof systems.
  • Others stress that many real-world and political problems lack clean value functions or cheap verification, so math success doesn’t automatically translate to societal problem-solving or AGI.
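As a concrete picture of "assisting formal proof systems": in a proof assistant such as Lean, every step is checked mechanically, so a model's drafted proof can be verified rather than trusted. A toy Lean 4 example (unrelated to the Ramsey result, included only to show what machine-checked output looks like):

```lean
-- Toy Lean 4 theorem: n + 0 = n for any natural number n.
-- `rfl` closes the goal because `n + 0` reduces definitionally to `n`.
-- A checker like this gives an LLM immediate, trustworthy feedback.
theorem add_zero_example (n : Nat) : n + 0 = n := by
  rfl
```

This cheap, reliable verification is exactly the kind of "clean value function" the final bullet notes is missing from most real-world and political problems.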