Amateur armed with ChatGPT solves an Erdős problem

Model capabilities and tiers

  • Multiple commenters note that the free ChatGPT tier (gpt‑5.4‑mini) feels heavily constrained and more hallucination‑prone, while paid “thinking” models (e.g., 5.5 Pro) can spend 20–80 minutes on a response and are qualitatively different products.
  • Longer “thinking” is attributed to expensive inference‑time compute, which is why it is gated behind higher‑priced subscription plans and metered API usage.
  • Some report that Gemini reaches ~90–95% of ChatGPT Pro’s solution quality with far fewer tokens and much less thinking time.

Nature of the proof and verification

  • The raw proof from the model was described (in the article and thread) as messy and hard to parse; experts were needed to extract and shorten the core idea.
  • Several participants stress that formal verification (e.g., in Lean) is much harder than writing an informal proof, and that non‑experts cannot reliably check either the English proof or its formalization.
  • Others point out that human papers also require significant expert time to verify and often have opaque notation and gaps.

Intelligence, creativity, and “just text prediction”

  • One camp argues that solving a previously open Erdős problem with a novel technique is strong evidence of real intelligence and creativity, even if produced by a statistical next‑token model.
  • Another camp insists these models remain “just text generators,” comparing them to calculators or automated theorem provers doing large‑scale search, and accusing the first camp of repeatedly shifting the definition of intelligence.
  • There is debate over whether applying a known formula in a new context counts as creativity; many argue that cross‑domain recombination is precisely what a lot of human “creative” work is.

Brute force, reasoning, and prompts

  • Some attribute success to a kind of powerful “brute force educated guessing” over a huge learned corpus; others reject the brute‑force characterization, emphasizing visible hypothesis‑driven reasoning.
  • Prompt phrasing (e.g., “don’t search the internet,” “non‑trivial, creative and novel proofs”) is suspected of significantly shaping the model’s search behavior; prompt sensitivity is widely acknowledged.
  • Because outputs are stochastic, it is unclear whether earlier models could also have solved the problem and simply failed to under the prompts actually used.
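The stochasticity behind that last point comes from sampling the next token from a temperature‑scaled softmax distribution. A minimal sketch with toy logits (the numbers and token count are made up, not from any real model):

```python
import math
import random

def sample_next_token(logits, temperature, rng):
    # Softmax over temperature-scaled logits, then draw one token index.
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]

rng = random.Random(0)
toy_logits = [2.0, 1.0, 0.5]  # illustrative scores for three tokens

# Low temperature: the distribution sharpens toward the argmax token.
cold = {sample_next_token(toy_logits, 0.05, rng) for _ in range(100)}

# High temperature: the distribution flattens, so other tokens appear too.
hot = {sample_next_token(toy_logits, 5.0, rng) for _ in range(100)}
```

At low temperature the same prompt yields nearly identical continuations; at high temperature the same prompt can wander down different proof attempts, which is why a single failed run says little about what a model could have produced.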

Cost, access, and democratization

  • Commenters worry about the token and energy cost of such long runs and whether only well‑funded actors will benefit as models scale.
  • Others counter that going from needing a top specialist to needing a motivated amateur plus a $100–$200/month tool is already a major democratization, though global affordability remains contentious.

Impact on mathematics and tooling

  • Many see LLMs as promising “weird collaborators” that can propose unconventional approaches and cross‑apply techniques between subfields.
  • There is interest in math‑specific “harnesses” that combine LLMs with tools like Python, Sage, and Lean, and in systematically running new models against curated lists of unsolved “dry lab” problems.
  • Some caution that prior “AI solved an Erdős problem” claims have later reduced to rediscoveries, so formal and community verification remains essential.
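The “harness” idea above amounts to a propose‑and‑verify loop: the model suggests candidates, and an independent checker filters them before any claim is made. A minimal sketch, with the model stubbed out and all function names (`brute_force`, `propose_candidates`, `accept_candidate`) purely illustrative; a real harness would call a model API and a CAS or proof assistant such as Sage or Lean instead:

```python
def brute_force(n):
    # Ground truth by direct computation: sum of the first n odd numbers.
    return sum(2 * k + 1 for k in range(n))

def propose_candidates():
    # Stand-in for model output: candidate closed-form formulas.
    return [
        ("2n", lambda n: 2 * n),
        ("n^2", lambda n: n * n),
        ("n(n+1)/2", lambda n: n * (n + 1) // 2),
    ]

def accept_candidate(formula, trials=50):
    # Accept only if the formula matches brute force on every test case.
    return all(formula(n) == brute_force(n) for n in range(trials))

def run_harness():
    # Return the names of all candidates that survive verification.
    return [name for name, f in propose_candidates() if accept_candidate(f)]

surviving = run_harness()
```

The point of the design is that the model never gets to grade its own work: only candidates that pass an external check (here `n^2`, since the sum of the first n odd numbers is n²) survive, which is the same discipline the thread urges for proof claims.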