Amateur armed with ChatGPT solves an Erdős problem

Model capabilities and tiers

  • Multiple commenters note that the free ChatGPT tier (gpt‑5.4‑mini) feels heavily constrained and more hallucination‑prone, while paid “thinking” models (e.g., 5.5 Pro) can spend 20–80 minutes on a response and are qualitatively different products.
  • Longer “thinking” is attributed to expensive inference‑time compute, which is why it is gated behind higher‑priced subscription plans and metered API usage.
  • Some report that Gemini reaches ~90–95% of ChatGPT Pro’s solution quality with far fewer tokens and much less thinking time.

Nature of the proof and verification

  • The raw proof from the model was described (in the article and thread) as messy and hard to parse; experts were needed to extract and shorten the core idea.
  • Several participants stress that formal verification (e.g., in Lean) is much harder than writing an informal proof, and that non‑experts cannot reliably check either the English proof or its formalization.
  • Others point out that human papers also require significant expert time to verify and often have opaque notation and gaps.

Intelligence, creativity, and “just text prediction”

  • One camp argues that solving a previously open Erdős problem with a novel technique is strong evidence of real intelligence and creativity, even if produced by a statistical next‑token model.
  • Another camp insists these models remain “just text generators,” comparing them to calculators or automated theorem provers doing large‑scale search, and accusing the first camp of repeatedly shifting the definition of intelligence.
  • There is debate over whether applying a known formula in a new context counts as creativity; many argue that cross‑domain recombination is precisely what a lot of human “creative” work is.

Brute force, reasoning, and prompts

  • Some attribute success to a kind of powerful “brute force educated guessing” over a huge learned corpus; others reject the brute‑force characterization, emphasizing visible hypothesis‑driven reasoning.
  • Prompt phrasing (e.g., “don’t search the internet,” “non‑trivial, creative and novel proofs”) is suspected of significantly shaping the model’s search behavior; prompt sensitivity is widely acknowledged.
  • Because outputs are stochastic, it is unclear whether earlier models could also have solved the problem and simply failed to under the prompts actually used.
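The stochasticity behind that last point comes from sampling the next token from a temperature‑scaled softmax distribution. A minimal sketch with toy logits (the numbers and token count are made up, not from any real model):

```python
import math
import random

def sample_next_token(logits, temperature, rng):
    # Softmax over temperature-scaled logits, then draw one token index.
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]

rng = random.Random(0)
toy_logits = [2.0, 1.0, 0.5]  # illustrative scores for three tokens

# Low temperature: the distribution sharpens toward the argmax token.
cold = {sample_next_token(toy_logits, 0.05, rng) for _ in range(100)}

# High temperature: the distribution flattens, so other tokens appear too.
hot = {sample_next_token(toy_logits, 5.0, rng) for _ in range(100)}
```

At low temperature the same prompt yields nearly identical continuations; at high temperature the same prompt can wander down different proof attempts, which is why a single failed run says little about what a model could have produced.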

Cost, access, and democratization

  • Commenters worry about the token and energy cost of such long runs and whether only well‑funded actors will benefit as models scale.
  • Others counter that going from needing a top specialist to needing a motivated amateur plus a $100–$200/month tool is already a major democratization, though global affordability remains contentious.

Impact on mathematics and tooling

  • Many see LLMs as promising “weird collaborators” that can propose unconventional approaches and cross‑apply techniques between subfields.
  • There is interest in math‑specific “harnesses” that combine LLMs with tools like Python, Sage, and Lean, and in systematically running new models against curated lists of unsolved “dry lab” problems.
  • Some caution that prior “AI solved an Erdős problem” claims have later reduced to rediscoveries, so formal and community verification remains essential.
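The “harness” idea above amounts to a propose‑and‑verify loop: the model suggests candidates, and an independent checker filters them before any claim is made. A minimal sketch, with the model stubbed out and all function names (`brute_force`, `propose_candidates`, `accept_candidate`) purely illustrative; a real harness would call a model API and a CAS or proof assistant such as Sage or Lean instead:

```python
def brute_force(n):
    # Ground truth by direct computation: sum of the first n odd numbers.
    return sum(2 * k + 1 for k in range(n))

def propose_candidates():
    # Stand-in for model output: candidate closed-form formulas.
    return [
        ("2n", lambda n: 2 * n),
        ("n^2", lambda n: n * n),
        ("n(n+1)/2", lambda n: n * (n + 1) // 2),
    ]

def accept_candidate(formula, trials=50):
    # Accept only if the formula matches brute force on every test case.
    return all(formula(n) == brute_force(n) for n in range(trials))

def run_harness():
    # Return the names of all candidates that survive verification.
    return [name for name, f in propose_candidates() if accept_candidate(f)]

surviving = run_harness()
```

The point of the design is that the model never gets to grade its own work: only candidates that pass an external check (here `n^2`, since the sum of the first n odd numbers is n²) survive, which is the same discipline the thread urges for proof claims.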