Amateur armed with ChatGPT solves an Erdős problem
Model capabilities and tiers
- Multiple commenters note that the free ChatGPT tier (gpt‑5.4‑mini) feels heavily constrained and more hallucination‑prone, while paid “thinking” models (e.g., 5.5 Pro) can spend 20–80 minutes on a response and are qualitatively different products.
- Longer “thinking” is described as expensive inference-time compute, hence gated behind higher‑priced plans and API usage.
- Some report that Gemini reaches ~90–95% of ChatGPT Pro’s solution quality with far fewer tokens and much less thinking time.
Nature of the proof and verification
- The raw proof from the model was described (in the article and thread) as messy and hard to parse; experts were needed to extract and shorten the core idea.
- Several participants stress that formal verification (e.g., in Lean) is much harder than writing an informal proof, and that non‑experts cannot reliably check either the English proof or its formalization.
- Others point out that human papers also require significant expert time to verify and often have opaque notation and gaps.
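A toy illustration of the informal-vs-formal gap discussed above (the theorem and tactic names are illustrative, assuming Lean 4 with its core `Nat` library): even a fact a human would state in one breath, like "adding zero on the left changes nothing," needs an explicit induction once every step must be machine-checked.

```lean
-- "Obviously" 0 + n = n, but Lean 4's core addition recurses on the
-- right argument, so the left-zero case still requires induction.
theorem zero_add' (n : Nat) : 0 + n = n := by
  induction n with
  | zero => rfl                          -- base case: 0 + 0 = 0
  | succ k ih => rw [Nat.add_succ, ih]   -- step: 0 + (k+1) = (0 + k) + 1
```

Scaling this kind of bookkeeping from a two-line lemma to a full research proof is why formalization is described as much harder than writing the informal argument.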
Intelligence, creativity, and “just text prediction”
- One camp argues that solving a previously open Erdős problem with a novel technique is strong evidence of real intelligence and creativity, even if produced by a statistical next‑token model.
- Another camp insists these models remain “just text generators,” comparing them to calculators or automated theorem provers performing large-scale search, and arguing that definitions of intelligence keep shifting to fit each new result.
- There is debate over whether applying a known formula in a new context counts as creativity; many argue that cross‑domain recombination is precisely what a lot of human “creative” work is.
Brute force, reasoning, and prompts
- Some attribute success to a kind of powerful “brute force educated guessing” over a huge learned corpus; others reject the brute‑force characterization, emphasizing visible hypothesis‑driven reasoning.
- Prompt phrasing (e.g., “don’t search the internet,” “non‑trivial, creative and novel proofs”) is suspected to significantly shape the model’s search behavior; prompt sensitivity is widely acknowledged.
- Because outputs are stochastic, it is unclear whether earlier models could also have solved the problem given different prompts or repeated attempts.
Cost, access, and democratization
- Commenters worry about the token and energy cost of such long runs and whether only well‑funded actors will benefit as models scale.
- Others counter that going from needing a top specialist to needing a motivated amateur plus a $100–$200/month tool is already a major democratization, though global affordability remains contentious.
Impact on mathematics and tooling
- Many see LLMs as promising “weird collaborators” that can propose unconventional approaches and cross‑apply techniques between subfields.
- There is interest in math‑specific “harnesses” that combine LLMs with tools like Python, Sage, and Lean, and in systematically running new models against curated lists of unsolved “dry lab” problems.
- Some caution that prior “AI solved an Erdős problem” claims have later turned out to be rediscoveries of known results, so formal and community verification remains essential.