“Erdős problem #728 was solved more or less autonomously by AI”
Scope and nature of the result
- Erdős problem #728 was solved via a pipeline combining:
  - An informal proof sketched in English.
  - Translation and refinement into Lean by Aristotle (a theorem-proving system).
  - Back-and-forth with a frontier LLM (ChatGPT 5.2) and automated proof search.
- The Lean statement of the theorem was written by humans; the long formal proof (≈1400 lines) was machine-generated and checked by Lean’s small, trusted kernel.
- Autonomy is debated: some see this as “90/10” AI/human on the proof itself; others stress the human role in posing the problem, checking the statement, iterating prompts, and interpreting gaps.
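To make the division of labor concrete, here is a toy Lean 4 sketch (illustrative only, not the actual #728 statement, which is far more involved). The human-authored part is the `theorem` line; the proof body after `by` is the kind of thing a system like Aristotle generates and Lean's kernel then checks.

```lean
-- Toy illustration: the statement is human-written;
-- the proof script is what the machine searches for.
theorem sum_sq_nonneg (a b : ℝ) : 0 ≤ a ^ 2 + b ^ 2 := by
  positivity
```

The trust story rests on the statement being right: the kernel verifies the proof term, but it cannot know whether the statement encodes the problem the humans meant.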
What Aristotle / the AI stack actually is
- Aristotle integrates:
  - A Lean proof search system guided by a large transformer (policy/value model).
  - An informal reasoning component that generates and formalizes lemmas.
  - A geometry solver inspired by AlphaGeometry.
- Thread participants argue over terminology:
  - One side: this is “just” LLMs plus tools (all major stages use large language models).
  - Other side: calling it “an LLM” is misleading; it’s a task-specific neuro‑symbolic system trained on formal math, with custom losses and search, more like AlphaFold than a chat model.
- Consensus: whatever the label, the key advance is tight coupling of powerful models with formal verification.
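The “proof search guided by a policy/value model” idea can be sketched abstractly. Everything below is illustrative, not Aristotle’s actual design: a best-first search over proof states, where a learned policy proposes candidate tactics and a value model decides which open states to expand first.

```python
import heapq

def best_first_proof_search(initial_state, policy, value, apply_tactic,
                            is_proved, budget=1000):
    """Illustrative best-first search over proof states (hypothetical API).

    policy(state)             -> list of candidate tactics (from a learned model)
    value(state)              -> float, estimated provability (higher = better)
    apply_tactic(state, t)    -> successor state, or None if the tactic fails
    is_proved(state)          -> True when no goals remain
    """
    # Max-heap via negated scores; a counter breaks ties so heapq
    # never needs to compare states directly.
    frontier = [(-value(initial_state), 0, initial_state)]
    seen = {initial_state}
    counter = 1
    while frontier and counter < budget:
        _, _, state = heapq.heappop(frontier)
        if is_proved(state):
            return state
        for tactic in policy(state):
            nxt = apply_tactic(state, tactic)
            if nxt is not None and nxt not in seen:
                seen.add(nxt)
                heapq.heappush(frontier, (-value(nxt), counter, nxt))
                counter += 1
    return None  # budget exhausted or search space closed
```

The design point this is meant to show: the formal checker makes failed branches cheap and safe, so the learned components only need to be good at *proposing*, not at being right.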
Significance for mathematics and theorem proving
- Seen as a clear capability jump in:
  - Rapid refactoring and rewriting of arguments.
  - Turning reasonably correct informal proofs into fully formal Lean proofs.
  - Systematic reuse and “remixing” of existing methods at scale.
- Some compare this to earlier hammer-style tools such as Isabelle’s Sledgehammer and Coq’s CoqHammer, but note the new systems are far more powerful and general.
- Many expect accelerating formalization of math (mathlib, Erdős problems, FLT, IMO problems) and eventual large‑scale automated checking of the literature.
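For context on the hammer comparison: automation for routine goals already exists inside Lean itself, via tactics like `simp` and `omega`. A toy sketch of the kind of goal that needs no learned model at all:

```lean
-- Routine goals that existing Lean automation dispatches directly:
example (n : ℕ) : n + 0 = n := by simp
example (a b : ℤ) (h : a + 2 * b = 5) (h2 : b = 1) : a = 3 := by omega
```

The claimed advance is in the gap between goals like these and ≈1400-line research-level proofs.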
Verification, formalization, and trust
- Lean eliminates many LLM failure modes: if the formal statement captures the intended claim and Lean’s kernel accepts the proof, the theorem is proved.
- The remaining hard part is formalizing the intended statement:
  - Natural-language problems can be subtly misencoded.
  - Participants emphasize human review of the Lean statement and definitions, especially in tricky areas (analysis, probability, topology).
- Some point out that this “specification gap” already exists in human proofs and software verification.
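A toy Lean illustration of how the specification gap bites (illustrative only): an accidentally impossible hypothesis makes any conclusion provable. The kernel happily verifies the proof, but the statement asserts nothing, because no natural number is below zero.

```lean
-- Vacuously "provable": the hypothesis h can never hold,
-- so the theorem is accepted yet says nothing.
theorem vacuous (n : ℕ) (h : n < 0) : n = 42 := by omega
```

This is why participants stress human review of statements and definitions, not just of proofs.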
AGI vs “clever tools”
- Several commenters insist this is still narrow AI:
  - Works in a domain with automatic checking and strong structure.
  - Requires expert guidance and extensive scaffolding.
- Others see it as an early herald of AGI or “artificial general cleverness”:
  - The same models that do code, language, and now frontier math may generalize further.
- Fears and hopes about knowledge-work automation and career impacts are expressed, but no consensus emerges.
Practitioner experiences and limits
- Researchers report:
  - Strong value for literature search, routine math, and explaining known tools.
  - Much weaker performance on bleeding-edge questions in fields like quantum computing or specialized algorithms.
- AI is often more like a fast, fallible collaborator or “rubber duck” than an independent discoverer.
- A recurring theme: AI excels at producing the 2nd, 3rd, 10th version of an idea once a human has a first draft or hunch.