First Proof
Purpose and Setup
- The paper releases 10 math problems that arose naturally in the authors' ongoing research; the statements are public, but the solutions are known only to the authors and kept encrypted for a short embargo period.
- Aim is to probe whether current AI systems can tackle genuinely novel research-level questions, not just retrieve or lightly adapt existing results.
Benchmark vs. Exploratory Exercise
- Several commenters stress that the authors themselves say this is not a benchmark; the intent is an exploratory tool for “honest” researchers to see how models behave, ideally with full transcripts published.
- Critics argue that, despite disclaimers, the project is effectively framed as a benchmark and is very weak by ML standards: only 10 questions, little methodology detail, no systematic model comparison, prior art like FrontierMath already exists.
Methodological Concerns and Cheating
- A strong worry: “AI company hires mathematicians and calls the result AI,” and humans quietly solving problems during the embargo. Responses:
  - Low stakes, a very short timeline, the diversity of questions, and the request for reproducible logs/prompts make large-scale cheating unlikely, though not impossible.
  - The exercise assumes good faith; adversarial misuse (cheating for PR) is declared out of scope.
- Prior testing on commercial models (Gemini, GPT) means the big labs have had early exposure; some say this breaks the “not in the training set” claim, while others see it as only extending the time window.
Nature and Interest of the Problems
- Described as serious research-level questions, similar to lemmas or side-diversions in PhD work, not standard “Erdős puzzle” material and not “left to the reader” trivialities.
- At least some look approachable (e.g., #7 “almost elementary”), and some are already being attacked via Lean; only a subset fits existing formal libraries.
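For readers unfamiliar with what “attacking a problem via Lean” involves, here is a minimal, purely hypothetical illustration of stating and proving a trivial lemma against mathlib (none of the actual First Proof problems, whose statements may or may not fit existing formal libraries):

```lean
-- Hypothetical sketch: a toy lemma formalized in Lean 4 using mathlib.
-- Real research-level statements require far more library support than this.
import Mathlib.Tactic

theorem two_mul_eq_add (n : ℕ) : 2 * n = n + n := by
  ring
```

The point of the sketch is only that a formal attack requires both a precise statement and a proof checked by the kernel, which is why only the subset of problems matching mathlib's existing coverage is readily formalizable.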
Human–AI Collaboration vs. Autonomy
- One strand emphasizes AI as a large-scale “association engine” in a centaur/human+AI mode, arguing that testing fully autonomous proofs misses the real value.
- Others counter that centaur advantages are domain-specific (e.g., chess centaurs became obsolete; finance/architecture may differ) and that current LLMs remain unreliable, “gambling-like” tools.
Expectations and Reactions
- Mixed predictions: some expect LLMs to crack 2–3 problems (with at least one easy and one interestingly different proof) but humans to solve more overall; others report repeated LLM failure on truly new problems.
- Several commenters see value in more tightly time-controlled, contamination-conscious challenges like this, even if this particular effort is viewed by some as “sloppy” or “social-experiment-like” rather than rigorous science.