First Proof

Purpose and Setup

  • Paper releases 10 math problems that arose naturally in ongoing research; statements are public but solutions are known only to authors and kept encrypted for a short time.
  • Aim is to probe whether current AI systems can tackle genuinely novel research-level questions, not just retrieve or lightly adapt existing results.

Benchmark vs. Exploratory Exercise

  • Several commenters stress that the authors themselves say this is not a benchmark: the intent is an exploratory tool for “honest” researchers to see how models behave, ideally with full transcripts.
  • Critics argue that, despite the disclaimers, the project is effectively framed as a benchmark and is very weak by ML standards: only 10 questions, little methodological detail, no systematic model comparison, and prior art such as FrontierMath already exists.

Methodological Concerns and Cheating

  • Strong worry about the scenario where an “AI company hires mathematicians and calls the result AI,” and about humans solving the problems during the embargo. Responses:
    • The low stakes, very short timeline, diversity of questions, and the request for reproducible logs/prompts make large-scale cheating unlikely, though not impossible.
    • The exercise assumes good faith; adversarial misuse (e.g., cheating for PR) is declared out of scope.
  • Prior testing on commercial models (Gemini, GPT) means the big labs have already had early exposure; some say this breaks the “not in the training set” claim, while others see it as merely extending the window of exposure.

Nature and Interest of the Problems

  • Described as serious research-level questions, similar to lemmas or side-diversions in PhD work, not standard “Erdős puzzle” material and not “left to the reader” trivialities.
  • At least some look approachable (e.g., #7 is described as “almost elementary”), and some are already being attacked via Lean; only a subset fits existing formal libraries (see the sketch after this list).
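  A minimal Lean sketch of the library-coverage point (purely hypothetical, not one of the ten problems): a statement formalizes in a line when its objects already live in mathlib, whereas a problem built on bespoke research definitions would first require formalizing those definitions themselves.

      import Mathlib

      -- Hypothetical illustration, not from the paper: both the statement and
      -- its proof are available off the shelf, because `Nat.Prime` and
      -- `Nat.exists_infinite_primes` are already in mathlib.
      example (n : ℕ) : ∃ p, n ≤ p ∧ Nat.Prime p :=
        Nat.exists_infinite_primes n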

Human–AI Collaboration vs. Autonomy

  • One strand emphasizes AI as a large-scale “association engine” in a centaur/human+AI mode, arguing that testing fully autonomous proofs misses the real value.
  • Others counter that centaur advantages are domain-specific (e.g., chess centaurs became obsolete; finance/architecture may differ) and that current LLMs remain unreliable, “gambling-like” tools.

Expectations and Reactions

  • Mixed predictions: some expect LLMs to crack 2–3 of the problems (with at least one easy proof and one interestingly different one) but expect humans to solve more overall; others report repeated LLM failure on genuinely new problems.
  • Several commenters see value in more tightly time-controlled, contamination-conscious challenges like this, even if this particular effort is viewed by some as “sloppy” or “social-experiment-like” rather than rigorous science.