First Proof

Purpose and Setup

  • Paper releases 10 math problems that arose naturally in ongoing research; statements are public but solutions are known only to authors and kept encrypted for a short time.
  • Aim is to probe whether current AI systems can tackle genuinely novel research-level questions, not just retrieve or lightly adapt existing results.

Benchmark vs. Exploratory Exercise

  • Several commenters stress that the authors themselves say this is not a benchmark: the intent is an exploratory tool for “honest” researchers to see how models behave, ideally with full transcripts.
  • Critics argue that, despite the disclaimers, the project is effectively framed as a benchmark and is very weak by ML standards: only 10 questions, little methodological detail, no systematic model comparison, and prior art such as FrontierMath already exists.

Methodological Concerns and Cheating

  • Strong worry about the scenario where an “AI company hires mathematicians and calls the result AI,” and about humans solving the problems during the embargo. Responses:
    • The low stakes, very short timeline, diversity of questions, and the request for reproducible logs/prompts make large-scale cheating unlikely, though not impossible.
    • The exercise assumes good faith; adversarial misuse (e.g., cheating for PR) is declared out of scope.
  • Prior testing on commercial models (Gemini, GPT) means the big labs have already had early exposure; some say this breaks the “not in the training set” claim, while others see it as merely extending the window of exposure.

Nature and Interest of the Problems

  • Described as serious research-level questions, similar to lemmas or side-diversions in PhD work, not standard “Erdős puzzle” material and not “left to the reader” trivialities.
  • At least some look approachable (e.g., #7 is described as “almost elementary”), and some are already being attacked via Lean; only a subset fits existing formal libraries (see the sketch after this list).
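  A minimal Lean sketch of the library-coverage point (purely hypothetical, not one of the ten problems): a statement formalizes in a line when its objects already live in mathlib, whereas a problem built on bespoke research definitions would first require formalizing those definitions themselves.

      import Mathlib

      -- Hypothetical illustration, not from the paper: both the statement and
      -- its proof are available off the shelf, because `Nat.Prime` and
      -- `Nat.exists_infinite_primes` are already in mathlib.
      example (n : ℕ) : ∃ p, n ≤ p ∧ Nat.Prime p :=
        Nat.exists_infinite_primes n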

Human–AI Collaboration vs. Autonomy

  • One strand emphasizes AI as a large-scale “association engine” in a centaur/human+AI mode, arguing that testing fully autonomous proofs misses the real value.
  • Others counter that centaur advantages are domain-specific (e.g., chess centaurs became obsolete; finance/architecture may differ) and that current LLMs remain unreliable, “gambling-like” tools.

Expectations and Reactions

  • Mixed predictions: some expect LLMs to crack 2–3 of the problems (with at least one easy proof and one interestingly different one) but expect humans to solve more overall; others report repeated LLM failure on genuinely new problems.
  • Several commenters see value in more tightly time-controlled, contamination-conscious challenges like this, even if this particular effort is viewed by some as “sloppy” or “social-experiment-like” rather than rigorous science.