Quantitative AI progress needs accurate and transparent evaluation

Benchmarking, contamination, and Goodhart’s Law

  • Many see public benchmarks as indispensable yet “toxic” once used for marketing and leaderboard clout.
  • Widespread web scraping means almost any public or semi-public test likely ends up in training data and contaminates results; this extends to synthetic-benchmark “tricks” distilled from larger models.
  • Several comments frame this as Goodhart’s Law: once a metric becomes a target, the problem shifts from pure measurement to an adversarial game with recursive dynamics.

Public vs private evals; “write your own tests”

  • Some argue the only trustworthy tests are privately created benchmarks that are never published, and then only for open models run locally; any test ever sent to a closed, hosted model should be treated as “burned,” since the prompts can be logged and absorbed into future training.
  • Others counter that private tests are also biased; ultimately all tests—public or private—are fallible and partly belief-driven.
  • Despite these issues, many still prefer benchmarks to “vibes,” and simply ignore PR claims built on tiny deltas over obscure benchmarks.

Costs, compute, and math achievements

  • Tao’s emphasis on reporting success rates and per-trial cost resonates; selectively reporting only the successful runs badly misrepresents the true cost (see the sketch after this list).
  • Commenters note recent IMO-style math claims: without transparent compute budgets and error rates, “gold medal” headlines are misleading.
  • Some stress differences in evaluation rigor (third-party judging vs self-judging) and liken overfitted “specialized models” to F1 cars winning kids’ races.
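
A concrete way to see why success-only reporting misleads: if trials are independent, the expected spend per solved problem is the per-trial cost divided by the success rate, so a cheap-looking trial with a low hit rate is expensive per success. A minimal sketch with made-up numbers (nothing here comes from the discussion):

```python
def cost_per_success(per_trial_cost: float, success_rate: float) -> float:
    """Expected spend to obtain one success when trials are independent."""
    if not 0 < success_rate <= 1:
        raise ValueError("success_rate must be in (0, 1]")
    return per_trial_cost / success_rate

# Hypothetical numbers: a $2 trial that succeeds 5% of the time costs $40 per
# solved problem in expectation, even though a success-only report would show
# just the single $2 winning run.
print(cost_per_success(per_trial_cost=2.00, success_rate=0.05))  # 40.0
```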

Training data overlap, originality, and gaming

  • Several argue compute is less central than curation: ensuring the training data contains near-duplicates of the test problems is the easiest “path to success” (a simple overlap check is sketched after this list).
  • FrontierMath and similar incidents are cited as evidence that access to or proximity to eval data can distort results.
  • Debate arises over how far location-based tasks (e.g., GeoGuessr) are solved by memorized Street View imagery versus genuine generalization; the claims conflict and remain unresolved.
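
One low-tech way to probe the near-duplicate concern is an n-gram overlap score between a benchmark item and candidate training documents. This is only an illustrative sketch; the n-gram size and the threshold are arbitrary choices, not anything proposed in the thread:

```python
import re

def ngrams(text: str, n: int = 8) -> set[tuple[str, ...]]:
    """Lower-cased word n-grams, a crude fingerprint of the text."""
    words = re.findall(r"\w+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def overlap(test_item: str, training_doc: str, n: int = 8) -> float:
    """Fraction of the test item's n-grams that also appear in the training doc."""
    test_grams = ngrams(test_item, n)
    if not test_grams:
        return 0.0
    return len(test_grams & ngrams(training_doc, n)) / len(test_grams)

# Flag anything above an (arbitrary) threshold for manual review.
if overlap(test_item="Prove that the sum of two even integers is even.",
           training_doc="Exercise: prove that the sum of two even integers is even.") > 0.5:
    print("possible contamination: near-duplicate of a test problem")
```

Exact n-gram matching misses paraphrases, which is part of why contamination is so hard to rule out once a test is public.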

Alternative evaluation ideas and reforms

  • Suggestions include (two short sketches follow this list):
    • Task-specific, user-owned evals and simple tooling to build them.
    • Reporting cost–performance tradeoffs (e.g., ARC-AGI-style score vs price plots, human baselines).
    • Data-compression–style tests as a proxy for “understanding rules” vs mere extrapolation.
    • Pre-registered evals, analogous to pre-registered studies, to reduce post-hoc cherry-picking.
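
To unpack the compression idea: “understanding the rule” behind some data and “compressing that data well” are closely related, so compressibility can serve as a crude proxy. A toy illustration using gzip as a stand-in for any compressor (not a concrete proposal from the thread):

```python
import gzip
import random

def compression_ratio(data: bytes) -> float:
    """Compressed size over original size; lower means more structure was found."""
    return len(gzip.compress(data)) / len(data)

rule_based = bytes((i * 7) % 256 for i in range(10_000))            # generated by a short rule
random_bytes = bytes(random.randrange(256) for _ in range(10_000))  # no rule to discover

print(f"rule-based data: {compression_ratio(rule_based):.3f}")      # far below 1.0
print(f"random data:     {compression_ratio(random_bytes):.3f}")    # near (or above) 1.0
```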
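
A user-owned, pre-registered eval can be as simple as a local file of prompt/answer pairs whose hash is recorded before any model is queried; the digest commits you to the test set, and the log reports cost alongside accuracy. A minimal sketch, where `ask_model(prompt)` is a hypothetical stand-in returning an answer and its dollar cost:

```python
import hashlib
import json

def preregister(eval_path: str) -> str:
    """Hash the eval file so the test set is committed to before any model runs."""
    with open(eval_path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

def run_eval(eval_path: str, ask_model) -> dict:
    """Score a private eval, reporting success rate alongside per-item cost."""
    digest = preregister(eval_path)  # record/publish this digest before running
    with open(eval_path) as f:
        items = json.load(f)         # expected shape: [{"prompt": ..., "answer": ...}, ...]
    successes, total_cost = 0, 0.0
    for item in items:
        reply, cost = ask_model(item["prompt"])  # hypothetical model call, not a real API
        successes += int(reply.strip() == item["answer"])
        total_cost += cost
    return {
        "digest": digest,
        "success_rate": successes / len(items),
        "cost_per_item": total_cost / len(items),
    }
```

Keeping the eval file private and publishing only the digest lets results be checked later without burning the test set.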

Ethics, social impact, and discourse quality

  • Some criticize purely “technical” discussion as ignoring environmental and social harms (energy, water, labor, displacement); others say not every technical note must restate ethics.
  • Disagreement over tone policing and “hyperbole” reflects broader frustration with polarized AI debates.
  • Several lament the low quality of AI-related discussion on social platforms compared to relatively higher (though imperfect) standards on HN.

Math, formal methods, and LLMs

  • For pure math, some see LLMs mainly as front-ends to formal systems (Lean, Isabelle), with symbolic methods providing the reliability (a trivial Lean example follows below).
  • Others emphasize hard theoretical limits (e.g., halting problem) and argue the frontier is LLMs + proof assistants together, not one replacing the other.
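
To make the “LLM as front-end, proof assistant as back-end” division concrete: whatever text a model drafts, a kernel like Lean’s either certifies the proof or rejects it, so the reliability comes from the checker rather than the generator. A trivial Lean 4 example of a kernel-checked statement:

```lean
-- Lean 4: the kernel accepts this proof only if it is actually valid,
-- regardless of whether a human or an LLM wrote the `by` block.
theorem add_comm_example (a b : Nat) : a + b = b + a := by
  exact Nat.add_comm a b
```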