Quantitative AI progress needs accurate and transparent evaluation

Benchmarking, contamination, and Goodhart’s Law

  • Many see public benchmarks as indispensable yet “toxic” once used for marketing and leaderboard clout.
  • Widespread web scraping means almost any public or semi-public test likely ends up in training data and contaminates results; this extends to synthetic-benchmark “tricks” distilled from larger models.
  • Several comments frame this as Goodhart’s Law: once a metric becomes a target, the problem shifts from pure measurement to an adversarial game with recursive dynamics.

Public vs private evals; “write your own tests”

  • Some argue the only trustworthy tests are privately created benchmarks that are never published, and then only for open models run locally; any test ever sent to a closed, hosted model should be treated as “burned,” since the prompts can be logged and absorbed into future training.
  • Others counter that private tests are also biased; ultimately all tests—public or private—are fallible and partly belief-driven.
  • Despite these issues, many still prefer benchmarks to “vibes,” and simply ignore PR claims built on tiny deltas over obscure benchmarks.

Costs, compute, and math achievements

  • Tao’s emphasis on reporting success rates and per-trial cost resonates; selectively reporting only the successful runs badly misrepresents the true cost (see the sketch after this list).
  • Commenters note recent IMO-style math claims: without transparent compute budgets and error rates, “gold medal” headlines are misleading.
  • Some stress differences in evaluation rigor (third-party judging vs self-judging) and liken overfitted “specialized models” to F1 cars winning kids’ races.
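
A concrete way to see why success-only reporting misleads: if trials are independent, the expected spend per solved problem is the per-trial cost divided by the success rate, so a cheap-looking trial with a low hit rate is expensive per success. A minimal sketch with made-up numbers (nothing here comes from the discussion):

```python
def cost_per_success(per_trial_cost: float, success_rate: float) -> float:
    """Expected spend to obtain one success when trials are independent."""
    if not 0 < success_rate <= 1:
        raise ValueError("success_rate must be in (0, 1]")
    return per_trial_cost / success_rate

# Hypothetical numbers: a $2 trial that succeeds 5% of the time costs $40 per
# solved problem in expectation, even though a success-only report would show
# just the single $2 winning run.
print(cost_per_success(per_trial_cost=2.00, success_rate=0.05))  # 40.0
```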

Training data overlap, originality, and gaming

  • Several argue compute is less central than curation: ensuring the training data contains near-duplicates of the test problems is the easiest “path to success” (a simple overlap check is sketched after this list).
  • FrontierMath and similar incidents are cited as evidence that access to or proximity to eval data can distort results.
  • Debate arises over how far location-based tasks (e.g., GeoGuessr) are solved by memorized Street View imagery versus genuine generalization; the claims conflict and remain unresolved.
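
One low-tech way to probe the near-duplicate concern is an n-gram overlap score between a benchmark item and candidate training documents. This is only an illustrative sketch; the n-gram size and the threshold are arbitrary choices, not anything proposed in the thread:

```python
import re

def ngrams(text: str, n: int = 8) -> set[tuple[str, ...]]:
    """Lower-cased word n-grams, a crude fingerprint of the text."""
    words = re.findall(r"\w+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def overlap(test_item: str, training_doc: str, n: int = 8) -> float:
    """Fraction of the test item's n-grams that also appear in the training doc."""
    test_grams = ngrams(test_item, n)
    if not test_grams:
        return 0.0
    return len(test_grams & ngrams(training_doc, n)) / len(test_grams)

# Flag anything above an (arbitrary) threshold for manual review.
if overlap(test_item="Prove that the sum of two even integers is even.",
           training_doc="Exercise: prove that the sum of two even integers is even.") > 0.5:
    print("possible contamination: near-duplicate of a test problem")
```

Exact n-gram matching misses paraphrases, which is part of why contamination is so hard to rule out once a test is public.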

Alternative evaluation ideas and reforms

  • Suggestions include (two short sketches follow this list):
    • Task-specific, user-owned evals and simple tooling to build them.
    • Reporting cost–performance tradeoffs (e.g., ARC-AGI-style score vs price plots, human baselines).
    • Data-compression–style tests as a proxy for “understanding rules” vs mere extrapolation.
    • Pre-registered evals, analogous to pre-registered studies, to reduce post-hoc cherry-picking.
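
To unpack the compression idea: “understanding the rule” behind some data and “compressing that data well” are closely related, so compressibility can serve as a crude proxy. A toy illustration using gzip as a stand-in for any compressor (not a concrete proposal from the thread):

```python
import gzip
import random

def compression_ratio(data: bytes) -> float:
    """Compressed size over original size; lower means more structure was found."""
    return len(gzip.compress(data)) / len(data)

rule_based = bytes((i * 7) % 256 for i in range(10_000))            # generated by a short rule
random_bytes = bytes(random.randrange(256) for _ in range(10_000))  # no rule to discover

print(f"rule-based data: {compression_ratio(rule_based):.3f}")      # far below 1.0
print(f"random data:     {compression_ratio(random_bytes):.3f}")    # near (or above) 1.0
```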
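
A user-owned, pre-registered eval can be as simple as a local file of prompt/answer pairs whose hash is recorded before any model is queried; the digest commits you to the test set, and the log reports cost alongside accuracy. A minimal sketch, where `ask_model(prompt)` is a hypothetical stand-in returning an answer and its dollar cost:

```python
import hashlib
import json

def preregister(eval_path: str) -> str:
    """Hash the eval file so the test set is committed to before any model runs."""
    with open(eval_path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

def run_eval(eval_path: str, ask_model) -> dict:
    """Score a private eval, reporting success rate alongside per-item cost."""
    digest = preregister(eval_path)  # record/publish this digest before running
    with open(eval_path) as f:
        items = json.load(f)         # expected shape: [{"prompt": ..., "answer": ...}, ...]
    successes, total_cost = 0, 0.0
    for item in items:
        reply, cost = ask_model(item["prompt"])  # hypothetical model call, not a real API
        successes += int(reply.strip() == item["answer"])
        total_cost += cost
    return {
        "digest": digest,
        "success_rate": successes / len(items),
        "cost_per_item": total_cost / len(items),
    }
```

Keeping the eval file private and publishing only the digest lets results be checked later without burning the test set.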

Ethics, social impact, and discourse quality

  • Some criticize purely “technical” discussion as ignoring environmental and social harms (energy, water, labor, displacement); others say not every technical note must restate ethics.
  • Disagreement over tone policing and “hyperbole” reflects broader frustration with polarized AI debates.
  • Several lament the low quality of AI-related discussion on social platforms compared to relatively higher (though imperfect) standards on HN.

Math, formal methods, and LLMs

  • For pure math, some see LLMs mainly as front-ends to formal systems (Lean, Isabelle), with symbolic methods providing the reliability (a trivial Lean example follows below).
  • Others emphasize hard theoretical limits (e.g., halting problem) and argue the frontier is LLMs + proof assistants together, not one replacing the other.
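
To make the “LLM as front-end, proof assistant as back-end” division concrete: whatever text a model drafts, a kernel like Lean’s either certifies the proof or rejects it, so the reliability comes from the checker rather than the generator. A trivial Lean 4 example of a kernel-checked statement:

```lean
-- Lean 4: the kernel accepts this proof only if it is actually valid,
-- regardless of whether a human or an LLM wrote the `by` block.
theorem add_comm_example (a b : Nat) : a + b = b + a := by
  exact Nat.add_comm a b
```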