Quantitative AI progress needs accurate and transparent evaluation
Benchmarking, contamination, and Goodhart’s Law
- Many see public benchmarks as indispensable yet “toxic” once used for marketing and leaderboard clout.
- Widespread web scraping means almost any public or semi-public test likely ends up in training data, whether directly or via synthetic-benchmark “tricks” distilled from larger models.
- Several comments frame this as Goodhart’s Law: once a metric becomes a target, the problem shifts from pure measurement to an adversarial game with recursive dynamics.
Public vs private evals; “write your own tests”
- Some argue the only trustworthy tests are privately created benchmarks that are never published, especially when evaluating open models locally; any test run against a closed model’s API should be treated as “burned” (a minimal harness is sketched after this list).
- Others counter that private tests are also biased; ultimately all tests—public or private—are fallible and partly belief-driven.
- Despite these issues, many prefer benchmarks over “vibes” and ignore PR claims about tiny deltas on obscure benchmarks.
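To make the “write your own tests” idea concrete, here is a minimal sketch of a private, user-owned eval harness. The JSONL task format, the exact-match grading, and the `query_model` callable are illustrative assumptions, not anything specified in the discussion.

```python
"""Minimal sketch of a private, user-owned eval harness (illustrative only).
Assumed conventions: tasks live in a local JSONL file that is never published,
and grading is exact-match on a short answer string."""
import json
from typing import Callable


def run_private_eval(task_path: str, query_model: Callable[[str], str]) -> float:
    """Run every task in a local JSONL file and return the pass rate."""
    passed = total = 0
    with open(task_path, encoding="utf-8") as fh:
        for line in fh:
            task = json.loads(line)  # expected shape: {"prompt": ..., "answer": ...}
            reply = query_model(task["prompt"])
            passed += int(task["answer"].strip().lower() in reply.lower())
            total += 1
    return passed / total if total else 0.0


if __name__ == "__main__":
    # `query_model` wraps whatever you use (API client or local weights);
    # the lambda stub below only shows the call shape.
    rate = run_private_eval("my_private_tasks.jsonl", lambda prompt: "stub reply")
    print(f"pass rate: {rate:.1%}")
```

Because the task file never leaves your machine, running it against locally hosted open weights avoids “burning” the test through a hosted API.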
Costs, compute, and math achievements
- Tao’s emphasis on reporting success rates and per-trial cost resonates; selectively reporting only the successes badly misrepresents the true cost (a reporting sketch follows this list).
- Commenters note recent IMO-style math claims: without transparent compute budgets and error rates, “gold medal” headlines are misleading.
- Some stress differences in evaluation rigor (third-party judging vs self-judging) and liken overfitted “specialized models” to F1 cars winning kids’ races.
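A minimal sketch of the kind of reporting this point implies: count every attempt, not just the successes, and surface both cost per attempt and cost per solved problem. The trial records and dollar figures below are hypothetical.

```python
"""Sketch of success-rate and per-trial-cost reporting: every attempt is
counted, and cost is expressed per attempt and per solved problem.
The example trials are hypothetical."""
from dataclasses import dataclass


@dataclass
class Trial:
    problem_id: str
    solved: bool
    cost_usd: float  # compute/API spend for this single attempt


def summarize(trials: list[Trial]) -> dict:
    total_cost = sum(t.cost_usd for t in trials)
    solved = sum(t.solved for t in trials)
    return {
        "attempts": len(trials),
        "success_rate": solved / len(trials),
        "cost_per_attempt_usd": total_cost / len(trials),
        # Cost per *solved* problem is the number that disappears
        # when only the successes are reported.
        "cost_per_solve_usd": total_cost / solved if solved else float("inf"),
    }


if __name__ == "__main__":
    trials = [Trial("P1", True, 40.0), Trial("P1", False, 40.0),
              Trial("P2", False, 55.0), Trial("P3", True, 60.0)]
    print(summarize(trials))  # success_rate 0.5, cost_per_solve 97.5
```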
Training data overlap, originality, and gaming
- Several argue compute is less central than curation: seeding the training data with near-duplicates of test problems is the easiest “path to success” (a crude overlap check is sketched after this list).
- FrontierMath and similar incidents are cited as evidence that access to or proximity to eval data can distort results.
- Debate arises over how much location-based tasks (e.g., GeoGuessr) are solved by memorized Street View vs genuine generalization; claims conflict and remain unresolved.
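One crude way to probe for the near-duplicate contamination described above is word n-gram overlap between test items and training documents. The containment-style score, n-gram length, and threshold below are arbitrary illustrative choices, not the method used in any incident mentioned.

```python
"""Crude near-duplicate check between test problems and training text using
word n-gram containment. Parameters are illustrative defaults only."""


def ngrams(text: str, n: int = 8) -> set[tuple[str, ...]]:
    """Return the set of word n-grams in a text (lowercased, whitespace-split)."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def overlap(test_item: str, training_doc: str, n: int = 8) -> float:
    """Fraction of the test item's n-grams that also appear in the training doc."""
    a, b = ngrams(test_item, n), ngrams(training_doc, n)
    if not a:
        return 0.0
    return len(a & b) / len(a)


def flag_contaminated(test_items, training_docs, threshold: float = 0.3):
    """Yield (test_index, doc_index, score) for pairs above the overlap threshold."""
    for i, item in enumerate(test_items):
        for j, doc in enumerate(training_docs):
            score = overlap(item, doc)
            if score >= threshold:
                yield i, j, score
```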
Alternative evaluation ideas and reforms
- Suggestions include:
  - Task-specific, user-owned evals and simple tooling to build them.
  - Reporting cost–performance tradeoffs (e.g., ARC-AGI-style score-vs-price plots with human baselines).
  - Data-compression–style tests as a proxy for “understanding rules” vs mere extrapolation.
  - Pre-registered evals, analogous to pre-registered studies, to reduce post-hoc cherry-picking (one possible mechanism is sketched below).
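One possible mechanism for pre-registering an eval, sketched under the assumption that a simple cryptographic commitment suffices: publish a SHA-256 digest of the frozen test set and scoring code before any model is run, then reveal the files afterwards so others can verify nothing was swapped post hoc. The file names are placeholders.

```python
"""Sketch of eval pre-registration via a published commitment: hash the frozen
eval artifacts before any model is evaluated, reveal them later for audit.
File names are placeholders."""
import hashlib
from pathlib import Path


def commitment(paths: list[str]) -> str:
    """Hash the concatenated bytes of the eval artifacts in a fixed order."""
    h = hashlib.sha256()
    for p in sorted(paths):
        h.update(Path(p).read_bytes())
    return h.hexdigest()


if __name__ == "__main__":
    # Publish this digest (e.g., in a repo tag or timestamped post) *before*
    # evaluating any model; reveal the files afterwards for verification.
    print(commitment(["eval_set.jsonl", "scoring.py", "analysis_plan.md"]))
```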
Ethics, social impact, and discourse quality
- Some criticize purely “technical” discussion as ignoring environmental and social harms (energy, water, labor, displacement); others say not every technical note must restate ethics.
- Disagreement over tone policing and “hyperbole” reflects broader frustration with polarized AI debates.
- Several lament the low quality of AI-related discussion on social platforms compared to relatively higher (though imperfect) standards on HN.
Math, formal methods, and LLMs
- For pure math, some see LLMs mainly as front-ends to formal systems (Lean, Isabelle), with symbolic methods providing reliability (a toy Lean example follows this list).
- Others emphasize hard theoretical limits (e.g., the halting problem) and argue the frontier is LLMs + proof assistants together, not one replacing the other.
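A toy Lean 4 snippet to illustrate the division of labor described here: whatever proof an LLM drafts, the proof assistant accepts it only if every step type-checks. The theorem and its one-line proof are standard-library material, not taken from the discussion.

```lean
-- Illustration of the “LLM drafts, proof assistant checks” workflow:
-- Lean accepts the proposed proof term only if it type-checks.
-- `Nat.add_comm` is part of Lean 4's core library.
theorem my_add_comm (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b
```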