Study identifies weaknesses in how AI systems are evaluated
State of LLM Benchmarks
- Many commenters see current LLM benchmarks as a “wild west”: noisy, gamed, and only loosely correlated with real-world usefulness.
- Leaderboards (e.g. crowdsourced comparison sites) are viewed as easy to manipulate, biased toward short-context chat, and prone to rewarding sycophantic tuning.
- Undisclosed training data makes test-set contamination unknowable; some argue that, for smaller or unknown labs, using such results to raise money borders on fraud.
Why Evaluation Is Intrinsically Hard
- LLM performance is multi-dimensional: context length, multi-turn instruction following, “agentic” behavior, domain knowledge, robustness, etc. A single headline metric is seen as hopelessly reductive.
- Even in domains with clear ground truth (e.g. infrastructure performance), people report widespread misuse of statistics, weak experimental design, and benchmarks that don’t predict production behavior (see the statistics sketch after this list).
- Several draw parallels to human psychometrics and SAT/IQ testing: measuring “intelligence” or “reasoning” reliably is itself an unsolved problem.
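A small illustration of the statistics point: many benchmark comparisons report a single headline number with no uncertainty estimate. The sketch below shows one common remedy, a paired bootstrap confidence interval over per-item scores; the scores, model names, and thresholds are illustrative assumptions, not taken from the study or the discussion.

```python
# Minimal sketch: paired bootstrap comparison of two models on the same benchmark items.
# The per-item scores below are synthetic placeholders.
import random

def bootstrap_diff_ci(scores_a, scores_b, n_resamples=10_000, alpha=0.05):
    """Confidence interval for mean(score_a - score_b) over paired benchmark items."""
    assert len(scores_a) == len(scores_b)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    means = []
    for _ in range(n_resamples):
        sample = random.choices(diffs, k=len(diffs))  # resample items with replacement
        means.append(sum(sample) / len(sample))
    means.sort()
    lo = means[int(alpha / 2 * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

# Hypothetical per-item results (1 = correct, 0 = wrong) for two models on the same 200 items.
model_a = [float(random.random() < 0.72) for _ in range(200)]
model_b = [float(random.random() < 0.68) for _ in range(200)]
lo, hi = bootstrap_diff_ci(model_a, model_b)
print(f"95% CI for accuracy difference: [{lo:+.3f}, {hi:+.3f}]")
# If the interval straddles zero, the headline "Model A beats Model B" claim is not supported.
```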
Human Feedback, A/B Tests, and Reward Hacking
- Human preference ratings and RLHF are described as highly exploitable, producing sycophancy and overconfident wrong answers.
- A/B tests on engagement/retention are called “radioactive”: they reward behaviors like flattery or endless follow-up questions rather than correctness.
Crowd, Expert, and Private Evals
- Users want simple rankings, but rigorous evaluation would require domain-expert panels (expensive and hard to scale).
- Strong support for private, task-specific eval suites: keep your own corpus of problems and compare models on that, but don’t publish it to avoid training contamination.
- For individual developers, many just “use it and see” in their real workflow; others warn this is subjective and advocate lightweight custom scoring harnesses (a minimal sketch follows this list).
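To make the private-suite idea concrete, here is a minimal sketch of such a harness, assuming you keep a small corpus of (prompt, checker) pairs and a wrapper around whatever model API you use. The names `call_model` and `CASES` are hypothetical placeholders, not part of any real library.

```python
# Minimal sketch of a private, task-specific eval harness.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Case:
    prompt: str
    check: Callable[[str], bool]  # deterministic pass/fail check on the raw response

# Keep this corpus private so it cannot leak into future training data.
CASES = [
    Case("Extract the invoice total from: 'Total due: $1,234.50'",
         lambda out: "1234.50" in out.replace(",", "")),
    Case("Return only valid JSON with keys 'name' and 'age' for 'Ada, 36'",
         lambda out: '"name"' in out and '"age"' in out),
]

def call_model(model: str, prompt: str) -> str:
    raise NotImplementedError("wrap your provider's API here")

def run_suite(model: str) -> float:
    """Fraction of private cases the model passes."""
    passed = sum(case.check(call_model(model, case.prompt)) for case in CASES)
    return passed / len(CASES)

# Usage: compare candidate models on *your* tasks, not on a public leaderboard.
# for m in ["model-x", "model-y"]:
#     print(m, run_suite(m))
```

The point of the design is that checks are deterministic and tied to your actual workflow, so comparisons stay meaningful even as public benchmarks saturate or leak.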
Math, Reasoning, and Tool Use
- Debate over math benchmarks (e.g. AIME-style questions): strong scores may reflect the exam’s design rather than genuine reasoning, since AIME answers are constrained to integers from 0 to 999 and each exam has only 15 problems.
- Some argue it’s unfair to expect raw arithmetic from LLMs; others say marketing positions them as general problem-solvers, so failures matter.
- Growing consensus that serious evaluation should include tool-augmented setups (calculators, code, search), with the model deciding when to use them; a minimal sketch of such a loop follows this list.
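The sketch below shows one way a tool-augmented eval loop can work: the model may emit a `CALC(<expr>)` directive, the harness executes it and feeds the result back, and only the final answer is graded. The `CALC` protocol and `call_model` wrapper are assumptions for illustration, not a real API.

```python
# Minimal sketch of a tool-augmented eval loop with a toy calculator tool.
import re

def call_model(prompt: str) -> str:
    raise NotImplementedError("wrap your provider's API here")

def safe_eval(expr: str) -> str:
    # Restrict the "calculator" to plain arithmetic characters before evaluating.
    if not re.fullmatch(r"[0-9+\-*/(). ]+", expr):
        return "error: unsupported expression"
    return str(eval(expr))  # acceptable for a toy calculator on vetted input

def answer_with_tools(question: str, max_turns: int = 4) -> str:
    transcript = question
    reply = ""
    for _ in range(max_turns):
        reply = call_model(transcript)
        match = re.search(r"CALC\((.*?)\)", reply)
        if match is None:
            return reply                    # model chose to answer directly
        result = safe_eval(match.group(1))  # model chose to use the calculator
        transcript += f"\n{reply}\nTOOL RESULT: {result}\n"
    return reply

# Grading compares answer_with_tools(question) against ground truth, so the score
# reflects whether the model knows *when* to reach for a tool, not raw arithmetic.
```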
Incentives and Future Directions
- Benchmarks are seen as optimized mainly for marketing and fundraising; users often perceive new frontier models as “same-ish” despite benchmark gains.
- Suggestions include causal-inference-style evals, simulation/agent benchmarks, long-context and video tasks, and continuous “is it nerfed?”-style tracking (sketched after this list).
- Despite skepticism, many accept benchmarks as “imperfect but better than vibes,” provided their limits are explicit and they’re complemented by real-world testing.
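For the continuous-tracking suggestion, a minimal sketch is shown below: re-run a fixed private suite on a schedule and flag drops against a stored baseline. It assumes the `run_suite` harness sketched earlier; the threshold, file name, and model names are arbitrary illustrative choices.

```python
# Minimal sketch of continuous "is it nerfed?" tracking against a stored baseline.
import datetime
import json
import pathlib

BASELINE_FILE = pathlib.Path("eval_baseline.json")
REGRESSION_THRESHOLD = 0.05  # flag drops of more than 5 percentage points

def track(model: str, score: float) -> None:
    history = json.loads(BASELINE_FILE.read_text()) if BASELINE_FILE.exists() else {}
    baseline = history.get(model, {}).get("best", score)
    if score < baseline - REGRESSION_THRESHOLD:
        print(f"WARNING: {model} dropped from {baseline:.2f} to {score:.2f}")
    history[model] = {
        "best": max(baseline, score),
        "last": score,
        "checked": datetime.datetime.now(datetime.timezone.utc).isoformat(),
    }
    BASELINE_FILE.write_text(json.dumps(history, indent=2))

# Usage (scheduled daily, e.g. from CI):
# track("model-x", run_suite("model-x"))
```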