Study identifies weaknesses in how AI systems are evaluated

State of LLM Benchmarks

  • Many commenters see current LLM benchmarks as a “wild west”: noisy, gamed, and only loosely correlated with real-world usefulness.
  • Leaderboards (e.g. crowdsourced comparison sites) are viewed as easy to game, biased toward short-context chat, and prone to encouraging sycophantic tuning.
  • Because training data is undisclosed, test-set contamination is unknowable; some argue that when smaller or lesser-known labs use such scores to raise money, it borders on fraud.

Why Evaluation Is Intrinsically Hard

  • LLM performance is multi-dimensional: context length, multi-turn instruction following, “agentic” behavior, domain knowledge, robustness, etc. A single headline metric is seen as hopelessly reductive.
  • Even in domains with clear ground truth (e.g. infrastructure performance), people report widespread misuse of statistics, weak experimental design, and benchmarks that don’t predict production behavior.
  • Several draw parallels to human psychometrics and SAT/IQ testing: measuring “intelligence” or “reasoning” reliably is itself an unsolved problem.

Human Feedback, A/B Tests, and Reward Hacking

  • Human preference ratings and RLHF are described as highly exploitable, producing sycophancy and overconfident wrong answers.
  • A/B tests on engagement/retention are called “radioactive”: they reward behaviors like flattery or endless follow-up questions rather than correctness.

Crowd, Expert, and Private Evals

  • Users want simple rankings, but rigorous evaluation would require domain-expert panels (expensive and hard to scale).
  • Strong support for private, task-specific eval suites: keep your own corpus of problems and compare models on that, but don’t publish it to avoid training contamination.
  • For individual developers, many just “use it and see” in their real workflow; others warn this is subjective and advocate lightweight custom scoring harnesses (a minimal sketch follows below).
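
As a concrete illustration, here is a minimal sketch of such a private scoring harness. It assumes a hypothetical `query_model` callable that wraps whatever API each model is served behind, and the two tasks are placeholders for a team’s own held-out corpus.

```python
"""Minimal private eval harness: run a held-out task corpus against
several models and report per-model accuracy. `query_model` is a
hypothetical stand-in for a real API client."""

from typing import Callable

# Private corpus: kept unpublished to avoid training contamination.
# A crude exact-match grader is used here; real suites would use
# richer graders (regex, unit tests, rubrics, human review).
TASKS = [
    {"prompt": "Convert 2.5 hours to minutes. Answer with a number only.",
     "expected": "150"},
    {"prompt": "Which HTTP status code means 'Not Found'? Answer with a number only.",
     "expected": "404"},
]


def grade(response: str, expected: str) -> bool:
    """Exact match on the stripped response."""
    return response.strip() == expected


def evaluate(model: str, query_model: Callable[[str, str], str]) -> float:
    """Fraction of tasks the model answers correctly."""
    correct = sum(
        grade(query_model(model, task["prompt"]), task["expected"])
        for task in TASKS
    )
    return correct / len(TASKS)


if __name__ == "__main__":
    def dummy_client(model: str, prompt: str) -> str:
        # Placeholder; replace with a real per-model API call.
        return ""

    for model in ["model-a", "model-b"]:
        print(f"{model}: {evaluate(model, dummy_client):.0%}")
```

The point is less the scoring rule than the discipline: the same fixed, unpublished tasks are replayed against every candidate model.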

Math, Reasoning, and Tool Use

  • Debate over math benchmarks (e.g. AIME-style questions): since answers are constrained to small integers, high scores may reflect exam design rather than real reasoning.
  • Some argue it’s unfair to expect raw arithmetic from LLMs; others say marketing positions them as general problem-solvers, so failures matter.
  • Growing consensus that serious evaluation should include tool-augmented setups (calculators, code, search), with the model deciding when to use them (see the sketch below).
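
A rough sketch of what such a tool-augmented setup could look like, assuming a hypothetical `ask_model` callable that returns JSON actions (either a final answer or a calculator request). The harness grades only the final answer; the model decides whether to call the tool.

```python
"""Sketch of a tool-augmented evaluation loop: the model may either
answer directly or request a calculator call, and only the final
answer is graded. `ask_model` is a hypothetical stand-in for a real
chat client that returns JSON-encoded actions."""

import json


def run_with_tools(ask_model, question: str, max_steps: int = 5) -> str:
    """Let the model decide when to use the calculator, then return its answer."""
    transcript = [{"role": "user", "content": question}]
    for _ in range(max_steps):
        action = json.loads(ask_model(transcript))
        if "answer" in action:
            return action["answer"]
        if action.get("tool") == "calculator":
            # Evaluate a plain arithmetic expression; a real harness would sandbox this.
            result = str(eval(action["input"], {"__builtins__": {}}, {}))
            transcript.append({"role": "tool", "content": result})
    return ""  # Model never produced a final answer.


if __name__ == "__main__":
    # Scripted fake model: first requests the calculator, then answers with its output.
    def fake_model(transcript):
        last = transcript[-1]
        if last["role"] == "user":
            return json.dumps({"tool": "calculator", "input": "37 * 41"})
        return json.dumps({"answer": last["content"]})

    answer = run_with_tools(fake_model, "What is 37 * 41?")
    print(answer, answer == "1517")  # Grade the final answer, not the path taken.
```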

Incentives and Future Directions

  • Benchmarks are seen as optimized mainly for marketing and fundraising; users often perceive new frontier models as “same-ish” despite benchmark gains.
  • Suggestions include causal-inference-style evals, simulation/agent benchmarks, long-context and video tasks, and continuous “is it nerfed?” style tracking (a toy example follows this list).
  • Despite skepticism, many accept benchmarks as “imperfect but better than vibes,” provided their limits are explicit and they’re complemented by real-world testing.
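
To make the “is it nerfed?” idea concrete, a toy tracker might replay a fixed private suite on a schedule, log each run, and flag drops. The file name, threshold, and example scores below are arbitrary assumptions; scores would come from a harness like the one sketched earlier.

```python
"""Toy drift tracker for a fixed private suite: append each run's score
to a CSV and flag regressions beyond a threshold."""

import csv
from datetime import datetime, timezone
from pathlib import Path

LOG = Path("eval_history.csv")
REGRESSION_THRESHOLD = 0.05  # Flag a 5-point absolute drop in accuracy.


def record_run(model: str, score: float) -> None:
    """Append a timestamped score and warn if it dropped versus the last run."""
    previous = None
    if LOG.exists():
        with LOG.open() as f:
            rows = [r for r in csv.reader(f) if r and r[0] == model]
        if rows:
            previous = float(rows[-1][2])

    with LOG.open("a", newline="") as f:
        csv.writer(f).writerow(
            [model, datetime.now(timezone.utc).isoformat(), f"{score:.4f}"]
        )

    if previous is not None and previous - score > REGRESSION_THRESHOLD:
        print(f"WARNING: {model} dropped from {previous:.2f} to {score:.2f}")


if __name__ == "__main__":
    record_run("model-a", 0.82)  # Example scores; real runs would call the eval harness.
    record_run("model-a", 0.74)  # Triggers a regression warning.
```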