Study identifies weaknesses in how AI systems are evaluated
State of LLM Benchmarks
- Many commenters see current LLM benchmarks as a “wild west”: noisy, gamed, and only loosely correlated with real-world usefulness.
- Leaderboards (e.g. crowdsourced comparison sites) are viewed as easy to manipulate, biased toward short-context chat, and prone to rewarding sycophantic tuning.
- Undisclosed training data makes test-set contamination unknowable; some argue that, for smaller or unknown labs, using such results to raise money borders on fraud.
Why Evaluation Is Intrinsically Hard
- LLM performance is multi-dimensional: context length, multi-turn instruction following, “agentic” behavior, domain knowledge, robustness, etc. A single headline metric is seen as hopelessly reductive.
- Even in domains with clear ground truth (e.g. infrastructure performance), people report widespread misuse of statistics, weak experimental design, and benchmarks that don’t predict production behavior (see the statistics sketch after this list).
- Several draw parallels to human psychometrics and SAT/IQ testing: measuring “intelligence” or “reasoning” reliably is itself an unsolved problem.
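A small illustration of the statistics point: many benchmark comparisons report a single headline number with no uncertainty estimate. The sketch below shows one common remedy, a paired bootstrap confidence interval over per-item scores; the scores, model names, and thresholds are illustrative assumptions, not taken from the study or the discussion.

```python
# Minimal sketch: paired bootstrap comparison of two models on the same benchmark items.
# The per-item scores below are synthetic placeholders.
import random

def bootstrap_diff_ci(scores_a, scores_b, n_resamples=10_000, alpha=0.05):
    """Confidence interval for mean(score_a - score_b) over paired benchmark items."""
    assert len(scores_a) == len(scores_b)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    means = []
    for _ in range(n_resamples):
        sample = random.choices(diffs, k=len(diffs))  # resample items with replacement
        means.append(sum(sample) / len(sample))
    means.sort()
    lo = means[int(alpha / 2 * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

# Hypothetical per-item results (1 = correct, 0 = wrong) for two models on the same 200 items.
model_a = [float(random.random() < 0.72) for _ in range(200)]
model_b = [float(random.random() < 0.68) for _ in range(200)]
lo, hi = bootstrap_diff_ci(model_a, model_b)
print(f"95% CI for accuracy difference: [{lo:+.3f}, {hi:+.3f}]")
# If the interval straddles zero, the headline "Model A beats Model B" claim is not supported.
```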
Human Feedback, A/B Tests, and Reward Hacking
- Human preference ratings and RLHF are described as highly exploitable, producing sycophancy and overconfident wrong answers.
- A/B tests on engagement/retention are called “radioactive”: they reward behaviors like flattery or endless follow-up questions rather than correctness.
Crowd, Expert, and Private Evals
- Users want simple rankings, but rigorous evaluation would require domain-expert panels (expensive and hard to scale).
- Strong support for private, task-specific eval suites: keep your own corpus of problems and compare models on that, but don’t publish it to avoid training contamination.
- For individual developers, many just “use it and see” in their real workflow; others warn this is subjective and advocate lightweight custom scoring harnesses (a minimal sketch follows this list).
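To make the private-suite idea concrete, here is a minimal sketch of such a harness, assuming you keep a small corpus of (prompt, checker) pairs and a wrapper around whatever model API you use. The names `call_model` and `CASES` are hypothetical placeholders, not part of any real library.

```python
# Minimal sketch of a private, task-specific eval harness.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Case:
    prompt: str
    check: Callable[[str], bool]  # deterministic pass/fail check on the raw response

# Keep this corpus private so it cannot leak into future training data.
CASES = [
    Case("Extract the invoice total from: 'Total due: $1,234.50'",
         lambda out: "1234.50" in out.replace(",", "")),
    Case("Return only valid JSON with keys 'name' and 'age' for 'Ada, 36'",
         lambda out: '"name"' in out and '"age"' in out),
]

def call_model(model: str, prompt: str) -> str:
    raise NotImplementedError("wrap your provider's API here")

def run_suite(model: str) -> float:
    """Fraction of private cases the model passes."""
    passed = sum(case.check(call_model(model, case.prompt)) for case in CASES)
    return passed / len(CASES)

# Usage: compare candidate models on *your* tasks, not on a public leaderboard.
# for m in ["model-x", "model-y"]:
#     print(m, run_suite(m))
```

The point of the design is that checks are deterministic and tied to your actual workflow, so comparisons stay meaningful even as public benchmarks saturate or leak.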
Math, Reasoning, and Tool Use
- Debate over math benchmarks (e.g. AIME-style questions): strong scores may reflect the exam’s design rather than genuine reasoning, since AIME answers are constrained to integers from 0 to 999 and each exam has only 15 problems.
- Some argue it’s unfair to expect raw arithmetic from LLMs; others say marketing positions them as general problem-solvers, so failures matter.
- Growing consensus that serious evaluation should include tool-augmented setups (calculators, code, search), with the model deciding when to use them; a minimal sketch of such a loop follows this list.
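The sketch below shows one way a tool-augmented eval loop can work: the model may emit a `CALC(<expr>)` directive, the harness executes it and feeds the result back, and only the final answer is graded. The `CALC` protocol and `call_model` wrapper are assumptions for illustration, not a real API.

```python
# Minimal sketch of a tool-augmented eval loop with a toy calculator tool.
import re

def call_model(prompt: str) -> str:
    raise NotImplementedError("wrap your provider's API here")

def safe_eval(expr: str) -> str:
    # Restrict the "calculator" to plain arithmetic characters before evaluating.
    if not re.fullmatch(r"[0-9+\-*/(). ]+", expr):
        return "error: unsupported expression"
    return str(eval(expr))  # acceptable for a toy calculator on vetted input

def answer_with_tools(question: str, max_turns: int = 4) -> str:
    transcript = question
    reply = ""
    for _ in range(max_turns):
        reply = call_model(transcript)
        match = re.search(r"CALC\((.*?)\)", reply)
        if match is None:
            return reply                    # model chose to answer directly
        result = safe_eval(match.group(1))  # model chose to use the calculator
        transcript += f"\n{reply}\nTOOL RESULT: {result}\n"
    return reply

# Grading compares answer_with_tools(question) against ground truth, so the score
# reflects whether the model knows *when* to reach for a tool, not raw arithmetic.
```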
Incentives and Future Directions
- Benchmarks are seen as optimized mainly for marketing and fundraising; users often perceive new frontier models as “same-ish” despite benchmark gains.
- Suggestions include causal-inference-style evals, simulation/agent benchmarks, long-context and video tasks, and continuous “is it nerfed?”-style tracking (sketched after this list).
- Despite skepticism, many accept benchmarks as “imperfect but better than vibes,” provided their limits are explicit and they’re complemented by real-world testing.
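For the continuous-tracking suggestion, a minimal sketch is shown below: re-run a fixed private suite on a schedule and flag drops against a stored baseline. It assumes the `run_suite` harness sketched earlier; the threshold, file name, and model names are arbitrary illustrative choices.

```python
# Minimal sketch of continuous "is it nerfed?" tracking against a stored baseline.
import datetime
import json
import pathlib

BASELINE_FILE = pathlib.Path("eval_baseline.json")
REGRESSION_THRESHOLD = 0.05  # flag drops of more than 5 percentage points

def track(model: str, score: float) -> None:
    history = json.loads(BASELINE_FILE.read_text()) if BASELINE_FILE.exists() else {}
    baseline = history.get(model, {}).get("best", score)
    if score < baseline - REGRESSION_THRESHOLD:
        print(f"WARNING: {model} dropped from {baseline:.2f} to {score:.2f}")
    history[model] = {
        "best": max(baseline, score),
        "last": score,
        "checked": datetime.datetime.now(datetime.timezone.utc).isoformat(),
    }
    BASELINE_FILE.write_text(json.dumps(history, indent=2))

# Usage (scheduled daily, e.g. from CI):
# track("model-x", run_suite("model-x"))
```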