Meta got caught gaming AI benchmarks
What Meta allegedly did
- Discussion centers on Meta deploying an “experimental chat” Llama 4 variant to LMArena, tuned for “conversationality” and low refusal rates, while using different variants for other benchmarks and marketing.
- Some see this as benchmark gaming: fine‑tuning specifically for LMArena’s user-voted format and then presenting those scores as if they were for the general model.
- Others argue “got caught” is overstated: Meta disclosed the variant in its own materials, and there’s little hard evidence of outright training-on-test-set cheating.
Debate over cheating vs framing
- One subthread disputes a claim that OpenAI had previously been “caught” gaming the FrontierMath benchmark; a cited primary source explicitly denies using that data during training. Skeptics respond that even post‑hoc access to evals can still bias models.
- Several comments note that gaming ML benchmarks is as old as ML itself and connect this to Goodhart’s law: once a benchmark becomes a target, it stops measuring what it used to.
- Some commenters generalize to other labs (e.g., Grok/xAI) being accused of cherry‑picking outputs or using multi-run selection.
LMArena’s credibility and limitations
- Multiple participants say LMArena was always weak scientifically:
– Self-selected users, no strong incentives for honest or careful voting.
– Evidence of sloppy or obviously wrong votes in released battle logs.
– Lower refusal rates and “yappy,” flattery‑heavy answers appear to win, effectively “Elo hacking” (see the sketch after this list).
- Others like the head‑to‑head interface and report trying to vote carefully, but concede they may be in the minority.
- There is concern that being #1 on LMArena is now a negative signal; some argue the benchmark may be saturated and should be rethought or retired.
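To make the “Elo hacking” complaint concrete, here is a minimal sketch of how pairwise preference votes turn into ratings. The K-factor, the 60% style-driven win rate, and the online update rule are all illustrative assumptions; LMArena’s actual pipeline fits a Bradley–Terry-style model over its battle logs rather than updating Elo one vote at a time.

```python
# Minimal sketch of arena-style Elo scoring from pairwise votes.
# All values here are hypothetical; LMArena's real rating pipeline
# (Bradley-Terry fitting, tie handling, etc.) differs in detail.
import random

K = 32  # hypothetical per-vote update step

def expected(r_a: float, r_b: float) -> float:
    """Probability model A beats model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update(r_a: float, r_b: float, a_won: bool) -> tuple[float, float]:
    """Apply one head-to-head vote to both ratings."""
    e_a = expected(r_a, r_b)
    s_a = 1.0 if a_won else 0.0
    return r_a + K * (s_a - e_a), r_b + K * ((1 - s_a) - (1 - e_a))

# A "yappy", low-refusal variant that wins 60% of style-driven votes
# climbs the ladder even if its answers are no more correct:
random.seed(0)
tuned, baseline = 1000.0, 1000.0
for _ in range(1000):
    tuned_wins = random.random() < 0.60  # hypothetical style preference
    tuned, baseline = update(tuned, baseline, tuned_wins)
print(round(tuned), round(baseline))  # tuned settles ~70 Elo above baseline
```

The point of the sketch: the rating gap is driven entirely by which answer voters *prefer*, so any tuning that shifts stylistic preference (verbosity, flattery, fewer refusals) moves the leaderboard without any change in underlying capability.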
Perception of Llama 4 and Meta’s AI strategy
- Many see the Llama 4 launch as a debacle: worse than smaller or older models on practical tasks, overly verbose style, inconsistent quality across services, and poor public-facing experience (meta.ai).
- There’s debate over Meta’s Mixture-of-Experts approach: some think it underdelivered relative to DeepSeek-style MoE; others say its performance is roughly what you’d expect given active vs. total parameters (see the back-of-envelope sketch after this list).
- A few point out one clear technical positive: very large context windows, which some users value highly.
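On the active-vs-total-parameter point, the arithmetic is simple enough to show directly. This back-of-envelope sketch uses Meta’s published Llama 4 Maverick figures (roughly 400B total parameters, 17B active per token); the framing is illustrative, not a claim about either lab’s training setup.

```python
# Back-of-envelope MoE sizing for Llama 4 Maverick, using Meta's
# published figures (~400B total parameters, ~17B active per token).

total_params = 400e9   # all experts must be held in memory
active_params = 17e9   # shared layers + the routed experts a token actually uses

# Per-token FLOPs scale with ACTIVE parameters, so on quality-per-token
# grounds Maverick is closer to a ~17B-class dense model than a 400B one,
# even though its memory footprint is that of the full 400B.
print(f"active fraction: {active_params / total_params:.1%}")  # -> 4.2%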
Incentives, culture, and the broader AI race
- Several comments blame Meta’s internal “performance culture” and promotion system: pressure to show short‑term “impact,” ship half‑baked features, and move on encourages gaming the PSC (Meta’s internal performance‑review cycle) rather than depth and quality.
- Comparisons are made to earlier Meta mottos like “move fast and break things,” with arguments that such approaches fail for large, high‑stakes systems.
- Departures of senior and junior AI staff are mentioned, with speculation that pressure and reputational issues around Llama 4 and benchmarks may be contributing factors (unclear from the thread alone).
Economics, ethics, and trust
- People note the oddity of tech giants pouring money into loss‑making AI and VR, interpreting it as a platform/control play and an investor‑story necessity.
- Some raise speculative worries that Llama licenses could later be used to exert control or extract rents, since the models are “open‑weight” but not truly open source.
- Several comments link benchmark gaming to broader corporate dishonesty and, half‑seriously, to potential securities‑fraud territory if investors were misled about AI capabilities.
- Ethical criticism also surfaces around training data (copyrighted content, personal photos) and the general pattern of large firms cutting corners to sustain AI hype.