2026-05-28

Disagreement among frontier LLMs on real-world fact-checks

Study setup and main finding

Five major LLMs were asked once per claim to classify 1,000 recent user-submitted “fact-check” claims into four buckets: True, Mostly True, Misleading, False, with no explanations and no option to abstain.
For about two-thirds of claims, at least one model disagreed with the others or there was no clear majority; the study described the agreement measured by ordinal Krippendorff’s α as “limited but nontrivial.”
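As a rough illustration of how these two headline numbers can be computed, the sketch below derives a non-unanimous rate and ordinal Krippendorff’s α from per-claim label lists. The function names and toy data are hypothetical, and the label ordering (True, Mostly True, Misleading, False) is an assumption; the study's own computation may differ.

```python
from itertools import permutations

# Assumed ordinal scale, ordered from most to least true.
LABELS = ["True", "Mostly True", "Misleading", "False"]


def non_unanimous_rate(units):
    """Share of claims where the models did not all give the same label."""
    return sum(len(set(ratings)) > 1 for ratings in units) / len(units)


def ordinal_alpha(units, labels=LABELS):
    """Krippendorff's alpha with the ordinal distance metric.

    `units` is a list of claims; each claim is a list of labels,
    one per model. Claims with fewer than two labels are skipped.
    """
    rank = {label: i for i, label in enumerate(labels)}
    k = len(labels)

    # Coincidence matrix: every ordered pair of labels from two
    # different raters on the same claim adds 1 / (m - 1).
    o = [[0.0] * k for _ in range(k)]
    for ratings in units:
        m = len(ratings)
        if m < 2:
            continue
        for a, b in permutations(ratings, 2):
            o[rank[a]][rank[b]] += 1.0 / (m - 1)

    n_c = [sum(row) for row in o]  # marginal total per label
    n = sum(n_c)                   # total number of pairable labels

    def delta2(c, d):
        # Ordinal metric: squared count of labels lying between c and d.
        lo, hi = min(c, d), max(c, d)
        return (sum(n_c[lo:hi + 1]) - (n_c[lo] + n_c[hi]) / 2) ** 2

    observed = sum(o[c][d] * delta2(c, d) for c in range(k) for d in range(k))
    expected = sum(n_c[c] * n_c[d] * delta2(c, d) for c in range(k) for d in range(k))
    if expected == 0:
        return float("nan")  # alpha is undefined when only one label is ever used
    return 1.0 - (n - 1) * observed / expected


if __name__ == "__main__":
    # Toy data: three claims, five models each (not from the study).
    toy = [
        ["True", "True", "Mostly True", "True", "True"],
        ["False", "Misleading", "False", "False", "Misleading"],
        ["True", "Misleading", "Mostly True", "False", "Mostly True"],
    ]
    print("non-unanimous rate:", non_unanimous_rate(toy))
    print("ordinal alpha:", ordinal_alpha(toy))
```

Because the ordinal metric penalizes True vs False more than True vs Mostly True, α can remain moderate even when the raw non-unanimous rate is high.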

Methodological critiques

Many commenters argue the headline “disagreement” rate is inflated by:
- Treating small differences (True vs Mostly True, Misleading vs False) as disagreements.
- Forcing a label without “I don’t know,” especially for unverifiable or post–training-cutoff events.
- Using only a single deterministic pass per model and not measuring within-model variance.
Lack of a human baseline is seen as a major omission; similar human panels are known to disagree substantially on comparable tasks.
Some view the whole setup as evaluating the prompt/harness more than the underlying models.
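The first critique can be made concrete by splitting disagreements into those that stay on one side of the true/false divide and those that cross it. The sketch below is one possible way to do that; the grouping of True with Mostly True and of Misleading with False is an assumption taken from the commenters' examples, and the function names are hypothetical.

```python
from collections import Counter

# Assumed grouping: labels on the "true" side of the scale.
TRUE_SIDE = {"True", "Mostly True"}


def disagreement_kind(ratings):
    """Classify one claim's labels as 'unanimous', 'minor', or 'polar'.

    'minor' means the labels differ but all sit on the same side
    (e.g. True vs Mostly True); 'polar' means they cross the divide.
    """
    if len(set(ratings)) == 1:
        return "unanimous"
    sides = {label in TRUE_SIDE for label in ratings}
    return "minor" if len(sides) == 1 else "polar"


def tally(units):
    """Count how many claims fall into each disagreement kind."""
    return Counter(disagreement_kind(ratings) for ratings in units)


if __name__ == "__main__":
    # Toy data, not from the study.
    toy = [
        ["True", "True", "True"],
        ["True", "Mostly True", "True"],
        ["Misleading", "False", "False"],
        ["Mostly True", "Misleading", "True"],
    ]
    print(tally(toy))
```

Reporting the minor and polar counts separately would show how much of the headline rate comes from adjacent-label differences.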

Ambiguous labels and rubric issues

The four labels are seen as semantically fuzzy and overlapping, especially “Mostly True” and “Misleading.”
Without explicit rubric definitions or examples, models may be disagreeing on how to map nuanced situations into these buckets rather than on the underlying facts.

Time, search, and unanswerable claims

Several claims concern very recent events, future predictions, or inherently unprovable statements (e.g., extraterrestrial life).
Three models had only parametric knowledge; two had web search. Even those two disagreed often, suggesting retrieval does not trivially solve the problem.
Many argue that, for such items, the only correct behavior is to say “unknown,” which was deliberately disallowed.

Humans, bias, and the nature of facts

Commenters note that humans also disagree heavily on similar claims, especially political, forward-looking, or definition-dependent ones.
There is extended discussion of epistemology: facts vs evidence, probabilistic knowledge, and how “fact checking” often embeds value judgments or political framing.
Some point out that the corpus itself mixes clear factual items with opinions, predictions, and culturally contested language.

Usefulness, risks, and suggested improvements

Some see the results as confirmation that LLMs are unreliable fact-check oracles and should be used mainly as research assistants with human oversight.
Others argue the study underestimates models’ practical usefulness because it bans the explanations, context, and interactive clarification that characterize how people actually use them.
Suggested follow-ups include:
- Adding an abstain/unknown bucket.
- Providing clearer rubrics and examples.
- Collecting human labels.
- Letting models “think out loud.”
- Running multiple samples per model.
- Separating minor vs polar disagreements.
- Publishing intra-model and human–model agreement.
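Two of these follow-ups, repeated sampling and an abstain bucket, are simple to measure once the data exists. The sketch below assumes each model is queried several times per claim and may answer "Unknown"; the label string, the data layout, and the function names are all hypothetical.

```python
from collections import Counter

ABSTAIN = "Unknown"  # hypothetical abstain label, not part of the original study


def self_consistency(samples):
    """Fraction of repeated samples that match the model's most common label."""
    counts = Counter(samples)
    return counts.most_common(1)[0][1] / len(samples)


def mean_self_consistency(samples_by_claim):
    """Average self-consistency over all claims for one model."""
    scores = [self_consistency(s) for s in samples_by_claim.values()]
    return sum(scores) / len(scores)


def abstention_rate(samples_by_claim):
    """Share of all samples in which the model abstained."""
    all_samples = [label for s in samples_by_claim.values() for label in s]
    return all_samples.count(ABSTAIN) / len(all_samples)


if __name__ == "__main__":
    # Toy data for one model: claim id -> labels from four repeated runs.
    runs = {
        "claim-1": ["True", "True", "True", "True"],
        "claim-2": ["Misleading", "False", "Misleading", "Misleading"],
        "claim-3": ["Unknown", "Unknown", "Mostly True", "Unknown"],
    }
    print("mean self-consistency:", mean_self_consistency(runs))
    print("abstention rate:", abstention_rate(runs))
```

Comparing within-model consistency against between-model agreement would indicate how much of the observed disagreement is sampling noise rather than a genuine difference between models.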
