Disagreement among frontier LLMs on real-world fact-checks
Study setup and main finding
- Five major LLMs were asked once per claim to classify 1,000 recent user-submitted “fact-check” claims into four buckets: True, Mostly True, Misleading, False, with no explanations and no option to abstain.
- On about two-thirds of claims the models either were not unanimous or produced no clear majority; the reported ordinal Krippendorff’s α was characterized as “limited but nontrivial” agreement.
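To make the headline number concrete, here is a minimal sketch of how a per-claim disagreement tally could be computed. The function names and the example votes are hypothetical and are not taken from the study; the sketch also assumes "disagreement" means "not unanimous" and "clear majority" means more than half of the models, which the summary does not spell out.

```python
from collections import Counter

# The four labels used in the study, from most to least true.
LABELS = ("True", "Mostly True", "Misleading", "False")

def summarize_claim(votes):
    """Return (unanimous, has_majority) for one claim's list of model labels."""
    counts = Counter(votes)
    top_count = counts.most_common(1)[0][1]
    unanimous = top_count == len(votes)
    # Assumption: a "clear majority" is strictly more than half of the models.
    has_majority = top_count > len(votes) / 2
    return unanimous, has_majority

def disagreement_rate(all_votes):
    """Share of claims on which the models were not unanimous."""
    unanimous_flags = [summarize_claim(v)[0] for v in all_votes]
    return 1 - sum(unanimous_flags) / len(unanimous_flags)

# Hypothetical votes from five models on three claims (illustrative only).
example = [
    ["True", "True", "True", "True", "True"],
    ["True", "Mostly True", "True", "True", "True"],
    ["True", "Mostly True", "Misleading", "False", "False"],
]

if __name__ == "__main__":
    print(disagreement_rate(example))
    print([summarize_claim(v) for v in example])
```

Under this definition a single True vs Mostly True split counts the same as a True vs False split, which is the property several of the critiques below object to.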
Methodological critiques
- Many commenters argue the headline “disagreement” rate is inflated by:
  - Treating small differences (True vs Mostly True, Misleading vs False) as disagreements.
  - Forcing a label without “I don’t know,” especially for unverifiable or post–training-cutoff events.
  - Using only a single deterministic pass per model and not measuring within-model variance.
- Lack of a human baseline is seen as a major omission; similar human panels are known to disagree substantially on comparable tasks.
- Some view the whole setup as evaluating the prompt/harness more than the underlying models.
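One way to act on the first critique is to separate disagreements that stay on one side of the true/false divide from those that cross it. The sketch below is an assumption about how that split could be drawn, not the study's method: the grouping of labels into "true-ish" and "false-ish" sides follows the commenters' own examples, and all names are hypothetical.

```python
# Assumed grouping: True and Mostly True on one side, Misleading and False on the other.
SIDE = {
    "True": "true-ish",
    "Mostly True": "true-ish",
    "Misleading": "false-ish",
    "False": "false-ish",
}

def classify_disagreement(votes):
    """Label one claim's votes as 'unanimous', 'minor', or 'polar'."""
    if len(set(votes)) == 1:
        return "unanimous"
    sides = {SIDE[v] for v in votes}
    # 'minor': labels differ but all sit on the same side of the divide.
    # 'polar': at least one model landed on each side.
    return "minor" if len(sides) == 1 else "polar"

if __name__ == "__main__":
    print(classify_disagreement(["True", "Mostly True", "True"]))
    print(classify_disagreement(["Mostly True", "Misleading", "True"]))
```

Reporting the "minor" and "polar" shares separately would show how much of the headline rate comes from near-synonymous labels. Note that Mostly True vs Misleading counts as polar here even though the two labels are adjacent on the scale, which is itself a judgment call.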
Ambiguous labels and rubric issues
- The four labels are seen as semantically fuzzy and overlapping, especially “Mostly True” and “Misleading.”
- Without explicit rubric definitions or examples, models may be disagreeing on how to map nuanced situations into these buckets rather than on the underlying facts.
Time, search, and unanswerable claims
- Several claims concern very recent events, future predictions, or inherently unprovable statements (e.g., extraterrestrial life).
- Three models had only parametric knowledge; two had web search. Even those two disagreed often, suggesting retrieval does not trivially solve the problem.
- Many argue that, for such items, the only correct behavior is to say “unknown,” which was deliberately disallowed.
Humans, bias, and the nature of facts
- Commenters note that humans also disagree heavily on similar claims, especially political, forward-looking, or definition-dependent ones.
- There is extended discussion of epistemology: facts vs evidence, probabilistic knowledge, and how “fact checking” often embeds value judgments or political framing.
- Some point out that the corpus itself mixes clear factual items with opinions, predictions, and culturally contested language.
Usefulness, risks, and suggested improvements
- Some see the results as confirmation that LLMs are unreliable fact-check oracles and should be used mainly as research assistants with human oversight.
- Others argue the study underestimates models’ practical usefulness because it bans explanations, context, and interactive clarification, which is how people actually use them.
- Suggested follow-ups include: adding an abstain/unknown bucket, clearer rubrics and examples, collecting human labels, letting models “think out loud,” running multiple samples per model, separating minor vs polar disagreements, and publishing intra-model and human–model agreement.
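As an illustration of the "multiple samples per model" suggestion, within-model consistency could be measured as the share of repeated samples that match the model's most common label for a claim. This is a sketch under assumptions: the function name, the sample data, and the extra "Unknown" label are hypothetical and only show how an abstain bucket would fit in.

```python
from collections import Counter

def self_consistency(samples):
    """Return (modal_label, share) for repeated samples from one model on one claim.

    share is the fraction of samples equal to the most common label;
    1.0 means the model gave the same answer every time.
    """
    counts = Counter(samples)
    modal_label, hits = counts.most_common(1)[0]
    return modal_label, hits / len(samples)

# Hypothetical repeated samples for one claim, with an assumed abstain label.
samples = ["False", "False", "Unknown", "False"]

if __name__ == "__main__":
    print(self_consistency(samples))
```

Comparing this within-model figure to the between-model agreement would indicate how much of the observed disagreement is sampling noise rather than a stable difference between models.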