Disagreement among frontier LLMs on real-world fact-checks

Study setup and main finding

  • Five major LLMs were each asked once per claim to classify 1,000 recent user-submitted “fact-check” claims into one of four buckets (True, Mostly True, Misleading, False), with no explanations and no option to abstain.
  • For about two-thirds of claims, at least one model disagreed with the others or there was no clear majority; agreement as measured by ordinal Krippendorff’s α was reported as “limited but nontrivial.”
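
For readers unfamiliar with the statistic, here is a minimal sketch of ordinal Krippendorff's α for complete data (every model labels every claim). It assumes the four buckets are ordered True < Mostly True < Misleading < False and uses the standard ordinal distance metric. The study's actual computation is not described in the thread, so the function name and the toy data are illustrative only.

```python
from collections import Counter

# Assumed ordinal order of the four buckets (not confirmed by the study).
LABELS = ["True", "Mostly True", "Misleading", "False"]


def ordinal_alpha(units):
    """Ordinal Krippendorff's alpha.

    units: list of label lists, one inner list per claim (one label per model).
    Returns NaN when only one category is ever used (alpha is undefined).
    """
    k = len(LABELS)
    idx = {label: i for i, label in enumerate(LABELS)}

    # Coincidence matrix: ordered label pairs within each claim, weighted by 1/(m-1).
    o = [[0.0] * k for _ in range(k)]
    for ratings in units:
        m = len(ratings)
        if m < 2:
            continue  # a claim with a single rating carries no agreement information
        counts = Counter(idx[r] for r in ratings)
        for c, nc in counts.items():
            for d, nd in counts.items():
                pairs = nc * (nc - 1) if c == d else nc * nd
                o[c][d] += pairs / (m - 1)

    n_c = [sum(row) for row in o]  # marginal totals per category
    n = sum(n_c)

    def delta2(c, d):
        # Squared ordinal distance between categories c and d.
        lo, hi = min(c, d), max(c, d)
        return (sum(n_c[lo:hi + 1]) - (n_c[lo] + n_c[hi]) / 2) ** 2

    cells = [(c, d) for c in range(k) for d in range(k)]
    d_obs = sum(o[c][d] * delta2(c, d) for c, d in cells) / n
    d_exp = sum(n_c[c] * n_c[d] * delta2(c, d) for c, d in cells) / (n * (n - 1))
    if d_exp == 0:
        return float("nan")
    return 1.0 - d_obs / d_exp


# Toy data (made up): two claims, five models each.
perfect = [["True"] * 5, ["False"] * 5]
mixed = [["True", "True", "True", "True", "False"], ["False"] * 5]
alpha_perfect = ordinal_alpha(perfect)
alpha_mixed = ordinal_alpha(mixed)
```

With unanimous labels on every claim the observed disagreement is zero and α equals 1; any within-claim disagreement pulls it below 1.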

Methodological critiques

  • Many commenters argue the headline “disagreement” rate is inflated by:
    • Treating small differences (True vs Mostly True, Misleading vs False) as disagreements.
    • Forcing a label without “I don’t know,” especially for unverifiable or post–training-cutoff events.
    • Using only a single deterministic pass per model and not measuring within-model variance.
  • Lack of a human baseline is seen as a major omission; similar human panels are known to disagree substantially on comparable tasks.
  • Some view the whole setup as evaluating the prompt/harness more than the underlying models.
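
The first critique above can be made concrete by splitting disagreements into minor and polar ones. The sketch below assumes "minor" means all labels stay on one side of the scale (True/Mostly True, or Misleading/False) and "polar" means labels cross that line. That grouping is an interpretation of the commenters' point, not something the study defines, and the example data is made up.

```python
from collections import Counter

# Assumed grouping of the four buckets into two sides.
TRUE_SIDE = {"True", "Mostly True"}
FALSE_SIDE = {"Misleading", "False"}


def disagreement_kind(labels):
    """Classify one claim's set of model labels."""
    distinct = set(labels)
    if len(distinct) == 1:
        return "unanimous"
    if distinct <= TRUE_SIDE or distinct <= FALSE_SIDE:
        return "minor"  # e.g. True vs Mostly True
    return "polar"      # labels fall on both sides of the scale


def summarize(claims):
    """Return the share of claims in each disagreement category."""
    counts = Counter(disagreement_kind(labels) for labels in claims)
    total = len(claims)
    return {kind: counts[kind] / total for kind in ("unanimous", "minor", "polar")}


# Made-up example: four claims, five model labels each.
example = [
    ["True"] * 5,
    ["True", "Mostly True", "True", "True", "True"],
    ["True", "False", "False", "Misleading", "False"],
    ["False"] * 5,
]
shares = summarize(example)
```

Reporting the "minor" and "polar" shares separately would show how much of the headline rate comes from adjacent-label differences.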

Ambiguous labels and rubric issues

  • The four labels are seen as semantically fuzzy and overlapping, especially “Mostly True” and “Misleading.”
  • Without explicit rubric definitions or examples, models may be disagreeing on how to map nuanced situations into these buckets rather than on the underlying facts.

Time, search, and unanswerable claims

  • Several claims concern very recent events, future predictions, or inherently unprovable statements (e.g., extraterrestrial life).
  • Three models had only parametric knowledge; two had web search. Even those two disagreed often, suggesting retrieval does not trivially solve the problem.
  • Many argue that, for such items, the only correct behavior is to say “unknown,” which was deliberately disallowed.

Humans, bias, and the nature of facts

  • Commenters note that humans also disagree heavily on similar claims, especially political, forward-looking, or definition-dependent ones.
  • There is extended discussion of epistemology: facts vs evidence, probabilistic knowledge, and how “fact checking” often embeds value judgments or political framing.
  • Some point out that the corpus itself mixes clear factual items with opinions, predictions, and culturally contested language.

Usefulness, risks, and suggested improvements

  • Some see the results as confirmation that LLMs are unreliable fact-check oracles and should be used mainly as research assistants with human oversight.
  • Others argue the study underestimates models’ practical usefulness because it bans explanations, context, and interactive clarification, which is how people actually use them.
  • Suggested follow-ups include: adding an abstain/unknown bucket, clearer rubrics and examples, collecting human labels, letting models “think out loud,” running multiple samples per model, separating minor vs polar disagreements, and publishing intra-model and human–model agreement.
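
As one illustration of the "multiple samples per model" suggestion, within-model variance could be summarized as the share of repeated runs that match the model's own most common label. This is a minimal sketch with hypothetical function names and made-up data; it is not a method proposed in the thread.

```python
from collections import Counter


def self_consistency(samples):
    """Share of repeated runs that match the modal label for one claim."""
    counts = Counter(samples)
    modal_count = counts.most_common(1)[0][1]
    return modal_count / len(samples)


def mean_self_consistency(runs_by_claim):
    """Average self-consistency across claims for a single model.

    runs_by_claim: dict mapping a claim id to the list of labels
    produced by repeated runs of the same model on that claim.
    """
    scores = [self_consistency(samples) for samples in runs_by_claim.values()]
    return sum(scores) / len(scores)


# Made-up example: one model, two claims, four runs each.
runs = {
    "claim-1": ["True", "True", "Mostly True", "True"],
    "claim-2": ["False", "False", "False", "False"],
}
score = mean_self_consistency(runs)
```

A model whose own runs agree only part of the time sets a ceiling on how much cross-model agreement can reasonably be expected.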