LLMs know more than they show: On the intrinsic representation of hallucinations

Scope of the paper and related work

  • The thread treats this paper as part of a broader line of work: probing internal activations to detect truth/falsehood and “know-what-you-know” calibration.
  • Several related papers are cited that claim LLMs often have internal signals about correctness even when not expressed in outputs, though at least one of these is heavily criticized as overclaiming relative to its figures.
  • The specific contribution here is framed as: truth-related information is concentrated in certain “critical tokens” and in mid-layer activations, and probing these locations detects errors more reliably than output-level signals (see the sketch below).
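
A minimal sketch of what such a probe could look like, assuming a HuggingFace causal LM and a labeled set of (prompt, was-the-answer-correct) pairs. The model name, layer index, and last-token selection heuristic are illustrative assumptions, not the paper's exact setup (the paper selects specific answer-bearing tokens):

```python
# Hedged sketch: train a linear probe on a mid-layer hidden state to
# predict answer correctness. Model/layer choices are assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from sklearn.linear_model import LogisticRegression

MODEL = "mistralai/Mistral-7B-Instruct-v0.2"  # assumed; any causal LM works
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, output_hidden_states=True)
model.eval()

def hidden_at_last_token(prompt: str, layer: int = 16) -> torch.Tensor:
    """Mid-layer activation at the final prompt token -- a crude stand-in
    for the paper's 'critical token' selection."""
    ids = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**ids)
    # hidden_states is a tuple: embeddings + one tensor per layer
    return out.hidden_states[layer][0, -1]  # shape: (hidden_dim,)

def train_probe(labeled_data, layer: int = 16):
    """labeled_data: list of (prompt_with_answer, was_correct) pairs
    collected by grading the model's own answers."""
    X = torch.stack([hidden_at_last_token(p, layer) for p, _ in labeled_data])
    y = [int(correct) for _, correct in labeled_data]
    probe = LogisticRegression(max_iter=1000).fit(X.float().numpy(), y)
    return probe  # probe.predict_proba(h) estimates P(answer is correct)
```

If the probe generalizes, that is evidence the activation encodes correctness information the sampled output did not express.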

Can LLMs encode “truthfulness”?

  • One camp: “truthfulness” is not a meaningful internal property for systems trained only on token correlations; they learn patterns, not truth.
  • Counter-arguments:
    • Humans also mostly learn from language and social consensus, not direct experience.
    • Models can learn categories like “false” or “trivia” from textual patterns (“X is not Y”) and infer that some unseen statements likely belong to the “false” category.
    • Truth-like dimensions could emerge in embedding space much as sentiment or sarcasm directions do (see the sketch below).
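
To make that concrete: the difference-of-means trick commonly used to extract a sentiment direction can be applied to embeddings of true vs. false statements. A toy sketch, where the embeddings are assumed to come from any sentence embedder (nothing here is from the paper):

```python
# Toy "truth direction" in embedding space, analogous to a sentiment axis.
import numpy as np

def truth_direction(true_vecs: np.ndarray, false_vecs: np.ndarray) -> np.ndarray:
    """Unit vector from the mean of false-statement embeddings to the
    mean of true-statement embeddings (difference of means)."""
    d = true_vecs.mean(axis=0) - false_vecs.mean(axis=0)
    return d / np.linalg.norm(d)

def truth_score(vec: np.ndarray, direction: np.ndarray) -> float:
    """Project an unseen statement's embedding onto the direction;
    higher scores fall on the 'true' side of the axis."""
    return float(vec @ direction)
```

Whether such a direction tracks truth or merely surface features (e.g., "X is not Y" phrasing) is exactly what the skeptics in the thread dispute.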

Hallucinations: inherent vs mitigable

  • Strong skeptics argue hallucinations are fundamental: every output is a sample from a learned distribution, so “fixing hallucinations” is conceptually confused; at best you reduce error rates.
  • Others see value in:
    • Taxonomizing hallucination types and causes.
    • Building detectors that flag low-confidence or likely-wrong answers for human review.
    • Using uncertainty measures (entropy, calibration probes) or resampling strategies to reduce harmful errors (sketched below).
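
A generic sketch of two cheap signals of that kind (not any specific paper's method): entropy of the next-token distribution, and agreement across resampled answers:

```python
# Hedged sketch of two uncertainty signals for flagging likely-wrong answers.
import math
from collections import Counter

def token_entropy(probs: list[float]) -> float:
    """Shannon entropy of a next-token distribution; high entropy at
    answer-bearing tokens is a cheap hallucination warning sign."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def self_consistency(answers: list[str]) -> float:
    """Fraction of sampled answers agreeing with the majority answer;
    low agreement across resamples suggests a likely-wrong response."""
    counts = Counter(a.strip().lower() for a in answers)
    return counts.most_common(1)[0][1] / len(answers)

# Example: route low-agreement answers to human review.
samples = ["Paris", "Paris", "Lyon", "Paris", "paris"]
if self_consistency(samples) < 0.7:
    print("low agreement: route to human review")
else:
    print(f"agreement = {self_consistency(samples):.2f}")
```

Neither signal fixes hallucinations; both just prioritize which outputs a human should check, which is the mitigation framing this camp advocates.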

Comparison to humans and continual learning

  • Repeated analogy: humans also hold false beliefs and update them via social “swarms”; current LLMs have static weights, which limits their ability to converge toward consensus truth.
  • Some advocate multi-LLM “swarms” and online learning; others note this is technically and operationally difficult today.

Philosophical and definitional disputes

  • Long subthread on whether human reasoning is “non-statistical,” whether any non-physical “soul” is implied, and whether talk of “knowing” or “truth” for LLMs is meaningful.
  • Some want more precise, less anthropomorphic language (“internal error signal” vs “knows it’s wrong”) to avoid confusion.

Skepticism about research quality and hype

  • Several comments complain that papers and headlines overstate findings (“we found the gene for cancer” vibe).
  • Concerns include weak correlations, poor out-of-distribution performance, and the risk of cherry-picking papers that fit a preferred narrative (either “LLMs know” or “LLMs will always hallucinate”).