2024-06-23

Detecting hallucinations in large language models using semantic entropy

What LLM “hallucinations” are

Many commenters argue that LLMs are “orthogonal to truth”: they optimize for plausible text, not correctness.
Several posters prefer terms like “bullshit” (truth-indifferent output) or “confabulation” (false but fluent narratives) over “hallucination” (which suggests misperception by a mind).
Others think “hallucination” is now a useful term of art and language drift is fine; critics counter that sloppy metaphors will mislead policymakers and the public.

Intentionality and anthropomorphism

Long subthread on whether algorithms can “intend,” “lie,” or “care about” truth.
One side: intentions require minds; computers are just deterministic / formal systems, so attributing mental states is a category error.
Other side: we lack a settled theory of intentionality; under many metaphysical views (physicalism, panpsychism, some idealisms), it’s at least possible AIs could have genuine or pseudo-intentionality, so strong denials are premature.
Several note that for everyday use, simulated intentionality (acting as if intentional) is often enough.

Semantic entropy method

Core idea: sample multiple answers to the same prompt, cluster them by semantic equivalence (judged by another model), then compute the entropy of the distribution over clusters.
High semantic entropy ≈ model gives many divergent meanings ⇒ likely “confabulation.”
Low entropy ≈ model consistently produces similar meanings ⇒ more likely grounded in training data, though not necessarily correct.
A variant decomposes answers into factoids, reformulates each as a question, and re-checks each factoid with the entropy method.
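
The core idea above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's implementation: `are_equivalent` is a hypothetical stand-in for the second model that judges semantic equivalence, `toy_equivalent` is only a string normalizer so the example runs without a model, and the entropy is taken over simple cluster frequencies.

```python
import math
from typing import Callable, List


def cluster_by_meaning(
    answers: List[str],
    are_equivalent: Callable[[str, str], bool],
) -> List[List[str]]:
    """Greedily group answers that the equivalence judge treats as the same meaning."""
    clusters: List[List[str]] = []
    for answer in answers:
        for cluster in clusters:
            # Compare against the first member, used as the cluster representative.
            if are_equivalent(cluster[0], answer):
                cluster.append(answer)
                break
        else:
            # No existing cluster matched, so this answer starts a new one.
            clusters.append([answer])
    return clusters


def semantic_entropy(
    answers: List[str],
    are_equivalent: Callable[[str, str], bool],
) -> float:
    """Entropy (in nats) of the empirical distribution over meaning clusters."""
    clusters = cluster_by_meaning(answers, are_equivalent)
    total = len(answers)
    probs = [len(cluster) / total for cluster in clusters]
    return sum(p * math.log(1.0 / p) for p in probs)


def toy_equivalent(a: str, b: str) -> bool:
    """Assumption: a trivial placeholder for a model-based equivalence check."""
    def normalize(s: str) -> str:
        return s.strip().lower().rstrip(".")
    return normalize(a) == normalize(b)


consistent = ["Paris", "paris.", "Paris"]
divergent = ["Paris", "Lyon", "Marseille", "Nice"]

print(semantic_entropy(consistent, toy_equivalent))  # one cluster, so 0.0
print(semantic_entropy(divergent, toy_equivalent))   # four clusters, about log(4)
```

In a real setup the equivalence check would be far more expensive than the entropy calculation, since each comparison is a call to a model.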

Critiques and limitations

High agreement does not guarantee truth; a model can be confidently wrong (e.g., outdated training data, popular misconceptions).
Entropy measures dispersion of the output distribution, not correctness of that distribution; knowing how certain the model is differs from knowing that it is right.
The method may mislabel creative tasks, or tasks with several valid answers, as hallucinations.
Some see this as just another heuristic layered on top of a fundamentally non-truth-seeking system; others find it a useful partial safety tool, especially when full retraining is impossible.
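
Two of the critiques above can be made concrete with a small calculation. The sketch below assumes the sampled answers have already been assigned to meaning clusters; the cluster labels are hypothetical placeholders, not real model output.

```python
import math
from collections import Counter
from typing import List


def entropy_of_labels(labels: List[str]) -> float:
    """Entropy (in nats) of the empirical distribution over cluster labels."""
    counts = Counter(labels)
    total = len(labels)
    return sum((n / total) * math.log(total / n) for n in counts.values())


# Hypothetical case 1: every sample repeats the same popular misconception.
# The model is consistent, so entropy is zero, yet the answer is wrong.
confidently_wrong = ["misconception"] * 5

# Hypothetical case 2: an open-ended prompt where each sample is a different
# but acceptable answer. Entropy is maximal although nothing is confabulated.
many_valid = ["poem_a", "poem_b", "poem_c", "poem_d", "poem_e"]

print(entropy_of_labels(confidently_wrong))  # 0.0
print(entropy_of_labels(many_valid))         # about log(5)
```

A detector that only thresholds entropy would pass the first case and flag the second, which is the failure mode the critics describe.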

Use cases and broader attitudes

The method is proposed for high-stakes settings (e.g., public agencies) as a way to suppress low-confidence answers and escalate them to humans.
Some argue we should frame LLMs as “hint providers” / improv text generators, not near-AGI or truth oracles.
Ongoing tension between recognizing real utility (coding help, explanation, synthesis) and concern over overhype, misuse, and misplaced trust.
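
The “suppress and escalate” use case mentioned above might look roughly like the following. This is only a sketch: the `Decision` type, the `gate` function, and the default threshold of 0.5 are all hypothetical, and a real deployment would need to calibrate the threshold on its own data.

```python
from dataclasses import dataclass


@dataclass
class Decision:
    action: str  # either "answer" or "escalate"
    text: str


def gate(answer: str, entropy: float, threshold: float = 0.5) -> Decision:
    """Return the model's answer only when semantic entropy is at or below the threshold.

    The threshold value is an assumption chosen for illustration.
    """
    if entropy > threshold:
        # High dispersion across sampled meanings: withhold the answer.
        return Decision("escalate", "Routed to a human reviewer.")
    return Decision("answer", answer)


print(gate("The office opens at 9:00.", entropy=0.1).action)  # answer
print(gate("The office opens at 9:00.", entropy=1.2).action)  # escalate
```

As the critiques note, this gate only filters inconsistent answers; a consistently wrong answer would still pass through.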
