LLMs Will Always Hallucinate, and We Need to Live with This

What “hallucination” means

  • Many argue “hallucination” is a misleading term; LLMs are doing normal probabilistic text generation, not suffering a discrete malfunction.
  • Several say all outputs are essentially hallucinations: probabilistic strings with no built‑in notion of truth; some just happen to match reality.
  • Others prefer terms like “confabulation,” “bullshit,” or simply “inaccuracy,” emphasizing that correctness is a judgment by readers, not the model.
  • One line of argument: “hallucinations” and “alignment” are the same technical problem—constraining outputs to what some authority deems acceptable (truth, safety, morality, etc.).

Inevitability vs mitigation

  • Some accept the paper’s claim that hallucinations cannot be eliminated in principle, but note this says little about how small the error rate can be made in practice.
  • Common analogies: quantum tunneling (a nonzero but negligible probability) and the halting problem (a theoretical limit that rarely blocks practical engineering).
  • Others see current LLM architectures as fundamentally hallucination‑prone and think this will cap their practical scope.
  • A minority says hallucination is a feature for creativity, fiction, and idea generation; a perfectly “truthful” model would be closer to copy‑paste and less useful creatively.

LLMs vs human cognition

  • One camp emphasizes differences: humans can often say “I don’t know,” calibrate confidence, and learn from mistakes; LLMs tend to answer confidently regardless.
  • Another camp stresses similarities: humans also misremember, confabulate, believe nonsense, and “complete the next word” when speaking; some are worse than today’s LLMs.
  • Debate over whether human “intelligence” is qualitatively different or mainly a matter of scale, architecture, and evolutionary pre‑training.

Appropriate use cases

  • Consensus that LLMs are useful where:
    • Outputs are low‑stakes (summaries, boilerplate, creative text, brainstorming).
    • Humans can efficiently verify or correct candidate answers.
  • Strong skepticism for high‑stakes domains (law, medicine, critical research, automation with no human in the loop), because even rare hallucinations can be catastrophic.
  • Some argue true automation requires superhuman reliability, not “human‑level fallibility,” so LLMs are a poor fit as general human replacements.

Mitigation and product design

  • Proposed mitigations include: using token probabilities to estimate confidence, generating multiple answers and checking them for consistency, post‑training to reduce overconfident wrong answers, and external retrieval/sanity‑checking (see the sketch after this list).
  • Disagreement over whether hallucinations are:
    • A “bug” to be fixed inside the model,
    • A deeper design limitation of next‑token prediction, or
    • An inevitable property that must be managed in the surrounding product (e.g., verification layers, constrained domains).
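
The mitigations in the first bullet can be combined into a thin verification layer around the model. Below is a minimal sketch in Python, assuming a hypothetical `sample` hook that returns generated text plus per‑token log‑probabilities (any client that exposes logprobs could be plugged in); the mean log‑probability serves as a crude confidence proxy, and low agreement across repeated samples flags an answer for human review. The answer normalization and the 0.6 threshold are illustrative placeholders, not tuned values.

```python
import math
from collections import Counter
from typing import Callable

# Hypothetical sampling hook: takes a prompt, returns (text, per-token log-probs).
# Swap in any real client that exposes sampled text plus token log-probabilities.
Sampler = Callable[[str], tuple[str, list[float]]]

def mean_logprob_confidence(token_logprobs: list[float]) -> float:
    """Crude confidence proxy: average per-token log-probability,
    mapped back to a probability-like score in (0, 1]."""
    if not token_logprobs:
        return 0.0
    return math.exp(sum(token_logprobs) / len(token_logprobs))

def self_consistency_check(prompt: str, sample: Sampler, n: int = 5,
                           agreement_threshold: float = 0.6) -> dict:
    """Sample the same prompt n times and measure how often the answers agree.
    Low agreement is treated as a hallucination warning, not as proof of error."""
    answers, confidences = [], []
    for _ in range(n):
        text, logprobs = sample(prompt)
        answers.append(text.strip().lower())  # naive answer normalization
        confidences.append(mean_logprob_confidence(logprobs))

    top_answer, top_count = Counter(answers).most_common(1)[0]
    agreement = top_count / n
    return {
        "answer": top_answer,
        "agreement": agreement,
        "mean_confidence": sum(confidences) / n,
        "needs_verification": agreement < agreement_threshold,
    }
```

High agreement and high mean confidence still do not guarantee correctness; the point of such a layer is only to route low‑agreement or low‑confidence answers to a human or to an external retrieval/verification step rather than returning them as fact.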

Hype, business, and ethics

  • Many criticize marketing that presents LLMs as oracles or universal automation, especially to users habituated to trusting top search results.
  • Some see “hallucinations” being downplayed to sustain the AGI/AI‑bubble narrative and justify further investment.
  • Others argue that even fallible tools are worthwhile, but only if users maintain a realistic mental model of their limitations.