Differential Transformer

Hallucinations, uncertainty, and “truth”

  • Several commenters stress the paper only mitigates hallucinations; it doesn’t “solve” them.
  • One camp argues hallucination is just model error: next-token prediction over imperfect data, not something that can be fully removed.
  • Another notes that if models could better “know what they know” (estimate uncertainty) and say “I don’t know,” hallucinations could be reduced in principle.
  • Multiple replies explain why raw token probabilities are not the same as truth probabilities and why low per-token probability does not simply mean “I don’t know.”
  • Others emphasize that LLMs always output plausible text; whether it matches reality depends on training data and objectives, not the mechanism alone.
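A toy illustration of the token-probability point above (all numbers invented): a model can be semantically confident in an answer even though every individual continuation has low probability, because the probability mass is split across many surface forms of the same answer.

```python
# Hypothetical next-continuation probabilities for "The capital of France is ..."
continuations = {
    "Paris": 0.35,
    " Paris, of course": 0.25,
    "the city of Paris": 0.20,
    "Lyon": 0.12,
    "Marseille": 0.08,
}

# Per-string view: the single most likely continuation only has probability 0.35,
# which might naively look like uncertainty.
top_string_prob = max(continuations.values())

# Aggregated by the answer they express, "Paris" actually carries 0.80 of the mass.
paris_mass = sum(p for s, p in continuations.items() if "Paris" in s)
```

This is why low per-token probability does not by itself mean “I don’t know”: the per-string and per-answer views can diverge sharply.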

How Differential Attention Works (as understood)

  • The core change: split Q/K into two groups, compute two softmax attention maps, then subtract (with a learnable scaling λ).
  • Intuition offered:
    • Standard softmax can’t truly zero out irrelevant tokens; it assigns a small positive weight everywhere, which “poisons” outputs and yields weak gradients for learning to ignore noise.
    • Subtracting two softmaxes allows exact or near-zero and even negative attention, improving sparsity and expressiveness.
    • Outputs are no longer constrained to the convex hull of value vectors, giving each head more representational range.
  • Analogies used: differential signaling / noise-cancelling headphones, “negative attention” (actively suppressing distracting tokens), and canceling RoPE-induced long-range positional noise.
  • Some think the main effect is helping the model separate signal from irrelevant context, which then reduces hallucinations in QA/summarization.
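The core change described above can be sketched in a few lines of NumPy. This is a single-head illustration only: shapes, the placement of λ, and the omission of the paper's per-head normalization are simplifications.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def diff_attention(Q1, K1, Q2, K2, V, lam=0.5):
    """Single-head differential attention: subtract two softmax maps.

    Q*, K*: (seq, d) query/key projections for the two groups,
    V: (seq, d_v) values, lam: the learnable scalar lambda.
    """
    d = Q1.shape[-1]
    A1 = softmax(Q1 @ K1.T / np.sqrt(d))
    A2 = softmax(Q2 @ K2.T / np.sqrt(d))
    A = A1 - lam * A2   # entries can now be exactly zero or negative
    return A @ V, A

rng = np.random.default_rng(0)
seq, d = 5, 8
Q1, K1, Q2, K2 = (rng.normal(size=(seq, d)) for _ in range(4))
V = rng.normal(size=(seq, d))
out, A = diff_attention(Q1, K1, Q2, K2, V, lam=0.5)
```

Note that each row of `A` sums to `1 - lam` rather than 1, and individual weights can dip below zero, which is exactly why the output is no longer confined to the convex hull of the value vectors.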

Performance, scaling, and trade-offs

  • Reported gains: similar quality with ~65% of parameters/tokens; strong robustness under 4–6 bit quantization; improved long-context retrieval.
  • Cost: effectively computing attention maps twice per layer, increasing KV-cache size and reducing speed (figures of ~10–30% slower were mentioned).
  • Several note that smaller models with this attention may still be attractive for edge or local inference despite the slowdown.
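A back-of-the-envelope FLOP count for the overhead, under the assumption that the second score map is the only extra matmul (the subtraction happens before the single value mix). All counts are rough illustrations, not measurements from the paper.

```python
def attn_flops(seq, d_head, heads, differential=False):
    """Rough per-layer attention FLOPs (multiply-adds counted as 2 FLOPs).

    Assumes: one QK^T score matmul per attention map, one A@V mix per head,
    and ignores projections, softmax, and normalization.
    """
    score_map = 2 * seq * seq * d_head   # one QK^T matmul
    value_mix = 2 * seq * seq * d_head   # one A @ V matmul
    n_maps = 2 if differential else 1    # differential attention builds two maps
    return heads * (n_maps * score_map + value_mix)

base = attn_flops(seq=4096, d_head=64, heads=32)
diff = attn_flops(seq=4096, d_head=64, heads=32, differential=True)
overhead = diff / base  # 1.5x under these assumptions
```

Under these (simplified) assumptions the attention-only overhead is 1.5x; the reported ~10–30% end-to-end slowdowns are plausible because attention is only part of each layer's cost.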

Creativity, RAG, and application concerns

  • Some worry reducing “hallucination” might also dampen creativity or interpolation between concepts; others see hallucination as sampling error, not creativity.
  • For RAG, stricter focus on provided documents is seen as unambiguously good.
  • Many view hallucinations as inevitable in text-only models; gains are about lowering rates, not eliminating them.

Methodological questions and skepticism

  • Commenters ask for clearer ablations: is the gain really from differential attention versus other hyperparameters, norms, activation functions, or training setups?
  • Some question the λ scheduling formula and the lack of detailed justification for its specific constants.
  • Others see the change as simple, elegant, and likely to be widely tried, but note open questions about impact on generality vs robustness.
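For context on the λ question: the depth-dependent initialization commenters refer to is, as described in the paper (reproduce with care; verify against the source), λ_init = 0.8 − 0.6·exp(−0.3·(l−1)), which ramps the subtraction strength from 0.2 in the first layer toward 0.8 in deep layers.

```python
import math

def lambda_init(layer: int) -> float:
    """Depth-dependent lambda initialization, 1-indexed layer.

    As reported in the Differential Transformer paper:
    lambda_init = 0.8 - 0.6 * exp(-0.3 * (layer - 1)).
    Early layers subtract gently (0.2); deep layers approach 0.8.
    """
    return 0.8 - 0.6 * math.exp(-0.3 * (layer - 1))

first = lambda_init(1)    # 0.2
deep = lambda_init(24)    # close to 0.8
```

The exponential shape is exactly the part commenters found under-motivated: it is presented as an empirical choice rather than derived.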