Differential Transformer
Hallucinations, uncertainty, and “truth”
- Several commenters stress that the paper only mitigates hallucinations; it doesn’t “solve” them.
- One camp argues hallucination is just model error: next-token prediction over imperfect data, not something that can be fully removed.
- Another notes that if models could better “know what they know” (estimate uncertainty) and say “I don’t know,” hallucinations could be reduced in principle.
- Multiple replies explain why raw token probabilities are not the same as truth probabilities, and why low per-token probability does not simply mean “I don’t know”: a model split between several valid paraphrases assigns low probability to each next token while being certain of the underlying fact.
- Others emphasize that LLMs always output plausible text; whether it matches reality depends on training data and objectives, not the mechanism alone.
How differential attention works (as understood by commenters)
- The core change: split Q/K into two groups, compute two softmax attention maps, then subtract one from the other with a learnable scaling λ (see the sketch after this list).
- Intuition offered:
  - Standard softmax can never assign exactly zero attention, so every irrelevant token receives a small weight; this residual noise “poisons” outputs and yields weak gradients for suppressing it.
  - Subtracting two softmaxes allows exactly zero, near-zero, and even negative attention, improving sparsity and expressiveness (see the numeric demo after this list).
  - Outputs are no longer constrained to the convex hull of the value vectors, giving each head more representational range.
- Analogies used: differential signaling / noise cancellation, “negative attention” (actively suppressing distracting tokens), and canceling RoPE-induced long-range positional noise.
- Some think the main effect is helping the model separate signal from irrelevant context, which then reduces hallucinations in QA/summarization.
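To make the mechanism concrete, here is a minimal single-head sketch in PyTorch, following the description above. Causal masking, headwise normalization, and the (1 − λ_init) output scaling from the paper are omitted for brevity; class and parameter names are illustrative, and the λ reparameterization (λ = exp(λ_q1·λ_k1) − exp(λ_q2·λ_k2) + λ_init) follows the paper.

```python
import math

import torch
import torch.nn.functional as F


class DiffAttentionHead(torch.nn.Module):
    """Minimal single-head differential attention sketch.

    Q and K are projected to twice the head dimension and split into two
    groups; the two resulting softmax maps are subtracted with a learnable
    scalar lambda. Masking and normalization are omitted for brevity.
    """

    def __init__(self, d_model: int, d_head: int, lambda_init: float = 0.8):
        super().__init__()
        self.w_q = torch.nn.Linear(d_model, 2 * d_head, bias=False)
        self.w_k = torch.nn.Linear(d_model, 2 * d_head, bias=False)
        self.w_v = torch.nn.Linear(d_model, d_head, bias=False)
        # Reparameterized lambda, per the paper:
        # lambda = exp(lq1 . lk1) - exp(lq2 . lk2) + lambda_init
        self.lq1 = torch.nn.Parameter(0.1 * torch.randn(d_head))
        self.lk1 = torch.nn.Parameter(0.1 * torch.randn(d_head))
        self.lq2 = torch.nn.Parameter(0.1 * torch.randn(d_head))
        self.lk2 = torch.nn.Parameter(0.1 * torch.randn(d_head))
        self.lambda_init = lambda_init

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model)
        q1, q2 = self.w_q(x).chunk(2, dim=-1)  # two (batch, seq, d_head) groups
        k1, k2 = self.w_k(x).chunk(2, dim=-1)
        v = self.w_v(x)
        scale = 1.0 / math.sqrt(q1.shape[-1])
        a1 = F.softmax(q1 @ k1.transpose(-2, -1) * scale, dim=-1)
        a2 = F.softmax(q2 @ k2.transpose(-2, -1) * scale, dim=-1)
        lam = (torch.exp(self.lq1 @ self.lk1)
               - torch.exp(self.lq2 @ self.lk2)
               + self.lambda_init)
        # The subtraction lets effective weights reach zero or go negative.
        return (a1 - lam * a2) @ v


head = DiffAttentionHead(d_model=64, d_head=16)
print(head(torch.randn(2, 10, 64)).shape)  # torch.Size([2, 10, 16])
```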
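A quick numeric illustration of the sparsity point: a single softmax over finite logits is strictly positive everywhere, while a difference of two softmax maps can cancel to zero or flip sign (printed values are approximate).

```python
import torch
import torch.nn.functional as F

a = F.softmax(torch.tensor([4.0, 0.0, -4.0]), dim=-1)
print(a)      # tensor([0.9817, 0.0180, 0.0003]); small but never exactly zero

b = F.softmax(torch.tensor([0.0, 4.0, -4.0]), dim=-1)
print(a - b)  # tensor([0.9637, -0.9637, 0.0000]); zero and negative weights
```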
Performance, scaling, and trade-offs
- Reported gains: quality comparable to a standard Transformer with ~65% of the parameters or training tokens; strong robustness under 4–6 bit quantization; improved long-context retrieval.
- Cost: effectively computing attention twice per layer, which increases KV-cache size and reduces throughput (~10–30% slower was mentioned; a back-of-the-envelope cache sketch follows this list).
- Several note that smaller models with this attention may still be attractive for edge or local inference despite the slowdown.
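A back-of-the-envelope sketch of the cache point, assuming both key groups are cached while values are shared; the real overhead depends on how an implementation splits head dimensions, and the model sizes below are illustrative, not from the paper.

```python
def kv_cache_bytes(layers: int, heads: int, d_head: int, seq_len: int,
                   bytes_per_elem: int = 2, k_groups: int = 1) -> int:
    """Rough KV-cache size: k_groups key tensors plus one value tensor
    per layer, head, and position. k_groups=2 models caching both key
    groups of differential attention."""
    return layers * heads * seq_len * d_head * (k_groups + 1) * bytes_per_elem


# Hypothetical 32-layer model, 32 heads of dim 128, 8k context, fp16.
std = kv_cache_bytes(32, 32, 128, 8192, k_groups=1)
diff = kv_cache_bytes(32, 32, 128, 8192, k_groups=2)
print(f"standard: {std / 2**30:.1f} GiB, differential: {diff / 2**30:.1f} GiB")
# standard: 4.0 GiB, differential: 6.0 GiB
```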
Creativity, RAG, and application concerns
- Some worry reducing “hallucination” might also dampen creativity or interpolation between concepts; others see hallucination as sampling error, not creativity.
- For RAG, stricter focus on provided documents is seen as unambiguously good.
- Many view hallucinations as inevitable in text-only models; gains are about lowering rates, not eliminating them.
Methodological questions and skepticism
- Commenters ask for clearer ablations: is the gain really from differential attention versus other hyperparameters, norms, activation functions, or training setups?
- Some question the λ initialization schedule and the lack of detailed justification for its constants (the formula is reproduced below).
- Others see the change as simple, elegant, and likely to be widely tried, but note open questions about impact on generality vs robustness.
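For reference, the layer-dependent initialization being questioned is, as stated in the paper, λ_init = 0.8 − 0.6·exp(−0.3·(l − 1)) for layer index l; the commenters’ complaint is that the constants 0.8, 0.6, and 0.3 arrive without derivation. A small sketch of how it ramps with depth:

```python
import math

# lambda_init = 0.8 - 0.6 * exp(-0.3 * (l - 1)), per the paper.
for layer in (1, 2, 4, 8, 12, 24):
    lam = 0.8 - 0.6 * math.exp(-0.3 * (layer - 1))
    print(f"layer {layer:2d}: lambda_init = {lam:.3f}")
# Ramps from 0.2 at the first layer toward 0.8 in deep layers,
# i.e. deeper layers start out subtracting more of the second map.
```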