Differential Transformer

Hallucinations, uncertainty, and “truth”

  • Several commenters stress the paper only mitigates hallucinations; it doesn’t “solve” them.
  • One camp argues hallucination is just model error: next-token prediction over imperfect data, not something that can be fully removed.
  • Another notes that if models could better “know what they know” (estimate uncertainty) and say “I don’t know,” hallucinations could be reduced in principle.
  • Multiple replies explain why raw token probabilities are not the same as truth probabilities and why low per-token probability does not simply mean “I don’t know.”
  • Others emphasize that LLMs always output plausible text; whether it matches reality depends on training data and objectives, not the mechanism alone.
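A toy illustration of the token-probability point above (all numbers invented): a model can be semantically confident in an answer even though every individual continuation has low probability, because the probability mass is split across many surface forms of the same answer.

```python
# Hypothetical next-continuation probabilities for "The capital of France is ..."
continuations = {
    "Paris": 0.35,
    " Paris, of course": 0.25,
    "the city of Paris": 0.20,
    "Lyon": 0.12,
    "Marseille": 0.08,
}

# Per-string view: the single most likely continuation only has probability 0.35,
# which might naively look like uncertainty.
top_string_prob = max(continuations.values())

# Aggregated by the answer they express, "Paris" actually carries 0.80 of the mass.
paris_mass = sum(p for s, p in continuations.items() if "Paris" in s)
```

This is why low per-token probability does not by itself mean “I don’t know”: the per-string and per-answer views can diverge sharply.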

How Differential Attention Works (as understood)

  • The core change: split Q/K into two groups, compute two softmax attention maps, then subtract (with a learnable scaling λ).
  • Intuition offered:
    • Standard softmax can’t truly zero out irrelevant tokens; it assigns a small positive weight everywhere, which “poisons” outputs and yields weak gradients for learning to ignore noise.
    • Subtracting two softmaxes allows exact or near-zero and even negative attention, improving sparsity and expressiveness.
    • Outputs are no longer constrained to the convex hull of value vectors, giving each head more representational range.
  • Analogies used: differential signaling / noise-cancelling headphones, “negative attention” (actively suppressing distracting tokens), and canceling RoPE-induced long-range positional noise.
  • Some think the main effect is helping the model separate signal from irrelevant context, which then reduces hallucinations in QA/summarization.
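The core change described above can be sketched in a few lines of NumPy. This is a single-head illustration only: shapes, the placement of λ, and the omission of the paper's per-head normalization are simplifications.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def diff_attention(Q1, K1, Q2, K2, V, lam=0.5):
    """Single-head differential attention: subtract two softmax maps.

    Q*, K*: (seq, d) query/key projections for the two groups,
    V: (seq, d_v) values, lam: the learnable scalar lambda.
    """
    d = Q1.shape[-1]
    A1 = softmax(Q1 @ K1.T / np.sqrt(d))
    A2 = softmax(Q2 @ K2.T / np.sqrt(d))
    A = A1 - lam * A2   # entries can now be exactly zero or negative
    return A @ V, A

rng = np.random.default_rng(0)
seq, d = 5, 8
Q1, K1, Q2, K2 = (rng.normal(size=(seq, d)) for _ in range(4))
V = rng.normal(size=(seq, d))
out, A = diff_attention(Q1, K1, Q2, K2, V, lam=0.5)
```

Note that each row of `A` sums to `1 - lam` rather than 1, and individual weights can dip below zero, which is exactly why the output is no longer confined to the convex hull of the value vectors.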

Performance, scaling, and trade-offs

  • Reported gains: similar quality with ~65% of parameters/tokens; strong robustness under 4–6 bit quantization; improved long-context retrieval.
  • Cost: effectively computing attention maps twice per layer, increasing KV-cache size and reducing speed (figures of ~10–30% slower were mentioned).
  • Several note that smaller models with this attention may still be attractive for edge or local inference despite the slowdown.
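A back-of-the-envelope FLOP count for the overhead, under the assumption that the second score map is the only extra matmul (the subtraction happens before the single value mix). All counts are rough illustrations, not measurements from the paper.

```python
def attn_flops(seq, d_head, heads, differential=False):
    """Rough per-layer attention FLOPs (multiply-adds counted as 2 FLOPs).

    Assumes: one QK^T score matmul per attention map, one A@V mix per head,
    and ignores projections, softmax, and normalization.
    """
    score_map = 2 * seq * seq * d_head   # one QK^T matmul
    value_mix = 2 * seq * seq * d_head   # one A @ V matmul
    n_maps = 2 if differential else 1    # differential attention builds two maps
    return heads * (n_maps * score_map + value_mix)

base = attn_flops(seq=4096, d_head=64, heads=32)
diff = attn_flops(seq=4096, d_head=64, heads=32, differential=True)
overhead = diff / base  # 1.5x under these assumptions
```

Under these (simplified) assumptions the attention-only overhead is 1.5x; the reported ~10–30% end-to-end slowdowns are plausible because attention is only part of each layer's cost.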

Creativity, RAG, and application concerns

  • Some worry reducing “hallucination” might also dampen creativity or interpolation between concepts; others see hallucination as sampling error, not creativity.
  • For RAG, stricter focus on provided documents is seen as unambiguously good.
  • Many view hallucinations as inevitable in text-only models; gains are about lowering rates, not eliminating them.

Methodological questions and skepticism

  • Commenters ask for clearer ablations: is the gain really from differential attention versus other hyperparameters, norms, activation functions, or training setups?
  • Some question the λ scheduling formula and the lack of detailed justification for its specific constants.
  • Others see the change as simple, elegant, and likely to be widely tried, but note open questions about impact on generality vs robustness.
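For context on the λ question: the depth-dependent initialization commenters refer to is, as described in the paper (reproduce with care; verify against the source), λ_init = 0.8 − 0.6·exp(−0.3·(l−1)), which ramps the subtraction strength from 0.2 in the first layer toward 0.8 in deep layers.

```python
import math

def lambda_init(layer: int) -> float:
    """Depth-dependent lambda initialization, 1-indexed layer.

    As reported in the Differential Transformer paper:
    lambda_init = 0.8 - 0.6 * exp(-0.3 * (layer - 1)).
    Early layers subtract gently (0.2); deep layers approach 0.8.
    """
    return 0.8 - 0.6 * math.exp(-0.3 * (layer - 1))

first = lambda_init(1)    # 0.2
deep = lambda_init(24)    # close to 0.8
```

The exponential shape is exactly the part commenters found under-motivated: it is presented as an empirical choice rather than derived.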