Tensor Product Attention Is All You Need
Paper Title, Acronyms, and Naming Trends
- Many commenters are tired of “X is all you need” titles and see them as clickbait or SEO for citations.
- Others defend catchy titles as practical: they help papers be remembered and get read in a crowded field.
- The “T6” acronym (Tensor ProducT ATTenTion Transformer) is viewed as forced; an alternative “T-POT” is suggested but conflicts with an existing ML project.
- Some lament that HN discussions fixate on titles instead of content; others see title riffs as in-jokes referencing classic CS memes (“…considered harmful”).
Core Idea and Claimed Contributions
- The method factorizes Q, K, V as tensor products, reducing KV cache size by up to an order of magnitude during autoregressive inference.
- A key claim (esp. in section 3.4) is that it unifies various attention mechanisms — multi-head (MHA), multi-query (MQA), and grouped-query (GQA) attention — under one framework, with trade-offs between memory, compute, and representational power.
- One commenter praises the background section as especially clear and succinct.
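To make the core idea concrete, here is a minimal numpy sketch of caching low-rank tensor-product factors instead of full per-token key slices. All sizes, projection names (`W_a`, `W_b`), and the rank are illustrative assumptions, not values from the paper:

```python
import numpy as np

# Hypothetical sizes (assumptions for illustration, not from the paper).
num_heads, head_dim, rank = 8, 64, 2
d_model = num_heads * head_dim

rng = np.random.default_rng(0)
x = rng.standard_normal(d_model)  # one token's hidden state

# Stand-in learned projections produce small per-token factors...
W_a = rng.standard_normal((rank, num_heads, d_model)) / np.sqrt(d_model)
W_b = rng.standard_normal((rank, head_dim, d_model)) / np.sqrt(d_model)
a = W_a @ x  # (rank, num_heads): head-side factors
b = W_b @ x  # (rank, head_dim):  feature-side factors

# ...whose sum of tensor products reconstructs a full (heads, head_dim) K slice.
K = np.einsum('rh,rd->hd', a, b) / rank

# The cache stores the factors, not the reconstructed matrix:
full_cache = num_heads * head_dim               # 512 floats per token
factored_cache = rank * (num_heads + head_dim)  # 144 floats per token
print(full_cache, factored_cache)
```

With these toy numbers the per-token cache shrinks roughly 3.5x; the order-of-magnitude figure claimed in the paper would depend on the actual ranks and head configuration used.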
Memory, Compute, and Performance Debate
- Practical concern: long context windows strain (a) KV cache memory and (b) decode speed; this work clearly addresses (a), but whether it also helps (b) is debated.
- Some argue decompositions reduce memory traffic enough to improve speed; others note the paper shows training benchmarks only and no explicit inference speedups.
- There is extended disagreement over whether LLM inference is primarily memory-bound or compute-bound, and how batch size changes this. No consensus emerges.
- It’s clarified that this is not a post-hoc tensor decomposition of existing weights but an architecture that works directly with factorized components.
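A back-of-envelope arithmetic-intensity estimate helps frame the memory-bound vs. compute-bound debate. The model sizes below are illustrative assumptions (not measurements from the thread or the paper), but they show why single-stream decode tends to be memory-bound and why shrinking the KV cache could plausibly help speed:

```python
# Rough arithmetic intensity of one attention decode step, batch size 1.
# All numbers are illustrative assumptions, not benchmarks.
bytes_per_elem = 2             # fp16
layers, heads, head_dim = 32, 32, 128
seq_len = 32_000               # long context

# KV cache read per step: K and V for every cached token, every layer.
kv_cache_bytes = 2 * layers * heads * head_dim * seq_len * bytes_per_elem

# Attention FLOPs per step: q.K and attn.V, roughly 4 * d * seq_len per layer.
flops = 4 * layers * heads * head_dim * seq_len

intensity = flops / kv_cache_bytes  # FLOPs per byte of KV cache traffic
print(f"KV cache: {kv_cache_bytes/1e9:.1f} GB, intensity: {intensity:.2f} FLOP/B")
```

At about 1 FLOP per byte, a GPU capable of hundreds of FLOPs per byte of bandwidth sits idle waiting on memory, which is the memory-bound side of the argument; larger batch sizes amortize weight reads and push the workload toward compute-bound, which is why the thread reaches no consensus.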
Context, Novelty, and Related Work
- Several comments explain that “Attention is All You Need” showed you could discard recurrence/convolutions and rely solely on attention while retaining performance.
- “Novel” in abstracts is defended as standard publishing practice driven by review criteria.
- A related preprint (“Element-wise Attention is All You Need”) is mentioned as potentially more efficient but not obviously subsumed by this framework.
- One question — why memory is said to grow linearly with sequence length rather than with the quadratic scaling some expect from attention — is raised but not resolved in the thread.
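One plausible reading of that question (my hedged interpretation, not something settled in the thread): cached K/V entries grow linearly with sequence length, while the quadratic score matrix never needs to be fully materialized during incremental decoding — each step computes only one new row. A toy accounting sketch, with arbitrary illustrative sizes:

```python
# Hedged illustration, not from the thread: per-step decode memory
# with toy sizes, showing linear growth in sequence length.
def decode_step_memory(seq_len, heads=8, head_dim=64):
    kv_cache = 2 * seq_len * heads * head_dim  # K and V: linear in seq_len
    scores_row = heads * seq_len               # one new score row per step
    return kv_cache + scores_row

growth = [decode_step_memory(L) for L in (1_000, 2_000, 4_000)]
print(growth)  # doubling seq_len doubles memory: linear, not quadratic
```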