Tensor Product Attention Is All You Need
Paper Title, Acronyms, and Naming Trends
- Many commenters are tired of “X is all you need” titles and see them as clickbait or SEO for citations.
- Others defend catchy titles as practical: they help papers be remembered and get read in a crowded field.
- The “T6” acronym (Tensor ProducT ATTenTion Transformer) is viewed as forced; an alternative “T-POT” is suggested but conflicts with an existing ML project.
- Some lament that HN discussions fixate on titles instead of content; others see title riffs as in-jokes referencing classic CS memes (“…considered harmful”).
Core Idea and Claimed Contributions
- The method factorizes Q, K, V as tensor products, reducing KV cache size by up to an order of magnitude during autoregressive inference.
- A key claim (esp. in section 3.4) is that it unifies various attention mechanisms — multi-head (MHA), multi-query (MQA), and grouped-query (GQA) attention — under one framework, with trade-offs between memory, compute, and representational power.
- One commenter praises the background section as especially clear and succinct.
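To make the core idea concrete, here is a minimal numpy sketch of caching low-rank tensor-product factors instead of full per-token key slices. All sizes, projection names (`W_a`, `W_b`), and the rank are illustrative assumptions, not values from the paper:

```python
import numpy as np

# Hypothetical sizes (assumptions for illustration, not from the paper).
num_heads, head_dim, rank = 8, 64, 2
d_model = num_heads * head_dim

rng = np.random.default_rng(0)
x = rng.standard_normal(d_model)  # one token's hidden state

# Stand-in learned projections produce small per-token factors...
W_a = rng.standard_normal((rank, num_heads, d_model)) / np.sqrt(d_model)
W_b = rng.standard_normal((rank, head_dim, d_model)) / np.sqrt(d_model)
a = W_a @ x  # (rank, num_heads): head-side factors
b = W_b @ x  # (rank, head_dim):  feature-side factors

# ...whose sum of tensor products reconstructs a full (heads, head_dim) K slice.
K = np.einsum('rh,rd->hd', a, b) / rank

# The cache stores the factors, not the reconstructed matrix:
full_cache = num_heads * head_dim               # 512 floats per token
factored_cache = rank * (num_heads + head_dim)  # 144 floats per token
print(full_cache, factored_cache)
```

With these toy numbers the per-token cache shrinks roughly 3.5x; the order-of-magnitude figure claimed in the paper would depend on the actual ranks and head configuration used.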
Memory, Compute, and Performance Debate
- Practical concern: long context windows strain (a) KV cache memory and (b) decode speed; this work clearly addresses (a), but whether it also helps (b) is debated.
- Some argue decompositions reduce memory traffic enough to improve speed; others note the paper shows training benchmarks only and no explicit inference speedups.
- There is extended disagreement over whether LLM inference is primarily memory-bound or compute-bound, and how batch size changes this. No consensus emerges.
- It’s clarified that this is not a post-hoc tensor decomposition of existing weights but an architecture that works directly with factorized components.
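A back-of-envelope arithmetic-intensity estimate helps frame the memory-bound vs. compute-bound debate. The model sizes below are illustrative assumptions (not measurements from the thread or the paper), but they show why single-stream decode tends to be memory-bound and why shrinking the KV cache could plausibly help speed:

```python
# Rough arithmetic intensity of one attention decode step, batch size 1.
# All numbers are illustrative assumptions, not benchmarks.
bytes_per_elem = 2             # fp16
layers, heads, head_dim = 32, 32, 128
seq_len = 32_000               # long context

# KV cache read per step: K and V for every cached token, every layer.
kv_cache_bytes = 2 * layers * heads * head_dim * seq_len * bytes_per_elem

# Attention FLOPs per step: q.K and attn.V, roughly 4 * d * seq_len per layer.
flops = 4 * layers * heads * head_dim * seq_len

intensity = flops / kv_cache_bytes  # FLOPs per byte of KV cache traffic
print(f"KV cache: {kv_cache_bytes/1e9:.1f} GB, intensity: {intensity:.2f} FLOP/B")
```

At about 1 FLOP per byte, a GPU capable of hundreds of FLOPs per byte of bandwidth sits idle waiting on memory, which is the memory-bound side of the argument; larger batch sizes amortize weight reads and push the workload toward compute-bound, which is why the thread reaches no consensus.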
Context, Novelty, and Related Work
- Several comments explain that “Attention is All You Need” showed you could discard recurrence/convolutions and rely solely on attention while retaining performance.
- “Novel” in abstracts is defended as standard publishing practice driven by review criteria.
- A related preprint (“Element-wise Attention is All You Need”) is mentioned as potentially more efficient but not obviously subsumed by this framework.
- One question — why memory is said to grow linearly with sequence length rather than with the quadratic scaling some expect from attention — is raised but not resolved in the thread.
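One plausible reading of that question (my hedged interpretation, not something settled in the thread): cached K/V entries grow linearly with sequence length, while the quadratic score matrix never needs to be fully materialized during incremental decoding — each step computes only one new row. A toy accounting sketch, with arbitrary illustrative sizes:

```python
# Hedged illustration, not from the thread: per-step decode memory
# with toy sizes, showing linear growth in sequence length.
def decode_step_memory(seq_len, heads=8, head_dim=64):
    kv_cache = 2 * seq_len * heads * head_dim  # K and V: linear in seq_len
    scores_row = heads * seq_len               # one new score row per step
    return kv_cache + scores_row

growth = [decode_step_memory(L) for L in (1_000, 2_000, 4_000)]
print(growth)  # doubling seq_len doubles memory: linear, not quadratic
```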