Addition is all you need for energy-efficient language models
Compute vs memory and real energy savings
- Several commenters argue that transformers are more memory-bandwidth-bound than compute-bound, especially for single-user / small-batch inference.
- The cited “95% / 80% energy reduction” is criticized as being measured only on isolated fp32 multipliers/dot products, not end-to-end inference, where fetching weights dominates power.
- Others note that prefill, training, and large-batch decoding can still be compute-dominated, so compute-efficient schemes may matter more there.
- Consensus: reducing multiplications helps, but without reducing memory traffic, system-level gains may be modest (see the back-of-envelope sketch below).
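
A rough arithmetic-intensity estimate, in Python, of the compute-vs-memory argument above. The 7B parameter count, ~300 TFLOP/s, and ~2 TB/s figures are illustrative assumptions (roughly A100-class hardware), not numbers from the paper or the thread.

```python
# Back-of-envelope: is a decode step compute-bound or memory-bound?
# All hardware numbers are illustrative assumptions; swap in your own.

PARAMS          = 7e9      # assumed 7B-parameter dense model
BYTES_PER_PARAM = 2        # fp16/BF16 weights
PEAK_FLOPS      = 300e12   # assumed ~300 TFLOP/s dense fp16 throughput
PEAK_BW         = 2e12     # assumed ~2 TB/s HBM bandwidth

def decode_step(batch_size: int) -> None:
    # Each generated token does ~2 FLOPs per weight (multiply-accumulate),
    # while the weights are streamed from HBM once per step regardless of
    # batch size, so small batches starve the ALUs.
    flops = 2 * PARAMS * batch_size
    bytes_moved = PARAMS * BYTES_PER_PARAM

    t_compute = flops / PEAK_FLOPS
    t_memory = bytes_moved / PEAK_BW
    bound = "memory" if t_memory > t_compute else "compute"
    print(f"batch={batch_size:4d}  compute={t_compute*1e3:7.3f} ms  "
          f"memory={t_memory*1e3:7.3f} ms  -> {bound}-bound")

for batch in (1, 8, 64, 512):
    decode_step(batch)
```

Under these assumptions, single-user decode is dominated by weight fetches by roughly two orders of magnitude, which is the thread's point: cheaper multiplies alone barely move that bottleneck, while large batches (or prefill) shift the balance back toward compute.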
Numeric formats: fp32, fp16/BF16, fp8, fp4, int
- fp32 is seen as overkill for inference; fp16/BF16 are “unquantized,” fp8 is “lightly quantized” and widely used for large LLMs with small quality loss.
- Some point out that the paper’s power claims are for fp32, while its accuracy results are for fp8, calling this comparison “disingenuous.”
- Discussion of fp4/fp8 as compressed formats with shared scaling factors; multiplications can be LUT-based, but accumulations still require higher precision (see the block-scaling sketch after this list).
- There is debate over when to use which precision; the rule of thumb is to use the lowest precision that meets quality and memory constraints, with diminishing returns above fp8 for inference.
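
A minimal sketch of the shared-scaling-factor idea in Python. int8 storage stands in for fp8/fp4 (NumPy has no native fp8 dtype), and the block size of 32 and symmetric rounding are assumptions for illustration rather than details from the paper; the wider accumulator mirrors the point about accumulation precision.

```python
import numpy as np

def quantize_blockwise(x: np.ndarray, block: int = 32):
    """Quantize a 1-D fp32 vector to int8 with one shared scale per block.

    Mimics the structure of block-scaled low-bit formats: narrow elements
    plus a shared per-block scale factor. int8 is used only because NumPy
    has no fp8 dtype.
    """
    x = x.reshape(-1, block)
    scale = np.abs(x).max(axis=1, keepdims=True) / 127.0  # shared scale per block
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dot_quantized(qa, sa, qb, sb) -> float:
    # Element products are formed from the narrow operands, but the
    # accumulation is carried out in a wider type (int32 here, then fp),
    # matching the observation that accumulators need extra precision.
    acc = (qa.astype(np.int32) * qb.astype(np.int32)).sum(axis=1)
    return float((acc * (sa * sb).squeeze()).sum())

rng = np.random.default_rng(0)
a = rng.standard_normal(1024).astype(np.float32)
b = rng.standard_normal(1024).astype(np.float32)
qa, sa = quantize_blockwise(a)
qb, sb = quantize_blockwise(b)
print("fp32 dot      :", float(a @ b))
print("quantized dot :", dot_quantized(qa, sa, qb, sb))
```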
Logarithmic / addition-only representations
- Multiple commenters identify the method as a form of logarithmic number system where multiplications become additions.
- The difficult part is handling accumulations and wide dynamic ranges in log space without large errors (illustrated in the sketch after this list).
- Prior related work is cited (log-number representations, approximate gradients), and some are surprised the paper doesn’t engage more with that literature or derive error terms clearly.
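
To make the accumulation problem concrete, here is a minimal logarithmic-number-system sketch in Python. It illustrates the general idea the thread identifies, not the paper's specific algorithm, and quantizing the log to a few fractional bits is an assumption chosen for illustration.

```python
import numpy as np

# Minimal logarithmic-number-system (LNS) sketch: a value is stored as
# (sign, quantized log2 of its magnitude), so multiplication of magnitudes
# becomes addition of logs.

def to_lns(x, frac_bits=4):
    # Storing the log with only a few fractional bits is what makes LNS
    # hardware cheap; it is also the source of approximation error.
    logmag = np.log2(np.abs(x))
    step = 2.0 ** -frac_bits
    return np.sign(x), np.round(logmag / step) * step

def lns_multiply(a, b):
    # Multiplication: add the logs, multiply the signs (a single XOR in
    # hardware). No multiplier is needed.
    sa, la = a
    sb, lb = b
    return sa * sb, la + lb

def lns_dot(a, b):
    # The hard part: accumulation. Each product is converted back to the
    # linear domain and summed -- the step LNS hardware tries to avoid or
    # approximate, and where the dynamic-range and error issues show up.
    sign, logmag = lns_multiply(a, b)
    return float((sign * np.exp2(logmag)).sum())

rng = np.random.default_rng(0)
x, y = rng.standard_normal(1000), rng.standard_normal(1000)
print("exact dot:", float(x @ y))
for bits in (8, 4, 2):
    print(f"LNS dot, {bits} fractional log bits:", lns_dot(to_lns(x, bits), to_lns(y, bits)))
```

Coarser log quantization makes the "multiplications" cheaper but visibly degrades the accumulated result, which is the trade-off commenters are pointing at.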
Hardware implications and ecosystem
- Some envision custom architectures with compute colocated with memory (systolic arrays, compute-in-memory, FPGA/DRAM ALUs) where addition-heavy schemes could shine.
- Others stress that even with addition-only kernels, the workload remains massively parallel and still maps well to GPUs (see the vectorized sketch after this list).
- A question is raised about whether the approach would be faster in practice; the thread notes that the paper emphasizes energy rather than latency, and that specialized hardware is explicitly recommended (and "patent pending").
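
A small NumPy sketch of why an addition-based approximate multiply stays GPU-friendly. It uses a Mitchell-style bit-pattern trick as a stand-in (related in spirit to, but not the same as, the paper's method): the int32 bit pattern of a positive float is roughly proportional to its log2, so one integer addition approximates a floating-point multiplication, and the kernel remains a purely elementwise, vectorizable operation.

```python
import numpy as np

BIAS = 0x3F800000  # int32 bit pattern of 1.0f

def approx_mul(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Approximate a * b for positive fp32 inputs with one integer addition.

    Elementwise integer ops on the raw bit patterns; in vectorized form this
    maps to SIMD/GPU execution just like an ordinary multiply kernel.
    """
    ia = a.view(np.int32).astype(np.int64)   # widen to avoid int32 overflow
    ib = b.view(np.int32).astype(np.int64)
    return (ia + ib - BIAS).astype(np.int32).view(np.float32)

rng = np.random.default_rng(0)
a = rng.uniform(0.5, 2.0, 8).astype(np.float32)
b = rng.uniform(0.5, 2.0, 8).astype(np.float32)
exact = a * b
approx = approx_mul(a, b)
print("exact  :", exact)
print("approx :", approx)
print("max relative error:", float(np.max(np.abs(approx / exact - 1))))
```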
Corporate influence and Nvidia speculation
- One commenter proposes a conspiracy theory that GPU vendors suppress research that would devalue multipliers; others strongly reject this, citing:
  - Competing funders (big tech companies) would have incentives to support such work.
  - GPU vendors themselves publish research on novel number formats and log-based schemes.
  - Most of Nvidia's advantage is attributed to ecosystem and architecture, not just multipliers.