Addition is all you need for energy-efficient language models

Compute vs memory and real energy savings

  • Several commenters argue that transformers are more memory-bandwidth-bound than compute-bound, especially for single-user / small-batch inference.
  • The cited “95% / 80% energy reduction” is criticized as being measured only on isolated fp32 multipliers/dot products, not end-to-end inference, where fetching weights dominates power.
  • Others note that prefill and multi-batch decoding, training, and large-batch inference can still be compute-dominated, so compute-efficient schemes may matter more there.
  • Consensus: reducing multiplications helps, but without reducing memory traffic, system-level gains may be modest.
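The bandwidth-vs-compute distinction above can be made concrete with a back-of-the-envelope arithmetic-intensity calculation (a sketch with illustrative numbers, not figures from the paper or thread):

```python
# Sketch: arithmetic intensity (FLOPs per byte of weight traffic) for one
# (d x d) weight matrix at a given batch size, assuming fp16 (2 bytes/weight).
# Activations are ignored since weight traffic dominates for large d.

def arithmetic_intensity(d: int, batch: int, bytes_per_weight: float = 2.0) -> float:
    flops = 2 * d * d * batch               # one multiply-accumulate per weight per token
    bytes_moved = d * d * bytes_per_weight  # each weight is fetched once per pass
    return flops / bytes_moved

# Single-user decode (batch=1): ~1 FLOP per byte fetched -> bandwidth-bound,
# so cheaper multipliers barely move total energy.
print(arithmetic_intensity(4096, 1))    # 1.0

# Prefill / large-batch inference (batch=256): hundreds of FLOPs per byte
# -> compute-bound, where addition-only arithmetic could matter.
print(arithmetic_intensity(4096, 256))  # 256.0
```

This is the quantitative version of the consensus bullet: the arithmetic savings only dominate once each fetched weight is reused many times.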

Numeric formats: fp32, fp16/BF16, fp8, fp4, int

  • fp32 is seen as overkill for inference; fp16/BF16 are treated as effectively “unquantized,” while fp8 is “lightly quantized” and widely used for large LLMs with only small quality loss.
  • Some point out that the paper’s power claims are for fp32, while its accuracy results are for fp8, calling this comparison “disingenuous.”
  • Discussion of fp4/fp8 as compressed formats with shared scaling factors; multiplications can be LUT-based, but accumulations still require higher precision.
  • There’s debate over when to use each precision; the rule of thumb offered is to use the lowest precision that fits quality and memory constraints, with diminishing returns above fp8 at inference.
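The shared-scaling-factor idea mentioned above can be sketched in a few lines. This toy uses a symmetric integer grid for simplicity; real fp8 formats (e4m3/e5m2) have a nonuniform grid, so treat this only as an illustration of blockwise scaling:

```python
import numpy as np

def quantize_block(x: np.ndarray, bits: int = 8):
    """Quantize a block of values to a low-bit grid with one shared scale."""
    qmax = 2 ** (bits - 1) - 1           # e.g. 127 for 8 bits
    scale = np.abs(x).max() / qmax       # single scale shared by the whole block
    q = np.round(x / scale).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

x = np.array([0.1, -0.5, 0.25, 1.0], dtype=np.float32)
q, s = quantize_block(x)
print(dequantize(q, s))  # close to x; max error is about scale / 2
```

Note the accumulation caveat from the bullet above: even with low-bit stored weights, dot-product partial sums are typically kept in fp16/fp32 to avoid overflow and drift.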

Logarithmic / addition-only representations

  • Multiple commenters identify the method as a form of logarithmic number system where multiplications become additions.
  • The difficult part is handling accumulations and wide dynamic ranges in log space without large errors.
  • Prior related work is cited (log-number representations, approximate gradients), and some are surprised the paper doesn’t engage more with that literature or derive error terms clearly.
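A minimal logarithmic-number-system sketch makes the trade-off in these bullets concrete: multiplication really does become addition, but accumulation has no cheap log-space equivalent. This is a generic LNS illustration, not the paper's specific scheme; real hardware replaces the conversions with lookup tables or approximations:

```python
import math

def to_lns(x: float):
    """Represent a nonzero value as (sign, log2|x|)."""
    return (1 if x >= 0 else -1, math.log2(abs(x)))

def lns_mul(a, b):
    # Multiplication in log space: multiply signs, ADD exponents.
    return (a[0] * b[0], a[1] + b[1])

def from_lns(v) -> float:
    return v[0] * 2.0 ** v[1]

def lns_dot(xs, ys):
    # The hard part: there is no cheap addition in log space, so each
    # product must be converted back to linear before accumulating.
    return sum(from_lns(lns_mul(to_lns(x), to_lns(y))) for x, y in zip(xs, ys))

a, b = to_lns(3.0), to_lns(-4.0)
print(from_lns(lns_mul(a, b)))  # approximately -12.0
```

The conversion in `lns_dot` is exactly where approximation error and dynamic-range issues creep in, which is what the bullet about accumulation is pointing at.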

Hardware implications and ecosystem

  • Some envision custom architectures with compute colocated with memory (systolic arrays, compute-in-memory, FPGA/DRAM ALUs) where addition-heavy schemes could shine.
  • Others stress that even with addition-only kernels, the workload remains massively parallel and still maps well to GPUs.
  • A question is raised about whether the approach would actually be faster in practice; the thread notes that the paper emphasizes energy rather than latency, and that the specialized hardware it recommends is explicitly described as “patent pending.”
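To see why addition-heavy multiplication maps to very simple hardware, here is Mitchell's classic approximation: two positive floats are "multiplied" by integer-adding their raw IEEE-754 bit patterns and subtracting the exponent bias. It is related in spirit to, but not the same as, the paper's method:

```python
import struct

def f2i(x: float) -> int:
    """Raw IEEE-754 single-precision bit pattern of x."""
    return struct.unpack('<I', struct.pack('<f', x))[0]

def i2f(i: int) -> float:
    return struct.unpack('<f', struct.pack('<I', i))[0]

def approx_mul(a: float, b: float) -> float:
    """Approximate a*b for positive floats using only integer addition.

    Adding bit patterns adds exponents exactly and mantissa fractions
    approximately (ignoring the cross term); 0x3F800000 is the bit
    pattern of 1.0, which cancels the doubled exponent bias.
    Worst-case relative error is roughly 11%.
    """
    return i2f(f2i(a) + f2i(b) - 0x3F800000)

print(approx_mul(3.0, 4.0))  # close to 12.0
print(approx_mul(1.5, 1.5))  # 2.0, vs exact 2.25 (near worst-case error)
```

An integer adder is far smaller and lower-power than a mantissa multiplier array, which is the hardware intuition behind the energy claims discussed in this section.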

Corporate influence and Nvidia speculation

  • One commenter proposes a conspiracy theory that GPU vendors suppress research that would devalue multipliers; others strongly reject this, citing:
    • Competing funders (big tech companies) would have incentives to support such work.
    • GPU vendors themselves publish research on novel number formats and log-based schemes.
    • Most of Nvidia’s advantage is attributed to ecosystem and architecture, not just multipliers.