Addition is all you need for energy-efficient language models

Compute vs memory and real energy savings

  • Several commenters argue that transformers are more memory-bandwidth-bound than compute-bound, especially for single-user / small-batch inference.
  • The cited “95% / 80% energy reduction” is criticized as being measured only on isolated fp32 multipliers/dot products, not end-to-end inference, where fetching weights dominates power.
  • Others note that prefill and multi-batch decoding, training, and large-batch inference can still be compute-dominated, so compute-efficient schemes may matter more there.
  • Consensus: reducing multiplications helps, but without reducing memory traffic, system-level gains may be modest.
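The bandwidth-vs-compute distinction above can be made concrete with a back-of-the-envelope arithmetic-intensity calculation (a sketch with illustrative numbers, not figures from the paper or thread):

```python
# Sketch: arithmetic intensity (FLOPs per byte of weight traffic) for one
# (d x d) weight matrix at a given batch size, assuming fp16 (2 bytes/weight).
# Activations are ignored since weight traffic dominates for large d.

def arithmetic_intensity(d: int, batch: int, bytes_per_weight: float = 2.0) -> float:
    flops = 2 * d * d * batch               # one multiply-accumulate per weight per token
    bytes_moved = d * d * bytes_per_weight  # each weight is fetched once per pass
    return flops / bytes_moved

# Single-user decode (batch=1): ~1 FLOP per byte fetched -> bandwidth-bound,
# so cheaper multipliers barely move total energy.
print(arithmetic_intensity(4096, 1))    # 1.0

# Prefill / large-batch inference (batch=256): hundreds of FLOPs per byte
# -> compute-bound, where addition-only arithmetic could matter.
print(arithmetic_intensity(4096, 256))  # 256.0
```

This is the quantitative version of the consensus bullet: the arithmetic savings only dominate once each fetched weight is reused many times.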

Numeric formats: fp32, fp16/BF16, fp8, fp4, int

  • fp32 is seen as overkill for inference; fp16/BF16 are treated as effectively “unquantized,” while fp8 is “lightly quantized” and widely used for large LLMs with only small quality loss.
  • Some point out that the paper’s power claims are for fp32, while its accuracy results are for fp8, calling this comparison “disingenuous.”
  • Discussion of fp4/fp8 as compressed formats with shared scaling factors; multiplications can be LUT-based, but accumulations still require higher precision.
  • There’s debate over when to use each precision; the rule of thumb offered is to use the lowest precision that fits quality and memory constraints, with diminishing returns above fp8 at inference.
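The shared-scaling-factor idea mentioned above can be sketched in a few lines. This toy uses a symmetric integer grid for simplicity; real fp8 formats (e4m3/e5m2) have a nonuniform grid, so treat this only as an illustration of blockwise scaling:

```python
import numpy as np

def quantize_block(x: np.ndarray, bits: int = 8):
    """Quantize a block of values to a low-bit grid with one shared scale."""
    qmax = 2 ** (bits - 1) - 1           # e.g. 127 for 8 bits
    scale = np.abs(x).max() / qmax       # single scale shared by the whole block
    q = np.round(x / scale).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

x = np.array([0.1, -0.5, 0.25, 1.0], dtype=np.float32)
q, s = quantize_block(x)
print(dequantize(q, s))  # close to x; max error is about scale / 2
```

Note the accumulation caveat from the bullet above: even with low-bit stored weights, dot-product partial sums are typically kept in fp16/fp32 to avoid overflow and drift.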

Logarithmic / addition-only representations

  • Multiple commenters identify the method as a form of logarithmic number system where multiplications become additions.
  • The difficult part is handling accumulations and wide dynamic ranges in log space without large errors.
  • Prior related work is cited (log-number representations, approximate gradients), and some are surprised the paper doesn’t engage more with that literature or derive error terms clearly.
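A minimal logarithmic-number-system sketch makes the trade-off in these bullets concrete: multiplication really does become addition, but accumulation has no cheap log-space equivalent. This is a generic LNS illustration, not the paper's specific scheme; real hardware replaces the conversions with lookup tables or approximations:

```python
import math

def to_lns(x: float):
    """Represent a nonzero value as (sign, log2|x|)."""
    return (1 if x >= 0 else -1, math.log2(abs(x)))

def lns_mul(a, b):
    # Multiplication in log space: multiply signs, ADD exponents.
    return (a[0] * b[0], a[1] + b[1])

def from_lns(v) -> float:
    return v[0] * 2.0 ** v[1]

def lns_dot(xs, ys):
    # The hard part: there is no cheap addition in log space, so each
    # product must be converted back to linear before accumulating.
    return sum(from_lns(lns_mul(to_lns(x), to_lns(y))) for x, y in zip(xs, ys))

a, b = to_lns(3.0), to_lns(-4.0)
print(from_lns(lns_mul(a, b)))  # approximately -12.0
```

The conversion in `lns_dot` is exactly where approximation error and dynamic-range issues creep in, which is what the bullet about accumulation is pointing at.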

Hardware implications and ecosystem

  • Some envision custom architectures with compute colocated with memory (systolic arrays, compute-in-memory, FPGA/DRAM ALUs) where addition-heavy schemes could shine.
  • Others stress that even with addition-only kernels, the workload remains massively parallel and still maps well to GPUs.
  • A question is raised about whether the approach would actually be faster in practice; the thread notes that the paper emphasizes energy rather than latency, and that the specialized hardware it recommends is explicitly described as “patent pending.”
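To see why addition-heavy multiplication maps to very simple hardware, here is Mitchell's classic approximation: two positive floats are "multiplied" by integer-adding their raw IEEE-754 bit patterns and subtracting the exponent bias. It is related in spirit to, but not the same as, the paper's method:

```python
import struct

def f2i(x: float) -> int:
    """Raw IEEE-754 single-precision bit pattern of x."""
    return struct.unpack('<I', struct.pack('<f', x))[0]

def i2f(i: int) -> float:
    return struct.unpack('<f', struct.pack('<I', i))[0]

def approx_mul(a: float, b: float) -> float:
    """Approximate a*b for positive floats using only integer addition.

    Adding bit patterns adds exponents exactly and mantissa fractions
    approximately (ignoring the cross term); 0x3F800000 is the bit
    pattern of 1.0, which cancels the doubled exponent bias.
    Worst-case relative error is roughly 11%.
    """
    return i2f(f2i(a) + f2i(b) - 0x3F800000)

print(approx_mul(3.0, 4.0))  # close to 12.0
print(approx_mul(1.5, 1.5))  # 2.0, vs exact 2.25 (near worst-case error)
```

An integer adder is far smaller and lower-power than a mantissa multiplier array, which is the hardware intuition behind the energy claims discussed in this section.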

Corporate influence and Nvidia speculation

  • One commenter proposes a conspiracy theory that GPU vendors suppress research that would devalue multipliers; others strongly reject this, citing:
    • Competing funders (big tech companies) would have incentives to support such work.
    • GPU vendors themselves publish research on novel number formats and log-based schemes.
    • Most of Nvidia’s advantage is attributed to ecosystem and architecture, not just multipliers.