AI engineers claim new algorithm reduces AI power consumption by 95%

What the algorithm is doing

  • Many commenters relate L‑Mul to classic math tricks: working in log space, where multiplication becomes addition (log(xy) = log(x) + log(y)), or first‑order approximations like (1+a)(1+b) ≈ 1 + a + b when a and b are small.
  • It operates on low‑precision formats (e.g., 8‑bit floats), approximating floating‑point multiplication via integer adds on exponent/mantissa bits, plus small correction terms.
  • Several note this is conceptually close to logarithmic number systems, fixed‑point/Q‑format arithmetic, and long‑used DSP/FPGA techniques.
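The bit‑level idea behind these tricks can be sketched in a few lines. The snippet below is not the paper's exact L‑Mul, just the classic Mitchell‑style approximation it resembles: because an IEEE‑754 bit pattern is roughly a fixed‑point logarithm of the value, adding the raw bits of two positive floats (and re‑centering by the bit pattern of 1.0) approximates the bits of their product.

```python
import struct

def float_to_bits(x: float) -> int:
    """Raw IEEE-754 binary32 bit pattern of x."""
    return struct.unpack("<I", struct.pack("<f", x))[0]

def bits_to_float(b: int) -> float:
    """Reinterpret a 32-bit pattern as a float."""
    return struct.unpack("<f", struct.pack("<I", b & 0xFFFFFFFF))[0]

BIAS = 0x3F800000  # bit pattern of 1.0f; subtracting it re-centers the exponent

def approx_mul(a: float, b: float) -> float:
    """Approximate a*b (a, b > 0) with one integer add on the raw float bits.

    Viewed as an integer, a float's bit pattern is a crude fixed-point log2,
    so bits(a) + bits(b) - bits(1.0) approximates bits(a*b). Mantissa carry
    into the exponent plays the role of the (1+a)(1+b) ~ 1+a+b cross term.
    """
    return bits_to_float(float_to_bits(a) + float_to_bits(b) - BIAS)

print(approx_mul(3.0, 5.0))  # → 14.0 (exact product is 15.0, ~6.7% error)
```

The paper's L‑Mul adds small correction terms on the mantissa to tighten exactly this kind of error; the sketch above shows only the uncorrected baseline.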

Claims about precision and energy savings

  • The paper claims L‑Mul can match or beat 8‑bit floating point (e4m3, e5m2) in precision while saving up to ~95% of the energy of element‑wise floating‑point multiplications and ~80% for dot products.
  • Multiple commenters emphasize this 95% is per multiply, not overall model power; inference is often memory‑bandwidth‑dominated, so real end‑to‑end gains would be much smaller.
  • An approximate‑computing researcher argues:
    • Much power is in data movement, not the arithmetic itself.
    • The paper’s accuracy comparison ignores standard “round to nearest even” in baseline FP, making the claimed superiority “non‑sensical.”
    • Reported attention‑accuracy results lack detail on scaling/accumulation, so are hard to trust.
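The per‑multiply vs. end‑to‑end distinction is easy to quantify with an Amdahl's‑law‑style bound. The 20%/80% split below is an illustrative assumption, not a measured figure:

```python
def end_to_end_saving(mult_fraction: float, mult_saving: float) -> float:
    """Overall energy saving when only the multiply share of the
    energy budget is reduced (Amdahl's-law-style accounting)."""
    return mult_fraction * mult_saving

# Assume (hypothetically) multiplies are 20% of inference energy and the
# other 80% is data movement, accumulation, and control overhead.
overall = end_to_end_saving(mult_fraction=0.20, mult_saving=0.95)
print(f"{overall:.0%}")  # → 19%, far from the headline 95%
```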

Practical applicability and hardware implications

  • Consensus: this won’t remove the need for GPUs; parallelism for large models is still essential. It mainly targets more efficient inference and possibly training on suitably designed hardware.
  • Current GPUs/CPUs are not optimized for this; specialized accelerators could, in principle, exploit it. Some expect any real benefit would prompt hardware vendors to respond.
  • Debate on vendor impact:
    • Some foresee “bad news” for Nvidia; others note Nvidia could simply implement the scheme in CUDA and still win.
    • AMD’s ROCm and data‑center GPUs are discussed as partial alternatives but still trailing Nvidia in ecosystem maturity.

Experimentation and limitations

  • A hand‑written AVX‑512 L‑Mul approximation applied directly to an FP16 Llama model produced gibberish output, suggesting models must be trained specifically for this arithmetic and/or only some layers can use it.
  • One implementation (BitNet/bitnet.cpp) shows promising CPU speedups (≈1.4–6×) and 55–82% CPU energy reductions for certain 1‑bit/1.58‑bit models, but that is a different, though related, line of work.
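One plausible mechanism for the gibberish is error compounding across layers: a Mitchell‑style integer‑add multiply (used here as a hypothetical stand‑in for L‑Mul) carries a systematic downward bias, and a model trained with exact arithmetic never learns to compensate. A minimal sketch of the drift:

```python
import struct

def bits(x: float) -> int:
    return struct.unpack("<I", struct.pack("<f", x))[0]

def flt(b: int) -> float:
    return struct.unpack("<f", struct.pack("<I", b & 0xFFFFFFFF))[0]

def approx_mul(a: float, b: float) -> float:
    # Integer add on raw float bits approximates a*b (for a, b > 0).
    return flt(bits(a) + bits(b) - 0x3F800000)

# Chain 20 multiplications by 1.1: the bias is one-sided, so it never
# cancels, and the approximate result drifts steadily below the truth.
exact = approx = 1.0
for _ in range(20):
    exact *= 1.1
    approx = approx_mul(approx, 1.1)
print(exact, approx)  # exact ≈ 6.73, approx ≈ 4.00 — ~40% relative error
```

Training with the approximate arithmetic in the loop lets the weights absorb this bias, which is consistent with the observation that models must be trained for it.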

Meta: hype, impact, and rebound effects

  • Multiple comments criticize clickbait headlines, stress that the results are theoretical or narrow, and call for real, system‑level benchmarks.
  • Some invoke Jevons paradox: more efficient AI may simply lead to far more AI usage, not less total energy.
  • There is broader side‑discussion on whether LLMs’ productivity gains justify their energy and cost, with both strong advocates and skeptics represented.