AI engineers claim new algorithm reduces AI power consumption by 95%
What the algorithm is doing
- Many commenters relate L‑Mul to classic math tricks: working in log space (log(xy) = log(x) + log(y)) or approximations like (1 + a)(1 + b) ≈ 1 + a + b when a and b are small.
- It operates on low‑precision formats (e.g., 8‑bit floats), approximating floating‑point multiplication via integer additions on the exponent/mantissa bits, plus small correction terms.
- Several note this is conceptually close to logarithmic number systems, fixed‑point/Q‑format arithmetic, and long‑used DSP/FPGA techniques.
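The bit-level idea can be sketched in float32 (this is in the spirit of L‑Mul and Mitchell's classic approximation, not the paper's exact 8‑bit algorithm; `approx_mul` and the helper names are illustrative):

```python
import struct

def f2i(x: float) -> int:
    """Reinterpret a float32's IEEE-754 bits as an unsigned 32-bit int."""
    return struct.unpack("<I", struct.pack("<f", x))[0]

def i2f(n: int) -> float:
    """Reinterpret an unsigned 32-bit int as a float32."""
    return struct.unpack("<f", struct.pack("<I", n & 0xFFFFFFFF))[0]

ONE = 0x3F800000  # bit pattern of 1.0f; subtracting it removes the doubled exponent bias

def approx_mul(a: float, b: float) -> float:
    """Approximate a*b for positive normal floats with a single integer addition.

    Adding the raw encodings sums the exponent fields exactly and the
    mantissa fields linearly, i.e. (1+ma)(1+mb) ≈ 1 + ma + mb.
    """
    return i2f(f2i(a) + f2i(b) - ONE)
```

For example, `approx_mul(1.5, 2.0)` happens to be exact (3.0), while `approx_mul(1.25, 1.25)` returns 1.5 against the true 1.5625, about 4% low; correction terms like those in the paper shrink that gap.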
Claims about precision and energy savings
- The paper claims L‑Mul can match or beat 8‑bit floating point (e4m3, e5m2) in precision while saving up to ~95% of the energy of a multiplication and ~80% for dot products.
- Multiple commenters emphasize this 95% is per multiply, not overall model power; inference is often memory‑bandwidth‑dominated, so real end‑to‑end gains would be much smaller.
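The per‑multiply vs. end‑to‑end distinction is just a proportionality argument; a one‑line model makes it concrete (the 20% multiply share below is a made‑up illustrative figure, not from the paper):

```python
def end_to_end_savings(mult_fraction: float, per_mult_savings: float) -> float:
    """Total energy reduction when only the multiplies get cheaper.

    mult_fraction:     share of total inference energy spent on multiplications
    per_mult_savings:  fractional reduction per multiply (e.g. 0.95)
    """
    return mult_fraction * per_mult_savings

# Hypothetical: if multiplies are 20% of inference energy, a 95%
# per-multiply saving cuts total energy by only ~19%.
overall = end_to_end_savings(0.20, 0.95)
```

Memory traffic, accumulation, and activation functions keep their full cost, which is why commenters expect real‑world gains far below the headline number.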
- An approximate‑computing researcher argues:
- Much power is in data movement, not the arithmetic itself.
- The paper’s accuracy comparison ignores standard “round to nearest even” in baseline FP, making the claimed superiority “non‑sensical.”
- Reported attention‑accuracy results lack detail on scaling/accumulation and are therefore hard to trust.
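The rounding objection can be demonstrated with a toy uniform quantizer (not the paper's e4m3/e5m2 formats): a baseline that truncates instead of using round‑to‑nearest‑even incurs noticeably more error, so comparing against such a baseline flatters the new scheme.

```python
import math

def quantize(x: float, step: float, mode: str = "rne") -> float:
    """Snap x to a uniform grid with the given step size.

    mode "rne"  : round to nearest, ties to even (the IEEE-754 default)
    mode "trunc": round toward zero, as a naive baseline might
    """
    q = x / step
    if mode == "trunc":
        return math.trunc(q) * step
    return round(q) * step  # Python's round() is round-half-to-even

vals = [0.1 * k for k in range(1, 100)]
step = 0.25
err_rne = sum(abs(v - quantize(v, step)) for v in vals) / len(vals)
err_trunc = sum(abs(v - quantize(v, step, "trunc")) for v in vals) / len(vals)
# round-to-nearest-even yields roughly half the average error of truncation
```

The researcher's point is that a fair FP8 baseline must use round‑to‑nearest‑even before any superiority claim is meaningful.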
Practical applicability and hardware implications
- Consensus: this won’t remove the need for GPUs; parallelism for large models remains essential. It mainly targets more efficient inference, and possibly training, on suitably designed hardware.
- Current GPUs/CPUs are not optimized for this; specialized accelerators could, in principle, exploit it. Some expect any real benefit would prompt hardware vendors to respond.
- Debate on vendor impact:
- Some foresee “bad news” for Nvidia; others note Nvidia could simply implement the scheme in CUDA and still win.
- AMD’s ROCm and data‑center GPUs are discussed as partial alternatives but still trailing Nvidia in ecosystem maturity.
Experimentation and limitations
- A hand‑written AVX‑512 L‑Mul approximation applied directly to an FP16 Llama model produced gibberish output, suggesting models must be trained specifically for this arithmetic and/or only some layers can use it.
- One implementation (BitNet/bitnet.cpp) shows promising CPU speedups (≈1.4–6×) and 55–82% CPU energy reductions for certain 1‑bit/1.58‑bit models, but that is a different, though related, line of work.
Meta: hype, impact, and rebound effects
- Multiple comments criticize clickbait headlines, stress that the results are theoretical or narrow, and call for real, system‑level benchmarks.
- Some invoke Jevons paradox: more efficient AI may simply lead to far more AI usage, not less total energy.
- There is broader side‑discussion on whether LLMs’ productivity gains justify their energy and cost, with both strong advocates and skeptics represented.