AI engineers claim new algorithm reduces AI power consumption by 95%

What the algorithm is doing

  • Many commenters relate L‑Mul to classic math tricks: working in log space, where multiplication becomes addition (log(xy) = log(x) + log(y)), or first‑order approximations like (1+a)(1+b) ≈ 1 + a + b when a and b are small.
  • It operates on low‑precision formats (e.g., 8‑bit floats), approximating floating‑point multiplication via integer adds on exponent/mantissa bits, plus small correction terms.
  • Several note this is conceptually close to logarithmic number systems, fixed‑point/Q‑format arithmetic, and long‑used DSP/FPGA techniques.
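The bit‑level idea behind these tricks can be sketched in a few lines. The snippet below is not the paper's exact L‑Mul, just the classic Mitchell‑style approximation it resembles: because an IEEE‑754 bit pattern is roughly a fixed‑point logarithm of the value, adding the raw bits of two positive floats (and re‑centering by the bit pattern of 1.0) approximates the bits of their product.

```python
import struct

def float_to_bits(x: float) -> int:
    """Raw IEEE-754 binary32 bit pattern of x."""
    return struct.unpack("<I", struct.pack("<f", x))[0]

def bits_to_float(b: int) -> float:
    """Reinterpret a 32-bit pattern as a float."""
    return struct.unpack("<f", struct.pack("<I", b & 0xFFFFFFFF))[0]

BIAS = 0x3F800000  # bit pattern of 1.0f; subtracting it re-centers the exponent

def approx_mul(a: float, b: float) -> float:
    """Approximate a*b (a, b > 0) with one integer add on the raw float bits.

    Viewed as an integer, a float's bit pattern is a crude fixed-point log2,
    so bits(a) + bits(b) - bits(1.0) approximates bits(a*b). Mantissa carry
    into the exponent plays the role of the (1+a)(1+b) ~ 1+a+b cross term.
    """
    return bits_to_float(float_to_bits(a) + float_to_bits(b) - BIAS)

print(approx_mul(3.0, 5.0))  # → 14.0 (exact product is 15.0, ~6.7% error)
```

The paper's L‑Mul adds small correction terms on the mantissa to tighten exactly this kind of error; the sketch above shows only the uncorrected baseline.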

Claims about precision and energy savings

  • The paper claims L‑Mul can match or beat 8‑bit floating point (e4m3, e5m2) in precision while saving up to ~95% of the energy of element‑wise floating‑point multiplications and ~80% for dot products.
  • Multiple commenters emphasize this 95% is per multiply, not overall model power; inference is often memory‑bandwidth‑dominated, so real end‑to‑end gains would be much smaller.
  • An approximate‑computing researcher argues:
    • Much power is in data movement, not the arithmetic itself.
    • The paper’s accuracy comparison ignores standard “round to nearest even” in baseline FP, making the claimed superiority “non‑sensical.”
    • Reported attention‑accuracy results lack detail on scaling/accumulation, so are hard to trust.
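The per‑multiply vs. end‑to‑end distinction is easy to quantify with an Amdahl's‑law‑style bound. The 20%/80% split below is an illustrative assumption, not a measured figure:

```python
def end_to_end_saving(mult_fraction: float, mult_saving: float) -> float:
    """Overall energy saving when only the multiply share of the
    energy budget is reduced (Amdahl's-law-style accounting)."""
    return mult_fraction * mult_saving

# Assume (hypothetically) multiplies are 20% of inference energy and the
# other 80% is data movement, accumulation, and control overhead.
overall = end_to_end_saving(mult_fraction=0.20, mult_saving=0.95)
print(f"{overall:.0%}")  # → 19%, far from the headline 95%
```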

Practical applicability and hardware implications

  • Consensus: this won’t remove the need for GPUs; parallelism for large models is still essential. It mainly targets more efficient inference and possibly training on suitably designed hardware.
  • Current GPUs/CPUs are not optimized for this; specialized accelerators could, in principle, exploit it. Some expect any real benefit would prompt hardware vendors to respond.
  • Debate on vendor impact:
    • Some foresee “bad news” for Nvidia; others note Nvidia could simply implement the scheme in CUDA and still win.
    • AMD’s ROCm and data‑center GPUs are discussed as partial alternatives but still trailing Nvidia in ecosystem maturity.

Experimentation and limitations

  • A hand‑written AVX‑512 L‑Mul approximation applied directly to an FP16 Llama model produced gibberish output, suggesting models must be trained specifically for this arithmetic and/or only some layers can use it.
  • One implementation (BitNet/bitnet.cpp) shows promising CPU speedups (≈1.4–6×) and 55–82% CPU energy reductions for certain 1‑bit/1.58‑bit models, but that is a different, though related, line of work.
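One plausible mechanism for the gibberish is error compounding across layers: a Mitchell‑style integer‑add multiply (used here as a hypothetical stand‑in for L‑Mul) carries a systematic downward bias, and a model trained with exact arithmetic never learns to compensate. A minimal sketch of the drift:

```python
import struct

def bits(x: float) -> int:
    return struct.unpack("<I", struct.pack("<f", x))[0]

def flt(b: int) -> float:
    return struct.unpack("<f", struct.pack("<I", b & 0xFFFFFFFF))[0]

def approx_mul(a: float, b: float) -> float:
    # Integer add on raw float bits approximates a*b (for a, b > 0).
    return flt(bits(a) + bits(b) - 0x3F800000)

# Chain 20 multiplications by 1.1: the bias is one-sided, so it never
# cancels, and the approximate result drifts steadily below the truth.
exact = approx = 1.0
for _ in range(20):
    exact *= 1.1
    approx = approx_mul(approx, 1.1)
print(exact, approx)  # exact ≈ 6.73, approx ≈ 4.00 — ~40% relative error
```

Training with the approximate arithmetic in the loop lets the weights absorb this bias, which is consistent with the observation that models must be trained for it.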

Meta: hype, impact, and rebound effects

  • Multiple comments criticize clickbait headlines, stress that the results are theoretical or narrow, and call for real, system‑level benchmarks.
  • Some invoke Jevons paradox: more efficient AI may simply lead to far more AI usage, not less total energy.
  • There is broader side‑discussion on whether LLMs’ productivity gains justify their energy and cost, with both strong advocates and skeptics represented.