New LLM optimization technique slashes memory costs

Scope of the Optimization

  • Technique targets KV cache / context window memory, not the base model weights.
  • Several commenters note that the title “slashes memory costs” is misleading if read as reducing total model VRAM; the savings are specifically in the working memory used for context.
  • For small models (1–8B), context RAM is often the main bottleneck in practice, so this still matters a lot for real workloads.
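For a rough sense of scale, the KV-cache footprint grows linearly with context length. A back-of-the-envelope calculation (the model dimensions below are illustrative, assuming a Llama-3-8B-like configuration, not figures from the thread) shows why context memory can rival model weights at long contexts:

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len,
                   dtype_bytes=2, batch=1):
    """Size of the KV cache: one K and one V tensor per layer,
    each of shape (n_kv_heads, seq_len, head_dim)."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * dtype_bytes * batch

# Llama-3-8B-like dims: 32 layers, 8 KV heads (GQA), head_dim 128, fp16
print(f"{kv_cache_bytes(32, 8, 128, 8192) / 2**30:.1f} GiB at 8K context")
print(f"{kv_cache_bytes(32, 8, 128, 131072) / 2**30:.1f} GiB at 128K context")
```

Under these assumptions the cache is about 1 GiB at 8K tokens but ~16 GiB at 128K — comparable to the fp16 weights of an 8B model — which is why pruning it matters for real workloads.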

Relation to HeadKV and Similar Work

  • Compared with Microsoft’s HeadKV (claims ~98% KV memory reduction with ~97% performance retained).
  • Both operate on the KV cache, i.e., attention memory over past tokens.
  • The NAMM paper explicitly describes using evolution to learn how to prune KV cache entries; commenters suggest the two techniques might be composable, but that’s unproven.
  • One commenter stresses this is not like lossless compression: both methods drop information in (hopefully) performance-preserving ways.

How It Works (Conceptual)

  • KV cache holds hidden-state tensors (latent space), not raw tokens; each attention head and layer has its own cache.
  • NAMMs decide which token states to “remember” or forget, effectively acting as a learned lossy compressor or “boringness classifier” over context.
  • The method is trained separately and applied at inference to arbitrary transformers, potentially across modalities and tasks.
  • Intuition discussed: tokens that frequently receive attention across positions are more “important”; the method exploits this frequency structure in attention matrices.
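The pruning intuition above can be sketched with a toy scorer. This is an illustrative stand-in, not the actual NAMM (which learns the retention policy via evolution); it only shows the shape of the operation — score cached entries by how much attention they receive, keep the most-attended ones:

```python
import numpy as np

def prune_kv_by_attention(keys, values, attn, keep_ratio=0.5):
    """Toy KV-cache pruner (illustrative, not the NAMM method).

    keys, values: (seq_len, d) cached tensors for one attention head
    attn: (n_queries, seq_len) attention weights from recent query steps
    """
    scores = attn.sum(axis=0)                # cumulative attention per cached token
    k = max(1, int(len(scores) * keep_ratio))
    keep = np.sort(np.argsort(scores)[-k:])  # top-k indices, original order kept
    return keys[keep], values[keep], keep

rng = np.random.default_rng(0)
seq, d = 8, 4
K, V = rng.normal(size=(seq, d)), rng.normal(size=(seq, d))
attn = rng.dirichlet(np.ones(seq), size=3)   # 3 query rows, each sums to 1
K2, V2, kept = prune_kv_by_attention(K, V, attn, keep_ratio=0.5)
print(kept)  # indices of the 4 most-attended cached tokens
```

Note this operates per head in latent space, consistent with the bullet above: each head/layer cache can be pruned independently, and the dropped entries are gone (lossy), not recoverable.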

Inference vs Training

  • Primary benefit is for inference, where KV caching dominates long-context cost.
  • Commenters clarify that training also uses forward passes (as in inference) plus backprop; KV optimizations could help certain training setups (e.g., RL with cached sequences), but that benefit is secondary.
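To see why inference benefits most, compare the attention work needed to decode a sequence autoregressively with and without a reusable cache (a rough operation count for a single head, not a benchmark):

```python
def attention_dot_products(seq_len, cached=True):
    """Rough count of query-key dot products to generate seq_len tokens
    autoregressively (single head, constants ignored)."""
    if cached:
        # each new token attends once to all positions so far
        return sum(t for t in range(1, seq_len + 1))
    # without a cache, every step re-runs attention over the whole prefix
    return sum(t * t for t in range(1, seq_len + 1))

print(attention_dot_products(1024, cached=True))   # ~0.5M: quadratic total
print(attention_dot_products(1024, cached=False))  # ~358M: cubic total
```

Caching turns the total decoding cost from cubic to quadratic in sequence length, which is why the cache is kept at all — and why shrinking it (rather than discarding it) is the attractive option.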

Risks and Limitations

  • Being lossy, it can discard useful tokens; reliability concerns are raised.
  • It only reduces context memory, so it doesn’t enable fitting much larger base models on low-VRAM GPUs.

Broader Themes: Efficiency, Energy, and Future Optimizations

  • Several comments note LLMs are still highly inefficient; we’re early in the “compression/optimization era.”
  • Speculation that future algorithmic and hardware gains could massively shrink compute and memory needs, tying into “hardware overhang” worries.
  • Others invoke the Jevons paradox: efficiency gains will likely increase total AI compute and energy use rather than decrease it.