New LLM optimization technique slashes memory costs
Scope of the Optimization
- Technique targets KV cache / context window memory, not the base model weights.
- Several commenters note the title “slashes memory costs” is misleading if read as referring to total model VRAM; the savings are specifically in the working memory used for context.
- For small models (1–8B), context RAM is often the main bottleneck in practice, so this still matters a lot for real workloads.
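A back-of-envelope estimate shows why context memory can dominate. The configuration below (32 layers, 32 KV heads, head dim 128, fp16) is an assumed Llama-2-7B-like setup for illustration, not a figure from the thread:

```python
# Rough KV cache size for an assumed Llama-2-7B-like transformer.
# Each layer caches two tensors (K and V), each [n_kv_heads, seq_len, head_dim].
def kv_cache_bytes(seq_len, n_layers=32, n_kv_heads=32, head_dim=128, dtype_bytes=2):
    return 2 * n_layers * n_kv_heads * seq_len * head_dim * dtype_bytes

gib = kv_cache_bytes(seq_len=32_768) / 2**30
print(f"{gib:.1f} GiB")  # 16.0 GiB at 32k context -- rivaling the ~13 GiB of fp16 weights
```

At long contexts the cache grows linearly with sequence length while the weights stay fixed, which is why pruning it matters so much for small-model deployments.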
Relation to HeadKV and Similar Work
- Compared with Microsoft’s HeadKV (claims ~98% KV memory reduction with ~97% performance retained).
- Both operate on the KV cache, i.e., attention memory over past tokens.
- The NAMM (Neural Attention Memory Models) paper explicitly describes using evolution to learn how to prune KV cache entries; commenters suggest these techniques might be composable, but that is unproven.
- One commenter stresses this is not like lossless compression: both methods drop information in (hopefully) performance-preserving ways.
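The lossy-vs-lossless point can be made concrete with a toy attention computation. The importance heuristic here (keep the entries with the highest attention weight) is a crude stand-in for either method's learned scoring, not the actual NAMM or HeadKV criterion:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
K = rng.normal(size=(16, d))   # 16 cached key vectors
V = rng.normal(size=(16, d))   # 16 cached value vectors
q = rng.normal(size=(d,))      # current query

def attend(q, K, V):
    s = K @ q / np.sqrt(d)
    w = np.exp(s - s.max()); w /= w.sum()
    return w @ V

full = attend(q, K, V)
# "Prune" to the 8 highest-weight entries -- a toy importance score.
scores = K @ q
keep = np.argsort(scores)[-8:]
pruned = attend(q, K[keep], V[keep])
print(np.max(np.abs(full - pruned)))  # nonzero: the dropped entries are gone for good
```

Unlike lossless compression, there is no way to reconstruct the original output from the pruned cache; the bet is that the discarded entries contributed little.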
How It Works (Conceptual)
- KV cache holds hidden-state tensors (latent space), not raw tokens; each attention head and layer has its own cache.
- NAMMs decide which token states to “remember” or forget, effectively acting as a learned lossy compressor or “boringness classifier” over context.
- The method is trained separately and applied at inference to arbitrary transformers, potentially across modalities and tasks.
- Intuition from the thread: tokens that frequently receive attention across positions are more “important”; the method exploits this frequency structure in attention matrices.
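The attention-frequency intuition above can be sketched as a simple eviction rule. This is a hand-rolled heuristic in the spirit of the idea, not the evolved NAMM classifier:

```python
import numpy as np

def evict_by_attention_frequency(attn, keep_ratio=0.5):
    """attn: [n_queries, n_cached] attention weights accumulated over recent
    steps. Keeps the cache entries that were attended to the most overall."""
    importance = attn.sum(axis=0)                     # total attention per entry
    n_keep = max(1, int(keep_ratio * attn.shape[1]))
    return np.sort(np.argsort(importance)[-n_keep:])  # indices to retain, in order

rng = np.random.default_rng(1)
attn = rng.dirichlet(np.ones(12), size=6)   # 6 queries over 12 cached tokens
keep = evict_by_attention_frequency(attn, keep_ratio=0.25)
print(keep)  # the 3 most-attended cache positions
```

In the actual paper the decision function is learned (via evolution) rather than this fixed sum, and it runs per head and per layer, since each maintains its own cache.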
Inference vs Training
- Primary benefit is for inference, where KV caching dominates long-context cost.
- Clarification from comments: training also runs forward passes (like inference) plus backprop, so KV optimizations could help certain training setups (e.g., RL with cached sequences), but that benefit is secondary.
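To see why inference is where KV caching pays off, here is a minimal sketch of autoregressive decoding: each step computes K/V only for the new token and reuses everything cached, so the cache (and its memory) grows with every generated token. Weight matrices and dimensions are illustrative assumptions:

```python
import numpy as np

d = 4
rng = np.random.default_rng(2)
Wk, Wv = rng.normal(size=(d, d)), rng.normal(size=(d, d))

K_cache, V_cache = [], []

def decode_step(x):
    """One autoregressive step: only the new token's K/V are computed;
    all earlier entries are reused from the cache."""
    K_cache.append(x @ Wk)
    V_cache.append(x @ Wv)
    K, V = np.stack(K_cache), np.stack(V_cache)
    s = K @ x / np.sqrt(d)
    w = np.exp(s - s.max()); w /= w.sum()
    return w @ V

for _ in range(5):
    out = decode_step(rng.normal(size=(d,)))
print(len(K_cache))  # 5 cached key vectors -- memory grows with context length
```

Pruning entries from `K_cache`/`V_cache` between steps is exactly where a method like NAMM intervenes; during training, by contrast, full sequences are typically processed in one pass, so there is less of a persistent cache to shrink.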
Risks and Limitations
- Being lossy, it can discard useful tokens; reliability concerns are raised.
- It only reduces context memory, so it doesn’t enable fitting much larger base models on low-VRAM GPUs.
Broader Themes: Efficiency, Energy, and Future Optimizations
- Several comments note LLMs are still highly inefficient; we’re early in the “compression/optimization era.”
- Speculation that future algorithmic and hardware gains could massively shrink compute and memory needs, tying into “hardware overhang” worries.
- Others argue Jevons paradox: efficiency gains will likely increase total AI compute and energy use, not decrease it.