New LLM optimization technique slashes memory costs

Scope of the Optimization

  • Technique targets KV cache / context window memory, not the base model weights.
  • Several commenters note that the title “slashes memory costs” is misleading if read as reducing total model VRAM; the savings are specifically in the working memory used for context.
  • For small models (1–8B), context RAM is often the main bottleneck in practice, so this still matters a lot for real workloads.
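For a rough sense of scale, the KV-cache footprint grows linearly with context length. A back-of-the-envelope calculation (the model dimensions below are illustrative, assuming a Llama-3-8B-like configuration, not figures from the thread) shows why context memory can rival model weights at long contexts:

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len,
                   dtype_bytes=2, batch=1):
    """Size of the KV cache: one K and one V tensor per layer,
    each of shape (n_kv_heads, seq_len, head_dim)."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * dtype_bytes * batch

# Llama-3-8B-like dims: 32 layers, 8 KV heads (GQA), head_dim 128, fp16
print(f"{kv_cache_bytes(32, 8, 128, 8192) / 2**30:.1f} GiB at 8K context")
print(f"{kv_cache_bytes(32, 8, 128, 131072) / 2**30:.1f} GiB at 128K context")
```

Under these assumptions the cache is about 1 GiB at 8K tokens but ~16 GiB at 128K — comparable to the fp16 weights of an 8B model — which is why pruning it matters for real workloads.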

Relation to HeadKV and Similar Work

  • Compared with Microsoft’s HeadKV (claims ~98% KV memory reduction with ~97% performance retained).
  • Both operate on the KV cache, i.e., attention memory over past tokens.
  • The NAMM paper explicitly describes using evolution to learn how to prune KV cache entries; commenters suggest the two techniques might be composable, but that’s unproven.
  • One commenter stresses this is not like lossless compression: both methods drop information in (hopefully) performance-preserving ways.

How It Works (Conceptual)

  • KV cache holds hidden-state tensors (latent space), not raw tokens; each attention head and layer has its own cache.
  • NAMMs decide which token states to “remember” or forget, effectively acting as a learned lossy compressor or “boringness classifier” over context.
  • The method is trained separately and applied at inference to arbitrary transformers, potentially across modalities and tasks.
  • Intuition discussed: tokens that frequently receive attention across positions are more “important”; the method exploits this frequency structure in attention matrices.
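The pruning intuition above can be sketched with a toy scorer. This is an illustrative stand-in, not the actual NAMM (which learns the retention policy via evolution); it only shows the shape of the operation — score cached entries by how much attention they receive, keep the most-attended ones:

```python
import numpy as np

def prune_kv_by_attention(keys, values, attn, keep_ratio=0.5):
    """Toy KV-cache pruner (illustrative, not the NAMM method).

    keys, values: (seq_len, d) cached tensors for one attention head
    attn: (n_queries, seq_len) attention weights from recent query steps
    """
    scores = attn.sum(axis=0)                # cumulative attention per cached token
    k = max(1, int(len(scores) * keep_ratio))
    keep = np.sort(np.argsort(scores)[-k:])  # top-k indices, original order kept
    return keys[keep], values[keep], keep

rng = np.random.default_rng(0)
seq, d = 8, 4
K, V = rng.normal(size=(seq, d)), rng.normal(size=(seq, d))
attn = rng.dirichlet(np.ones(seq), size=3)   # 3 query rows, each sums to 1
K2, V2, kept = prune_kv_by_attention(K, V, attn, keep_ratio=0.5)
print(kept)  # indices of the 4 most-attended cached tokens
```

Note this operates per head in latent space, consistent with the bullet above: each head/layer cache can be pruned independently, and the dropped entries are gone (lossy), not recoverable.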

Inference vs Training

  • Primary benefit is for inference, where KV caching dominates long-context cost.
  • Commenters clarify that training also uses forward passes (as in inference) plus backprop; KV optimizations could help certain training setups (e.g., RL with cached sequences), but that benefit is secondary.
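To see why inference benefits most, compare the attention work needed to decode a sequence autoregressively with and without a reusable cache (a rough operation count for a single head, not a benchmark):

```python
def attention_dot_products(seq_len, cached=True):
    """Rough count of query-key dot products to generate seq_len tokens
    autoregressively (single head, constants ignored)."""
    if cached:
        # each new token attends once to all positions so far
        return sum(t for t in range(1, seq_len + 1))
    # without a cache, every step re-runs attention over the whole prefix
    return sum(t * t for t in range(1, seq_len + 1))

print(attention_dot_products(1024, cached=True))   # ~0.5M: quadratic total
print(attention_dot_products(1024, cached=False))  # ~358M: cubic total
```

Caching turns the total decoding cost from cubic to quadratic in sequence length, which is why the cache is kept at all — and why shrinking it (rather than discarding it) is the attractive option.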

Risks and Limitations

  • Being lossy, it can discard useful tokens; reliability concerns are raised.
  • It only reduces context memory, so it doesn’t enable fitting much larger base models on low-VRAM GPUs.

Broader Themes: Efficiency, Energy, and Future Optimizations

  • Several comments note LLMs are still highly inefficient; we’re early in the “compression/optimization era.”
  • Speculation that future algorithmic and hardware gains could massively shrink compute and memory needs, tying into “hardware overhang” worries.
  • Others invoke the Jevons paradox: efficiency gains will likely increase total AI compute and energy use rather than decrease it.