DeepSeek Open Source FlashMLA – MLA Decoding Kernel for Hopper GPUs

Technical aspects of FlashMLA and performance

  • The kernel targets Hopper GPUs (H100/H800-class), using BF16 and a paged KV cache (block size 64), claiming 3000 GB/s in memory-bound and 580 TFLOPS in compute-bound configurations on H800 (~90% bandwidth efficiency, ~60% compute efficiency).
  • NVLink is irrelevant to this kernel (single-GPU, no comms), but matters for multi-GPU training; H800’s reduced NVLink is a training/scale issue, not an inference one.
  • Only the forward (decoding) pass is released; some speculate the “real secret” lies in the backward pass or scheduler, and that heavy low-level optimization matters less during training.
  • FlashMLA is positioned as a decoding-time optimization, complementary to FlashAttention/FlashDecoding. Debate centers on whether decoding is fundamentally memory-bound vs compute-bound; consensus leans toward memory-bound in realistic serving scenarios, especially with KV cache.
  • MLA is discussed as likely “Multi-head Latent Attention” and a successor to GQA: for the same KV cache footprint, MLA is theoretically more expressive, and standard GQA models can be converted. Some question whether this extra expressive power yields practical gains and note GQA can be simpler/faster.
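The efficiency figures and the memory-bound-vs-compute-bound debate above can be sanity-checked with back-of-envelope roofline arithmetic. This sketch assumes H800 SXM peaks of ~3.35 TB/s HBM3 bandwidth and ~990 TFLOPS dense BF16 tensor-core throughput (the H800 keeps H100-class compute and memory; export limits cut interconnect); these peaks and the per-head FLOP counts are assumptions for illustration, not figures from the thread.

```python
# Back-of-envelope roofline check on the quoted FlashMLA figures.
# Assumed H800 SXM peaks: ~3.35 TB/s HBM3, ~990 TFLOPS dense BF16.
PEAK_BW = 3.35e12     # bytes/s
PEAK_FLOPS = 990e12   # FLOP/s

bw_eff = 3000e9 / PEAK_BW     # ~0.90 -> the "90% bandwidth" claim
fl_eff = 580e12 / PEAK_FLOPS  # ~0.59 -> the "~60% compute" claim
print(f"{bw_eff:.0%} {fl_eff:.0%}")

# Machine balance: FLOPs a kernel must perform per byte moved from HBM
# before compute, rather than memory, becomes the bottleneck.
balance = PEAK_FLOPS / PEAK_BW  # ~295 FLOP/byte

# Plain per-head-KV decode: each BF16 KV element (2 bytes) is read once and
# feeds roughly one multiply-add (2 FLOPs) in QK^T or PV -> ~1 FLOP/byte,
# far below balance -- hence the consensus that ordinary decoding is
# memory-bound in realistic serving.
decode_intensity = 2 / 2
print(decode_intensity < balance)  # True

# MLA's lever: the cached latent is shared by every query head (128 in
# DeepSeek-V3), multiplying reuse -- and arithmetic intensity -- by roughly
# the head count, which is how the kernel can approach the compute roof.
print(decode_intensity * 128)
```

Under these assumed peaks, the quoted 3000 GB/s and 580 TFLOPS land at ~90% and ~59% of the roofline, consistent with the efficiency figures claimed above.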

Ecosystem impact and integrations

  • vLLM already supports MLA for DeepSeek models with reported ~3× generation throughput and ~10× token memory capacity vs its own prior releases; MHA can still win at very low QPS. SGLang is also rapidly improving, including on AMD GPUs.
  • Commenters expect vLLM/SGLang and major inference providers to integrate or match FlashMLA; for most individuals, this code mainly matters via such frameworks, not direct use.
  • MLA’s reduced KV size is seen as transformative for context length: a concrete example argues MLA can raise H100 context capacity from ~46k to ~640k tokens.
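The context-length claim comes down to per-token KV-cache footprint. A minimal sketch of that arithmetic, assuming DeepSeek-V3-style dimensions (61 layers; MLA caching a 512-dim shared latent plus a 64-dim decoupled RoPE key per layer) and a full-MHA baseline of 128 heads of dim 128 as a labeled comparison point — the thread's exact 46k/640k figures depend on model and serving overheads not reproduced here:

```python
# Illustrative per-token KV-cache footprint behind the context-length claim.
def kv_bytes_per_token(layers: int, elems_per_layer: int, elem_bytes: int = 2) -> int:
    """KV-cache bytes per generated token (BF16 by default)."""
    return layers * elems_per_layer * elem_bytes

LAYERS = 61  # assumed DeepSeek-V3 layer count

# MLA caches one shared compressed latent (512 dims) plus a decoupled
# RoPE key (64 dims) per layer, shared across all query heads.
mla = kv_bytes_per_token(LAYERS, 512 + 64)

# Baseline: full MHA caching keys and values for 128 heads of dim 128
# (2 * 128 * 128 elements per layer) -- an assumed comparison point.
mha = kv_bytes_per_token(LAYERS, 2 * 128 * 128)

print(f"MLA: {mla / 1024:.0f} KiB/token")   # ~69 KiB
print(f"MHA: {mha / 2**20:.1f} MiB/token")  # ~3.8 MiB
print(f"ratio: {mha / mla:.0f}x")

# Token capacity scales by the same factor for a fixed KV budget, e.g.
# ~60 GiB of an 80 GB H100 left free for cache:
print(f"MLA tokens in 60 GiB: {60 * 2**30 // mla:,}")
```

Under these assumptions MLA's cache is roughly 57× smaller per token than full MHA, which is the mechanism behind the order-of-magnitude context-capacity jumps cited above.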

Hardware, sanctions, and supply chain

  • DeepSeek uses H800 (and possibly H20), export-limited Hopper variants allowed in China; no admission of illegal H100 use is implied.
  • Discussion around “smuggling” emphasizes: US law restricts US companies exporting to China, not Chinese firms buying elsewhere; Singapore is highlighted as a billing hub whose GPU import numbers don’t match Nvidia’s Singapore revenue.
  • Commenters debate whether sanctions are effective policy vs counterproductive (pushing self-reliance) and argue over the morality of re-exporting restricted hardware.

Open source strategy and career implications

  • Many see DeepSeek as “real” open AI: open-sourcing infra lowers costs for the ecosystem, hinders regulatory capture, and enables many competing services even if large clusters remain a barrier.
  • Others frame open-sourcing as a rational “runner-up” strategy to prevent the market from being locked by a single leader.
  • A thread develops around career strategy: some argue this sort of deeply optimized systems code is the new bar for “elite” programmers as AI eats higher-level work; others note such roles are few and that AI is already helping optimize low-level kernels (e.g., SIMD/CUDA PRs reportedly “99% written” by an LLM).
  • There is debate over whether “going lower in the stack” is a durable hedge against AI, with skepticism that narrow, well-defined domains will remain uniquely human for long, and analogies drawn to past “low-code” waves and leaky abstractions.