2025-06-01

Why DeepSeek is cheap at scale but expensive to run locally

DeepSeek’s Pricing vs Competitors

Commenters note that DeepSeek is very cheap at scale, but not 1/100th the price of its competitors: the gap is more like 1/10 to 1/20 versus top US models, and DeepSeek is more expensive than some budget options such as Gemini Flash.
Many regard its efficiency as a genuine engineering achievement (MoE + batching), but point out that other providers can still be substantially more expensive per token.

Batching, MoE, and Non‑Determinism

Core explanation: large batches let providers amortize memory reads across many tokens and keep tensor cores busy; this is crucial for MoE models, where only a subset of experts fires per token.
At small batch sizes (typical for local/single‑user), MoE loses much of its efficiency advantage, leading to poor FLOP utilization and high cost per token.
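The amortization argument can be made concrete with a rough sketch. It assumes that routing is uniform and independent across tokens, which real routers do not guarantee, and the expert counts (256 experts, 8 active per token) are illustrative placeholders rather than figures from the discussion. The function names are hypothetical.

```python
# Illustrative model only: uniform, independent routing is an assumption,
# and the expert counts used below are placeholders, not measured values.

def expected_active_experts(num_experts: int, top_k: int, batch_tokens: int) -> float:
    """Expected number of distinct experts hit by a batch of tokens."""
    # Probability that a given expert is skipped by every token in the batch.
    p_idle = (1.0 - top_k / num_experts) ** batch_tokens
    return num_experts * (1.0 - p_idle)


def expert_loads_per_token(num_experts: int, top_k: int, batch_tokens: int) -> float:
    """Expert weight loads from memory, amortized over the tokens in the batch."""
    return expected_active_experts(num_experts, top_k, batch_tokens) / batch_tokens


if __name__ == "__main__":
    for batch in (1, 8, 64, 512):
        loads = expert_loads_per_token(num_experts=256, top_k=8, batch_tokens=batch)
        print(f"batch={batch:4d}  expert loads per token ~ {loads:.2f}")
```

Under these assumptions a single token pays for loading all of its `top_k` experts by itself, while a large batch touches nearly every expert once and spreads that cost over all tokens, so the per-token load approaches `num_experts / batch_tokens`.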
Several comments clarify that:
- Attention and KV cache behave differently from dense MLP parts for batching.
- Non‑determinism can arise from different kernel choices, parallelism, and MoE routing sensitivity to batch layout, even with fixed seed and temperature.
- Requests in a batch should not semantically leak into each other, though some worry about this as a theoretical attack or implementation bug.
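One widely known mechanism that fits the non-determinism point above is that floating-point addition is not associative, so a kernel or batch layout that sums the same terms in a different order can produce a slightly different value. The sketch below is a toy illustration, not a description of any provider's kernels: `pick_expert` and its `rival_logit` parameter are hypothetical, and the near-tie between two expert scores is constructed on purpose.

```python
def sequential_sum(terms):
    """Add terms left to right, as a simple reduction loop would."""
    total = 0.0
    for t in terms:
        total += t
    return total


def pick_expert(terms, rival_logit=0.6):
    """Toy router: expert 0 wins only if its summed score beats the rival's."""
    return 0 if sequential_sum(terms) > rival_logit else 1


if __name__ == "__main__":
    forward = [0.1, 0.2, 0.3]
    backward = list(reversed(forward))
    # Same terms, different reduction order, different rounding.
    print(sequential_sum(forward), sequential_sum(backward))
    # The rounding difference is enough to flip a near-tied routing decision.
    print(pick_expert(forward), pick_expert(backward))
```

Once a routing decision flips, the token passes through different expert weights, so a difference in the last bit can grow into a visibly different output even with a fixed seed and temperature.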

Local vs Cloud Inference

Running DeepSeek V3/R1 locally is seen as “expensive” mainly due to memory needs (hundreds of GB) and multi‑GPU requirements for good speed.
Some users run quantized variants on high‑RAM CPU servers (e.g., EPYC/Xeon with 256–768 GB RAM) at 7–10 tokens/s, acceptable for personal use but much slower than cloud and with limited context.
Others argue CPU‑only is poor “bang for buck” once prompts get large; a single strong GPU with a smaller dense model (e.g., ~20–30B) often yields a better interactive experience.
Apple Silicon and high‑HBM AMD GPUs are discussed as interesting fits for MoE and large models, but AMD’s software/driver maturity is heavily debated.
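The memory and speed figures in this section can be sanity-checked with back-of-the-envelope arithmetic. The sketch below assumes roughly 671B total and 37B active parameters for DeepSeek V3/R1 (figures not stated in the discussion), 4-bit quantization, and a hypothetical 400 GB/s of memory bandwidth. It treats single-user decoding as purely bandwidth-bound and ignores the KV cache, activations, and prompt processing, so the result is an optimistic ceiling rather than a prediction.

```python
def weights_footprint_gb(total_params: float, bits_per_param: float) -> float:
    """Memory needed just to hold the weights, in GB (10^9 bytes)."""
    return total_params * bits_per_param / 8 / 1e9


def decode_tokens_per_s_upper_bound(
    active_params: float, bits_per_param: float, mem_bandwidth_gb_s: float
) -> float:
    """Ceiling on batch-1 decode speed if every active weight is read once per token."""
    bytes_per_token = active_params * bits_per_param / 8
    return mem_bandwidth_gb_s * 1e9 / bytes_per_token


if __name__ == "__main__":
    # All inputs are assumptions for illustration, not measurements.
    footprint = weights_footprint_gb(total_params=671e9, bits_per_param=4)
    ceiling = decode_tokens_per_s_upper_bound(
        active_params=37e9, bits_per_param=4, mem_bandwidth_gb_s=400
    )
    print(f"weights: ~{footprint:.0f} GB, decode ceiling: ~{ceiling:.0f} tokens/s")
```

Under these assumptions the weights alone occupy hundreds of GB, and the decode ceiling lands in the low tens of tokens per second, which is at least consistent with the 7–10 tokens/s that users report on high-RAM CPU servers.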

Privacy, Safety, and Propaganda Concerns

One participant claims ChatGPT exposed private GitHub repo contents; others strongly suspect hallucination and demand evidence. The alleged behavior is described as serious if true, but it remains unverified.
DeepSeek is reported by one user to enthusiastically support violent prompts framed in a revolutionary‑socialist context, raising concerns about state‑aligned propaganda and asymmetric safety tuning.
Broader worry that all LLMs will be powerful political‑messaging tools, regardless of country of origin.

Economics and “Rent‑Seeking” Debate

Some compare per‑token billing to telecom “minutes,” calling it extractive; others counter that huge capex and opex make this straightforward cost recovery, not rent‑seeking.
General expectation that current low prices are introductory and may rise once usage is entrenched and training costs grow.
