Why DeepSeek is cheap at scale but expensive to run locally
DeepSeek’s Pricing vs Competitors
- Commenters note that DeepSeek is very cheap at scale, but not 1/100th the price of top US models: the gap is closer to 1/10 to 1/20, and DeepSeek is still more expensive than some budget options such as Gemini Flash.
- Many regard its efficiency as a genuine engineering achievement (MoE + batching), but point out that other providers can still be substantially more expensive per token.
Batching, MoE, and Non‑Determinism
- Core explanation: large batches let providers amortize memory reads and keep tensor cores busy; this is crucial for MoE models, where only a subset of experts fire per token.
- At small batch sizes (typical for local/single‑user), MoE loses much of its efficiency advantage, leading to poor FLOP utilization and high cost per token.
- Several comments clarify that:
  - Attention and KV cache behave differently from dense MLP parts for batching.
  - Non‑determinism can arise from different kernel choices, parallelism, and MoE routing sensitivity to batch layout, even with fixed seed and temperature.
  - Requests in a batch should not semantically leak into each other, though some worry about this as a theoretical attack or implementation bug.
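
The batching argument at the top of this list can be made concrete with a toy cost model. The sketch below is not from the thread: it assumes uniform, independent routing, and the expert count (256), the top-k value (8), and the "one unit of memory per expert" are illustrative assumptions. It estimates how many distinct experts a batch touches in one MoE layer, and therefore how much expert weight traffic each token has to pay for.

```python
# Toy cost model for one MoE layer. All figures are illustrative assumptions,
# not measurements of any real deployment.

def expected_experts_touched(num_experts: int, top_k: int, batch_tokens: int) -> float:
    """Expected number of distinct experts hit by a batch.

    Assumes each token independently picks top_k distinct experts uniformly
    at random, so a given expert is missed by one token with probability
    1 - top_k / num_experts.
    """
    p_missed_by_all = (1.0 - top_k / num_experts) ** batch_tokens
    return num_experts * (1.0 - p_missed_by_all)


def expert_traffic_per_token(num_experts: int, top_k: int, batch_tokens: int,
                             bytes_per_expert: float = 1.0) -> float:
    """Expert weight bytes read from memory, divided across the batch."""
    touched = expected_experts_touched(num_experts, top_k, batch_tokens)
    return touched * bytes_per_expert / batch_tokens


if __name__ == "__main__":
    NUM_EXPERTS, TOP_K = 256, 8  # hypothetical configuration
    for batch in (1, 8, 64, 512, 4096):
        touched = expected_experts_touched(NUM_EXPERTS, TOP_K, batch)
        traffic = expert_traffic_per_token(NUM_EXPERTS, TOP_K, batch)
        print(f"batch={batch:5d}  experts touched={touched:7.1f}  "
              f"weight units read per token={traffic:.4f}")
```

Under these assumptions a single request pays for reading all of its own experts, while a large batch touches nearly every expert once and splits that cost across thousands of tokens. Real routing is not uniform, and attention and KV cache traffic scale differently, so this only illustrates the direction of the effect.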
Local vs Cloud Inference
- Running DeepSeek V3/R1 locally is seen as “expensive” mainly due to memory needs (hundreds of GB) and multi‑GPU requirements for good speed.
- Some users run quantized variants on high‑RAM CPU servers (e.g., EPYC/Xeon with 256–768 GB RAM) at 7–10 tokens/s, acceptable for personal use but much slower than cloud and with limited context.
- Others argue CPU‑only is poor “bang for buck” once prompts get large; a single strong GPU with a smaller dense model (e.g., ~20–30B) often yields a better interactive experience.
- Apple Silicon and high‑HBM AMD GPUs are discussed as interesting fits for MoE and large models, but AMD’s software/driver maturity is heavily debated.
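
The memory and speed figures above can be sanity-checked with back-of-envelope arithmetic. The sketch below is an assumption-laden estimate, not something from the thread: it uses the publicly reported DeepSeek V3 parameter counts (about 671B total, about 37B active per token), a hypothetical 4-bit quantization, and a hypothetical 400 GB/s of usable memory bandwidth. It ignores KV cache, activations, and compute limits.

```python
# Back-of-envelope estimates. Parameter counts are the publicly reported
# DeepSeek V3 figures; the quantization level and bandwidth are assumptions.

def weight_memory_gb(total_params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes)."""
    return total_params_billion * bits_per_weight / 8.0


def decode_tokens_per_s_ceiling(active_params_billion: float,
                                bits_per_weight: float,
                                mem_bandwidth_gb_s: float) -> float:
    """Bandwidth-bound ceiling for single-stream decoding.

    Each generated token must read the active weights once, so throughput
    cannot exceed bandwidth divided by the active weight bytes.
    """
    active_gb = active_params_billion * bits_per_weight / 8.0
    return mem_bandwidth_gb_s / active_gb


if __name__ == "__main__":
    TOTAL_B, ACTIVE_B = 671.0, 37.0   # reported totals, in billions of parameters
    BITS = 4.0                        # hypothetical quantization
    BANDWIDTH_GB_S = 400.0            # hypothetical high-RAM CPU server

    print(f"weights: ~{weight_memory_gb(TOTAL_B, BITS):.0f} GB")
    print(f"decode ceiling: ~{decode_tokens_per_s_ceiling(ACTIVE_B, BITS, BANDWIDTH_GB_S):.0f} tokens/s")
```

With these inputs the weights alone need hundreds of GB, and the single-user ceiling lands in the low tens of tokens per second, which is consistent with the 7–10 tokens/s users report once real-world overheads are included.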
Privacy, Safety, and Propaganda Concerns
- One participant claims ChatGPT exposed private GitHub repo contents; others strongly suspect hallucination and demand evidence. The alleged behavior is described as serious if true, but it remains unverified.
- DeepSeek is reported by one user to enthusiastically support violent prompts framed in a revolutionary‑socialist context, raising concerns about state‑aligned propaganda and asymmetric safety tuning.
- Broader worry that all LLMs will be powerful political‑messaging tools, regardless of country of origin.
Economics and “Rent‑Seeking” Debate
- Some compare per‑token billing to telecom “minutes,” calling it extractive; others counter that huge capex and opex make this straightforward cost recovery, not rent‑seeking.
- General expectation that current low prices are introductory and may rise once usage is entrenched and training costs grow.