Why DeepSeek is cheap at scale but expensive to run locally

DeepSeek’s Pricing vs Competitors

  • Commenters note DeepSeek is very cheap at scale, but not 1/100th the price of competitors; closer to 1/10–1/20 of top US models, and actually more expensive than some budget options such as Gemini Flash.
  • Many regard its efficiency as a genuine engineering achievement (MoE + batching), but point out that other providers can still be substantially more expensive per token.

Batching, MoE, and Non‑Determinism

  • Core explanation: large batches let providers amortize memory reads and keep tensor cores busy; this is crucial for MoE models, where only a subset of experts fire per token.
  • At small batch sizes (typical for local/single‑user), MoE loses much of its efficiency advantage, leading to poor FLOP utilization and high cost per token.
  • Several comments clarify that:
    • Attention and the KV cache batch differently from the dense MLP/expert layers: each request's KV-cache reads grow with its own context length and are not amortized across the batch the way shared weight reads are.
    • Non‑determinism can arise from different kernel choices, parallelism, and MoE routing sensitivity to batch layout, even with fixed seed and temperature.
    • Requests in a batch should not semantically leak into each other, though some worry about this as a theoretical attack or implementation bug.
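The batching point can be made concrete with a toy routing simulation. Assuming, for illustration, an expert layout like DeepSeek-V3's published shape (256 routed experts, top-8 routing per token; the uniform-random routing here is a simplification), a batch of one token activates only 8 experts, while a large batch touches nearly all of them, so each expert's weights, once read from memory, serve many tokens:

```python
import random

def expert_utilization(batch_tokens, n_experts=256, top_k=8, seed=0):
    """Simulate uniform-random top-k routing for one MoE layer.

    Returns (fraction of experts that receive at least one token,
    average tokens served per touched expert)."""
    rng = random.Random(seed)
    touched = set()
    for _ in range(batch_tokens):
        # Each token is routed to top_k distinct experts.
        touched.update(rng.sample(range(n_experts), top_k))
    return len(touched) / n_experts, batch_tokens * top_k / len(touched)

for batch in (1, 8, 128, 1024):
    frac, tok_per_expert = expert_utilization(batch)
    print(f"batch={batch:5d}  experts touched={frac:6.1%}  "
          f"tokens per touched expert={tok_per_expert:6.1f}")
```

At batch size 1 every expert read from memory serves exactly one token; at batch 1024 the same read is amortized over dozens of tokens, which is the efficiency the large providers capture and the local single-user setup loses.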
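The non-determinism point is not specific to LLMs: floating-point addition is not associative, so any change in reduction order (a different kernel, a different parallel split, or a different batch layout) can flip low-order bits, which greedy or top-k sampling can then amplify into different output tokens. A minimal illustration:

```python
# Floating-point addition is not associative: summing the same three
# values in a different order gives a bitwise-different result.
a, b, c = 0.1, 0.2, 0.3

left = (a + b) + c   # one reduction order
right = a + (b + c)  # another order, e.g. a different parallel split

print(left, right, left == right)  # → 0.6000000000000001 0.6 False
```

The absolute difference is tiny, but if two logits for candidate tokens are this close, the argmax, and therefore the generated text, can differ between runs even with a fixed seed and temperature 0.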

Local vs Cloud Inference

  • Running DeepSeek V3/R1 locally is seen as “expensive” mainly due to memory needs (hundreds of GB) and multi‑GPU requirements for good speed.
  • Some users run quantized variants on high‑RAM CPU servers (e.g., EPYC/Xeon with 256–768 GB RAM) at 7–10 tokens/s, acceptable for personal use but much slower than cloud and with limited context.
  • Others argue CPU‑only is poor “bang for buck” once prompts get large, since prompt prefill is compute‑bound; a single strong GPU running a smaller dense model (e.g., ~20–30B) often yields a better interactive experience.
  • Apple Silicon and high‑HBM AMD GPUs are discussed as interesting fits for MoE and large models, but AMD’s software/driver maturity is heavily debated.
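The “hundreds of GB” figure follows from simple arithmetic. Taking DeepSeek-V3's roughly 671B total parameters (the published figure; real deployments add KV cache, activations, and framework overhead on top), weight memory alone scales with the quantization width:

```python
def weights_gb(n_params_b, bits_per_param):
    """Approximate weight memory in GB: billions of parameters
    times bits per parameter, converted to bytes."""
    return n_params_b * 1e9 * bits_per_param / 8 / 1e9

# Weight footprint of a ~671B-parameter model at common precisions.
for bits, label in [(16, "fp16/bf16"), (8, "int8/fp8"), (4, "4-bit quant")]:
    print(f"{label:12s}: ~{weights_gb(671, bits):6.1f} GB of weights")
```

Even at 4-bit quantization the weights alone exceed a single consumer GPU's VRAM by an order of magnitude, which is why the local setups in the thread reach for 256–768 GB of system RAM or multi-GPU rigs.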

Privacy, Safety, and Propaganda Concerns

  • One participant claims ChatGPT exposed private GitHub repo contents; others strongly suspect hallucination and demand evidence. The alleged behavior would be serious if true, but remains unverified.
  • One user reports that DeepSeek enthusiastically supports violent prompts when they are framed in a revolutionary‑socialist context, raising concerns about state‑aligned propaganda and asymmetric safety tuning.
  • Broader worry that all LLMs will be powerful political‑messaging tools, regardless of country of origin.

Economics and “Rent‑Seeking” Debate

  • Some compare per‑token billing to telecom “minutes,” calling it extractive; others counter that huge capex and opex make this straightforward cost recovery, not rent‑seeking.
  • General expectation that current low prices are introductory and may rise once usage is entrenched and training costs grow.