DeepSeek-v3.2: Pushing the frontier of open large language models [pdf]
Model performance & technical approach
- Commenters describe DeepSeek‑V3.2 and especially the “Speciale” reasoning checkpoint as frontier‑level, with claims (from the paper/marketing) of surpassing GPT‑5 on some reasoning benchmarks and matching Gemini 3.0.
- Benchmarks in the paper show DeepSeek‑Speciale consistently near the top, but with much longer outputs; people note they are explicitly trading latency and cost for maximum benchmark scores via extended “thinking” traces.
- Technically, the big novelty is the new sparse attention scheme (DeepSeek Sparse Attention, DSA) plus heavy RL‑based post‑training for reasoning and agentic behavior, all described in detail and released as code; a minimal sketch of the top‑k selection idea appears after this list.
- Some see the benchmark race as increasingly marginal (1–2% at the top) and warn that many benchmarks are saturated or gameable.
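The core DSA idea, as summarized in the thread, is that a cheap indexer scores earlier tokens and full attention is then computed only over the top‑k of them. The sketch below illustrates that two‑stage pattern only; the function names, dimensions, and the plain dot‑product indexer are illustrative assumptions, not the paper's actual design.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def sparse_attention_topk(q, k, v, idx_q, idx_k, top_k):
    """Toy top-k sparse attention for a single query position.

    q:      (d,)       query vector for the current token
    k, v:   (T, d)     keys/values for all preceding tokens
    idx_q:  (d_idx,)   cheap "indexer" query (stand-in for a lightning-style indexer)
    idx_k:  (T, d_idx) cheap indexer keys for preceding tokens
    top_k:  number of past tokens the query actually attends to
    """
    # 1. Cheap indexer scores decide WHICH tokens to keep (O(T * d_idx)).
    index_scores = idx_k @ idx_q                      # (T,)
    keep = np.argsort(index_scores)[-top_k:]          # indices of the top-k tokens

    # 2. Full attention is computed only over the selected tokens (O(top_k * d)).
    scores = (k[keep] @ q) / np.sqrt(q.shape[-1])     # (top_k,)
    weights = softmax(scores)                         # (top_k,)
    return weights @ v[keep]                          # (d,)

# Tiny usage example with random data.
rng = np.random.default_rng(0)
T, d, d_idx = 1024, 64, 16
out = sparse_attention_topk(
    q=rng.standard_normal(d),
    k=rng.standard_normal((T, d)),
    v=rng.standard_normal((T, d)),
    idx_q=rng.standard_normal(d_idx),
    idx_k=rng.standard_normal((T, d_idx)),
    top_k=128,
)
print(out.shape)  # (64,)
```

The appeal of the two‑stage structure is that the indexer pass stays cheap (small hidden size, linear in context length), while the expensive attention math only touches the k retained tokens.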
Inference efficiency, speed & hardware
- DeepSeek is praised as dramatically cheaper per token than US frontier APIs, making “crank up the thinking” strategies viable.
- Real‑world speeds via OpenRouter and other providers are mixed: some report DeepSeek V3.2 slower than Claude/GPT/Gemini, others point to very fast deployments (e.g., GLM 4.6 on Cerebras).
- Running the full 685B MoE locally is possible but slow; people discuss Mac Studio 512GB, multi‑GPU rigs, and CPU+RAM builds where 10–20 tok/s is considered borderline but acceptable for some uses (see the rough bandwidth arithmetic after this list).
- Many agree truly large models mainly make sense on cloud or specialist providers; smaller distilled / MoE variants (Qwen, GLM, etc.) are preferred for home rigs.
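For the 10–20 tok/s local‑inference figures above, a back‑of‑the‑envelope check is useful: MoE decode speed is roughly bounded by how fast the active weights can be streamed from memory each token. The numbers below (≈37B active parameters per token, 4‑bit weights, nominal bandwidths) are assumptions for illustration, not measurements.

```python
def estimate_decode_tok_per_s(active_params_b, bytes_per_param, mem_bandwidth_gb_s):
    """Rough memory-bandwidth-bound decode estimate: each generated token must
    stream the active expert weights from memory at least once."""
    bytes_per_token = active_params_b * 1e9 * bytes_per_param
    return (mem_bandwidth_gb_s * 1e9) / bytes_per_token

# Assumptions (illustrative, not measured): ~37B active params per token for the
# 685B-total MoE, 4-bit quantization (~0.5 bytes/param).
for name, bw in [("Mac Studio class (~800 GB/s)", 800),
                 ("Dual-channel DDR5 (~90 GB/s)", 90)]:
    tok_s = estimate_decode_tok_per_s(37, 0.5, bw)
    print(f"{name}: ~{tok_s:.0f} tok/s upper bound")
```

These are upper bounds; KV‑cache traffic, expert routing overhead, and prompt processing push real throughput lower, which is consistent with the "borderline but acceptable" reports in the thread.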
Open weights, ecosystem & tooling
- The model is MIT‑licensed and open‑weights; several view this as a major counterweight to proprietary US labs and a way to erode their valuation moats.
- Open models enable local deployment, multi‑provider choice, reproducibility, and jurisdictional control, which some enterprises and researchers value highly.
- Tool‑calling and agentic capabilities are still seen as weaker than Claude; DeepSeek‑V3.2 is positioned more as an architectural/RL experiment than a tool‑calling workhorse.
- Some complain about DeepSeek’s unstable model IDs and opaque versioning on the hosted API, preferring pinned versions and date‑tagged IDs; a short pinned‑ID example follows this list.
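To illustrate the pinning complaint, the snippet below passes a date‑tagged model ID through an OpenAI‑compatible client (the interface that OpenRouter and DeepSeek's hosted API both expose). The specific model ID string and endpoint choice are hypothetical placeholders, not real identifiers.

```python
from openai import OpenAI  # OpenAI-compatible client; works against OpenRouter-style endpoints

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",  # or the provider's own endpoint
    api_key="sk-...",                          # placeholder
)

# A floating alias like "deepseek-chat" can silently change behavior when the
# provider swaps the underlying checkpoint; a date-tagged ID (hypothetical here)
# keeps results reproducible across runs.
MODEL_ID = "deepseek/deepseek-v3.2-2025-12-01"  # illustrative, not a real ID

resp = client.chat.completions.create(
    model=MODEL_ID,
    messages=[{"role": "user", "content": "Summarize the DSA attention scheme."}],
)
print(resp.choices[0].message.content)
```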
China vs US: geopolitics, trust & censorship
- A long subthread debates why Chinese labs are releasing strong open models while US labs lock down: suggested reasons include Western “safety”/IP concerns vs China’s desire to undercut US AI dominance.
- Several predict US restrictions on using Chinese models in corporations or government, comparing to chip and telecom bans.
- Enterprise consultants report strong resistance to anything “China‑linked,” regardless of hosting location, while others note that some big firms (e.g., in hospitality) are already adopting Chinese models for customer service.
- There is debate over state subsidies and strategic dumping vs simple technical efficiency; some see parallels with rare‑earths and other industries, others point to comparable US subsidies and hype.
Safety, alignment & censorship
- Some users find Chinese models more “censored” on politically sensitive questions even when run fully locally, implying the filter is in the weights, not just the UI.
- Others argue that any useful instruction‑following model necessarily reflects the values of its trainers and is “censored” by design; the alternative is an unhelpful raw text predictor.
UX, “vibes” & real-world performance
- Experiences are split: some say earlier Chinese models (Kimi, older DeepSeek) benchmarked well but felt brittle or overfit; others report DeepSeek V3.x, Kimi K2 Thinking, GLM 4.6 and Qwen as excellent in daily coding and reasoning work.
- “Vibe testing” (subjective feel, helpfulness, style) often diverges from benchmark rankings; several note that Claude and GPT have smoother UX and memory, while open Chinese models increasingly win on raw capability and cost.