DeepSeek-R1
Model capabilities & benchmarks
- Many commenters impressed by DeepSeek-R1’s math/coding benchmarks; some say small distilled models (7B–8B) approach or beat GPT-4/Claude 3.5 on specific tests, especially math and LeetCode-like coding.
- Strong skepticism that an 8B model is truly “Sonnet-class” in broad capability; several note this likely reflects benchmark narrowness or overfitting.
- Some users who tried the API/models report R1 is very strong on structured reasoning, math, and algorithmic problems, weaker and more erratic on general “real-world” use.
Reasoning behavior & limitations
- The exposed “thinking” traces are a major point of fascination; people like seeing the chain-of-thought, and compare it to o1’s hidden reasoning.
- Multiple “strawberry” / letter-counting and simple puzzle tests show:
  - It can sometimes reason correctly, yet override correct reasoning with incorrect “gut” priors.
  - It often overthinks, loops, or doubts itself.
- Several note that tokenization and lack of character-level modeling make spelling/letter-count tasks inherently awkward.
- Some report the models are verbose, rambling, and slow for interactive coding/chat, though great for deep one-shot problems.
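Since the visible traces come up repeatedly above, here is a minimal sketch of separating the trace from the final answer, assuming the `<think>...</think>` tag convention that DeepSeek-R1 uses in its completions:

```python
import re

def split_reasoning(output: str) -> tuple[str, str]:
    """Split an R1-style completion into (thinking trace, final answer).

    Assumes the chain-of-thought is wrapped in <think>...</think>;
    returns an empty trace if no tags are found.
    """
    match = re.search(r"<think>(.*?)</think>", output, flags=re.DOTALL)
    if not match:
        return "", output.strip()
    trace = match.group(1).strip()
    answer = output[match.end():].strip()
    return trace, answer

trace, answer = split_reasoning(
    "<think>s-t-r-a-w-b-e-r-r-y... three r's.</think>There are 3 r's."
)
```

This is also the practical reason interactive chat feels slow: the trace tokens are generated (and billed) before the answer appears, even if a UI hides them.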
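The tokenization point can be made concrete: the model never sees individual characters, only subword tokens, so a per-character question maps awkwardly onto its input. The token split below is purely illustrative — real BPE vocabularies differ per model:

```python
# Character-level view vs. a hypothetical subword split of "strawberry".
# The split ["str", "aw", "berry"] is an illustrative assumption, not any
# particular model's actual tokenization.
word = "strawberry"
tokens = ["str", "aw", "berry"]

true_count = word.count("r")                 # character-level answer: 3
per_token = [t.count("r") for t in tokens]   # counts are scattered: [1, 0, 2]
```

No single token "contains" the answer, so the model must aggregate across units it cannot directly inspect — hence the characteristic looping and self-doubt on these tasks.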
Training, RL, and distillation
- Highlighted as important: R1-Zero shows reasoning can emerge from RL alone (no SFT at all); the full R1 pipeline adds a small cold-start SFT stage before reasoning RL, further SFT/RL alignment rounds, and finally distillation into smaller Qwen/Llama models.
- Commenters see this as a proof that pure RL can induce reasoning patterns, especially in “closed” domains with clear rewards (math, tests, code).
- Distilled models (1.5B–70B) seem to carry over much of the reasoning, with 7B–14B seen as a sweet spot for local use.
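Why "closed" domains work well for RL: the reward can be computed by rules, with no learned reward model. A sketch of such a rule-based reward, loosely modeled on the accuracy-plus-format rewards described for R1 (the weights and tag convention here are illustrative assumptions):

```python
import re

def reasoning_reward(completion: str, reference_answer: str) -> float:
    """Rule-based reward sketch for RL on verifiable tasks.

    Illustrative assumptions: a small format reward for wrapping the
    trace in <think>...</think>, plus a larger accuracy reward for an
    exact match against the known reference answer.
    """
    reward = 0.0
    # Format reward: chain-of-thought must appear inside <think> tags.
    if re.search(r"<think>.*?</think>", completion, flags=re.DOTALL):
        reward += 0.2
    # Accuracy reward: the text after stripping the trace must match.
    answer = re.sub(r"<think>.*?</think>", "", completion,
                    flags=re.DOTALL).strip()
    if answer == reference_answer.strip():
        reward += 1.0
    return reward
```

Math answers and passing test suites admit this kind of exact check; "write me a good essay" does not — which matches the capability profile commenters report.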
Local deployment & hardware
- GGUF quantized models are already available; many report success with:
  - 7B/8B on laptops, M-series Macs, and modest GPUs.
  - 32B/70B on high-RAM desktops, or with heavy quantization at slower throughput.
- Tools mentioned: Ollama, llama.cpp, LM Studio, Open WebUI, various HF Spaces.
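For reference, a typical local setup with two of the tools above (model tags and the GGUF filename are illustrative — check the Ollama library and your quant's actual filename):

```shell
# Pull and chat with a distilled R1 model via Ollama.
ollama pull deepseek-r1:8b
ollama run deepseek-r1:8b "Prove that sqrt(2) is irrational."

# Or run a GGUF quant directly with llama.cpp (filename illustrative).
./llama-cli -m DeepSeek-R1-Distill-Llama-8B-Q4_K_M.gguf \
    -p "Prove that sqrt(2) is irrational."
```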
Reliability, censorship & safety
- Several say DeepSeek models feel less reliable than GPT-4o/Claude for day-to-day coding or ambiguous tasks; benchmarks don’t fully capture “trustworthiness.”
- Cloud version is heavily censored on Chinese political topics; local open-weight models can be less restricted, though some safety tuning remains.
- Concerns raised about hosted APIs training on user data; open weights mitigate this when run locally.
Open-source, geopolitics & business impact
- MIT-licensed weights and permissive commercial use seen as a direct challenge to closed US labs.
- Some frame this as part of a Chinese national strategy and as sanctions “backfiring.”
- Others stress that DeepSeek, like Mistral and others, builds on prior open research from big US/EU labs, but still does impressive “fast follow” engineering.