DeepSeek-R1

Model capabilities & benchmarks

  • Many commenters are impressed by DeepSeek-R1’s math/coding benchmarks; some say the small distilled models (7B–8B) approach or beat GPT-4/Claude 3.5 Sonnet on specific tests, especially math and LeetCode-style coding problems.
  • Strong skepticism that an 8B model is truly “Sonnet-class” in broad capability; several note this likely reflects benchmark narrowness or overfitting.
  • Some users who tried the API/models report R1 is very strong on structured reasoning, math, and algorithmic problems, weaker and more erratic on general “real-world” use.

Reasoning behavior & limitations

  • The exposed “thinking” traces are a major point of fascination; people like seeing the chain-of-thought, and compare it to o1’s hidden reasoning.
  • Multiple “strawberry” / letter-counting and simple puzzle tests show:
    • It can sometimes reason correctly, yet override correct reasoning with incorrect “gut” priors.
    • It often overthinks, loops, or doubts itself.
  • Several note that tokenization and lack of character-level modeling make spelling/letter-count tasks inherently awkward.
  • Some report the models are verbose, rambling, and slow for interactive coding/chat, though great for deep one-shot problems.
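The tokenization point above can be sketched concretely. A minimal illustration of why letter-counting is awkward for subword-tokenized models; the token split shown is a hypothetical BPE-style segmentation, not the actual DeepSeek tokenizer output:

```python
def count_letter(word: str, letter: str) -> int:
    """Character-level counting: trivial for ordinary code."""
    return word.count(letter)

# The character-level answer is unambiguous:
assert count_letter("strawberry", "r") == 3

# But a model never sees "strawberry" as 10 characters; it sees opaque
# subword units, e.g. something like (hypothetical segmentation):
hypothetical_tokens = ["str", "aw", "berry"]

# The 'r' counts are hidden inside tokens, so the model must have
# effectively memorized each token's spelling to answer correctly:
per_token_r = [t.count("r") for t in hypothetical_tokens]
print(per_token_r)       # [1, 0, 2]
print(sum(per_token_r))  # 3
```

Any segmentation that splits the word differently changes which per-token spellings the model needs to know, which is why these tasks probe memorization more than reasoning.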

Training, RL, and distillation

  • Highlighted as important: the R1 pipeline demonstrates RL-only reasoning discovery (R1-Zero uses no SFT in that core stage), with the released R1 adding a small cold-start SFT step and alignment RL, then distillation into smaller Qwen/Llama models.
  • Commenters see this as a proof that pure RL can induce reasoning patterns, especially in “closed” domains with clear rewards (math, tests, code).
  • Distilled models (1.5B–70B) seem to carry over much of the reasoning, with 7B–14B seen as a sweet spot for local use.
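The “clear rewards in closed domains” idea above can be sketched as a verifiable reward function: the answer is checked programmatically, so no learned reward model is needed. Function names and the answer format here are illustrative assumptions, not from the DeepSeek codebase:

```python
import re

def math_reward(completion: str, gold_answer: str) -> float:
    """Reward 1.0 iff the final \\boxed{...} answer matches the gold answer."""
    match = re.search(r"\\boxed\{([^}]*)\}", completion)
    if match is None:
        return 0.0  # malformed or unformatted output earns nothing
    return 1.0 if match.group(1).strip() == gold_answer else 0.0

print(math_reward(r"... so the total is \boxed{42}", "42"))  # 1.0
print(math_reward(r"... the answer is 41", "42"))            # 0.0
```

Because the reward is binary and exact, RL can optimize against it directly; this is precisely why math, unit tests, and code execution are the domains where commenters expect pure RL to work best.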

Local deployment & hardware

  • GGUF quantized models are already available; many report success with:
    • 7B/8B on laptops, M-series Macs, and modest GPUs.
    • 32B/70B on high-RAM desktops, or with heavy quantization, at slower throughput.
  • Tools mentioned: Ollama, llama.cpp, LM Studio, Open WebUI, various HF Spaces.
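The hardware claims above can be back-of-envelope checked with a weight-size estimate. The bits-per-weight figures are rough community averages for common GGUF quant levels (an assumption, not exact format internals), and the estimate excludes KV cache and runtime overhead:

```python
# Approximate average bits per weight for common GGUF quantization levels.
QUANT_BITS = {"Q4_K_M": 4.8, "Q5_K_M": 5.7, "Q8_0": 8.5, "F16": 16.0}

def weight_gb(params_billion: float, quant: str) -> float:
    """Approximate model weight size in GB (no KV cache, no overhead)."""
    bits = QUANT_BITS[quant]
    return params_billion * 1e9 * bits / 8 / 1e9

for n in (8, 32, 70):
    print(f"{n}B @ Q4_K_M: ~{weight_gb(n, 'Q4_K_M'):.1f} GB")
# 8B @ Q4_K_M: ~4.8 GB  -> fits on laptops / M-series Macs
# 32B @ Q4_K_M: ~19.2 GB, 70B @ Q4_K_M: ~42.0 GB -> high-RAM desktops
```

This lines up with the reports above: ~5 GB for 7B/8B is laptop territory, while 32B/70B need tens of GB even at 4-bit.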

Reliability, censorship & safety

  • Several say DeepSeek models feel less reliable than GPT-4o/Claude for day-to-day coding or ambiguous tasks; benchmarks don’t fully capture “trustworthiness.”
  • Cloud version is heavily censored on Chinese political topics; local open-weight models can be less restricted, though some safety tuning remains.
  • Concerns raised about hosted APIs training on user data; open weights mitigate this when run locally.

Open-source, geopolitics & business impact

  • MIT-licensed weights and permissive commercial use seen as a direct challenge to closed US labs.
  • Some frame this as part of a Chinese national strategy and as sanctions “backfiring.”
  • Others stress that DeepSeek, like Mistral and other labs, builds on prior open research from big US/EU labs, but still credit it with impressive “fast follow” engineering.