A few words on DS4

What DS4 / DwarfStar4 Is

  • Small, model-specific inference runtime focused on running DeepSeek V4 Flash locally.
  • Optimized for Apple Metal and NVIDIA (esp. DGX Spark); ROCm support exists in a separate community-maintained branch.
  • Derives ideas and some kernels from llama.cpp/GGML, but aims to be a tightly scoped, vertically integrated implementation for this one model.
  • KV cache and long-context handling are first-class concerns. The project is evolving quickly, with many PRs and active filtering of low-quality contributions.

Hardware Requirements & Performance

  • Typical reported setup: 96–128 GB unified memory on recent Apple Silicon (M4/M5), or high-end NVIDIA GPUs (e.g., RTX 6000, 3090–5090 class).
  • Memory footprint for the Q2-ish quant is ~80 GB, which leaves room for KV cache and other apps on a 128 GB Mac.
  • Token speeds vary widely by hardware:
    • Apple M5: generation ~30 t/s; prefill figures are contentious, with claims ranging from ~30 t/s (small prompt) up to ~400 t/s on more realistic prompts.
    • RTX Pro 6000: prefill >100 t/s and generation ~50 t/s reported for a similar DeepSeek V4 quant.
  • Several comments warn that slow prefill makes agentic use (large contexts, tool traces) painful on slower setups.
  • Running on sub-96 GB machines may be technically possible via disk offload, but is expected to be “way slower.”
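
The memory and prefill figures above can be turned into rough back-of-envelope numbers. The sketch below is illustrative only: the function names, the 16 GB reserve for the OS and other apps, the 50,000-token prompt, and the 600-token reply are assumptions, not figures from the thread. It also assumes no cached prompt prefix is reused, so it gives a worst-case wait per turn.

```python
def kv_headroom_gb(total_gb: float, weights_gb: float, reserved_gb: float = 0.0) -> float:
    """Memory left for the KV cache after model weights and a reserve for other software."""
    return total_gb - weights_gb - reserved_gb


def prefill_seconds(prompt_tokens: int, prefill_tps: float) -> float:
    """Time to process the prompt, assuming none of it is already cached."""
    if prefill_tps <= 0:
        raise ValueError("prefill_tps must be positive")
    return prompt_tokens / prefill_tps


def turn_seconds(prompt_tokens: int, output_tokens: int,
                 prefill_tps: float, generation_tps: float) -> float:
    """Total wall time for one turn: prompt processing plus token generation."""
    if generation_tps <= 0:
        raise ValueError("generation_tps must be positive")
    return prefill_seconds(prompt_tokens, prefill_tps) + output_tokens / generation_tps


# 128 GB Mac, ~80 GB of weights, 16 GB reserved (the reserve is an assumption).
print(kv_headroom_gb(128, 80, 16))            # 32

# Hypothetical 50,000-token agent context at the two contested M5 prefill rates.
print(round(prefill_seconds(50_000, 30)))     # 1667 seconds, roughly 28 minutes
print(prefill_seconds(50_000, 400))           # 125.0 seconds

# Same prompt plus a 600-token reply at ~30 t/s generation.
print(turn_seconds(50_000, 600, 400, 30))     # 145.0 seconds
```

The spread between the two prefill claims is why the figure is contested: for the same assumed prompt it is the difference between about two minutes and about half an hour before the first token appears.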

Quality, Use Cases, and Comparisons

  • Multiple users report DS4 / DeepSeek V4 Flash as:
    • Very strong at coding and tool use.
    • Surprisingly good at long-context reasoning (100k+ tokens) without obvious degradation.
    • Competitive enough that some have replaced other “flash” or mid-tier frontier models for personal coding and learning.
  • Tool-calling reliability and interleaved “thinking” traces are highlighted as strengths.
  • Some OSS quantizations on third-party backends (e.g., OpenRouter) appear buggy or poorly configured, causing syntax errors; DS4’s own imatrix Q2 quant is reported as better.
  • Comparisons:
    • DeepSeek V4 Pro sometimes beats popular frontier coding models in anecdotal tests but is slower; current promo pricing makes it very cheap per token.
    • Benchmarks and one agent framework show DeepSeek V4 Flash/Pro performing well but still behind top proprietary models in difficult coding/agent tasks.
    • Dense ~27–30B models (e.g., Qwen 3.6, Nemotron) at higher bit depths may offer better quality per unit VRAM for some GPU setups; DS4’s 2-bit MoE instead spends a much larger memory budget to gain model capacity at low precision.
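
The VRAM side of the dense-versus-MoE comparison can be approximated with simple arithmetic. The sketch below is a rough estimate under stated assumptions: the helper name is hypothetical, it counts weights only, and it ignores quantization overhead such as per-block scales, the KV cache, and activations. The ~80 GB reference point is the DS4 footprint reported in the thread.

```python
def weight_footprint_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (10^9 bytes): parameters * bits / 8."""
    return params_billion * bits_per_weight / 8.0


DS4_REPORTED_GB = 80.0  # ~80 GB Q2-ish footprint, as reported in the thread

# Dense models in the 27-30B range at a few common bit depths.
for params_b in (27, 30):
    for bits in (4, 8, 16):
        gb = weight_footprint_gb(params_b, bits)
        share = gb / DS4_REPORTED_GB
        print(f"{params_b}B dense @ {bits}-bit: ~{gb:.1f} GB "
              f"({share:.0%} of the reported DS4 footprint)")
```

By this estimate a 27B dense model at 8 bits needs about 27 GB for weights, roughly a third of the reported DS4 footprint, which is the basis of the "quality per unit VRAM" argument. Whether the larger 2-bit MoE is actually better per gigabyte is a quality question this arithmetic cannot answer.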

Design Choices vs. llama.cpp and Other Runtimes

  • Some question why the project builds a new engine instead of extending llama.cpp.
  • Arguments for a standalone C codebase:
    • Easier to aggressively specialize and iterate (including using LLMs to generate/optimize code guarded by tests/benchmarks).
    • Simpler, narrower code is easier to reason about than a mature, generic C++ stack.
    • llama.cpp maintainers avoid PRs written primarily by AI, which blocks straightforward upstreaming.
    • UX and “batteries included” defaults (known-good quant, one model) are seen as a key differentiator vs. knob-heavy generic tools.

Local vs. Cloud and Future Trajectory

  • Thread repeatedly contrasts:
    • Local benefits: privacy, lower marginal cost, offline use, control over stack.
    • Cloud benefits: faster prefill and throughput, larger and smarter models, no hardware spend.
  • Some see DS4-style setups on ~$5–6k machines as evidence the “genie” won’t go back in the bottle even if cloud frontier models become more expensive or restricted.
  • Ongoing debate about when “good enough” local intelligence for coding/agents will saturate:
    • One view: smaller/cheaper models, run longer or in ensembles, may cover most real-world tasks, reducing demand for frontier APIs.
    • Counterpoint: hardest problems will always reward more memory and compute, preserving a niche for large datacenter models.

Skepticism and Open Questions

  • Concerns about:
    • Latency for serious agentic workflows on Mac-class hardware.
    • Fragmentation of developer effort across multiple specialized runtimes.
    • Over-enthusiasm and lack of rigorous personal benchmarking; arguments around what counts as “empirical.”
  • Some report issues with other DeepSeek quantizations on vLLM (e.g., looping generations), but these are not clearly attributable to DS4 itself.
  • Unclear how DS4 scales down on 32–48 GB machines or with heavy disk offload; several commenters want real data here.
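
For readers who want their own numbers rather than anecdotes, a minimal measurement sketch follows. It is not part of DS4 and uses no DS4 API: `measure_stream` is a hypothetical helper that wraps any iterable yielding one generated token at a time, and `prompt_tokens` must be supplied by the caller. It treats time to first token as prefill time, which slightly overstates prefill because it also includes the first decode step.

```python
import time
from typing import Callable, Iterable, Optional


def measure_stream(stream: Iterable[str], prompt_tokens: int,
                   clock: Callable[[], float] = time.perf_counter) -> dict:
    """Consume a token stream and report prefill and generation speed in tokens/s."""
    start = clock()
    first: Optional[float] = None
    last: Optional[float] = None
    generated = 0
    for _ in stream:
        now = clock()
        if first is None:
            first = now  # time to first token, used as a proxy for prefill
        last = now
        generated += 1
    if first is None or last is None:
        raise ValueError("stream produced no tokens")

    prefill_s = first - start
    decode_s = last - first  # covers the tokens produced after the first one
    return {
        "generated_tokens": generated,
        "prefill_tps": prompt_tokens / prefill_s if prefill_s > 0 else None,
        "generation_tps": (generated - 1) / decode_s
        if generated > 1 and decode_s > 0 else None,
    }


if __name__ == "__main__":
    def fake_model():
        """Stand-in for a real streaming runtime; the delays are arbitrary."""
        time.sleep(0.05)       # pretend prefill
        for tok in ["hello", " ", "world"]:
            yield tok
            time.sleep(0.01)   # pretend per-token decode time

    print(measure_stream(fake_model(), prompt_tokens=100))
```

Reporting the prompt length, quant, and hardware alongside these two numbers, and repeating the run a few times, would make claims like the contested prefill figures directly comparable.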

Naming Confusion and Miscellany

  • Many initially misread “DS4” as Dark Souls 4, DualShock 4, or a car model, which illustrates how niche LLM terminology still is outside specialized circles.