A few words on DS4
What DS4 / DwarfStar4 Is
- Small, model-specific inference runtime focused on running DeepSeek V4 Flash locally.
- Optimized for Apple Metal and NVIDIA (esp. DGX Spark); ROCm support exists in a separate community-maintained branch.
- Derives ideas and some kernels from llama.cpp/GGML, but aims to be a tightly scoped, vertically integrated implementation for this one model.
- KV cache and long-context handling are first-class concerns; the project is evolving quickly, with many PRs and active filtering of low-quality contributions.
Hardware Requirements & Performance
- Typical reported setup: 96–128 GB unified memory on recent Apple Silicon (M4/M5), or high-end NVIDIA GPUs (e.g., RTX 6000, 3090–5090 class).
- Memory footprint for the Q2-ish quant is ~80 GB, which leaves room for the KV cache and other apps on a 128 GB Mac.
- Token speeds vary widely by hardware:
  - Apple M5: generation ~30 t/s; prefill figures are contentious, with claims ranging from ~30 t/s (small prompt) up to ~400 t/s on more realistic prompts.
  - RTX Pro 6000: prefill >100 t/s and generation ~50 t/s reported for a similar DeepSeek V4 quant.
- Several comments warn that slow prefill makes agentic use (large contexts, tool traces) painful on slower setups.
- Running on sub-96 GB machines may be technically possible via disk offload, but is expected to be “way slower.”
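To make the prefill concern concrete, here is a rough back-of-the-envelope sketch in Python. It uses only the throughput figures quoted above; the prompt and output sizes are hypothetical, and the model assumes no prompt caching or KV cache reuse between turns, so it is an illustration of the arithmetic, not a benchmark. Note that the ~30 t/s prefill claim was made for a small prompt, so applying it to a large prompt is a pessimistic extrapolation.

```python
def prefill_seconds(prompt_tokens: int, prefill_tps: float) -> float:
    """Seconds spent processing the prompt before the first output token."""
    if prefill_tps <= 0:
        raise ValueError("prefill_tps must be positive")
    return prompt_tokens / prefill_tps


def turn_seconds(prompt_tokens: int, output_tokens: int,
                 prefill_tps: float, gen_tps: float) -> float:
    """Total seconds for one turn: prefill time plus generation time."""
    if gen_tps <= 0:
        raise ValueError("gen_tps must be positive")
    return prefill_seconds(prompt_tokens, prefill_tps) + output_tokens / gen_tps


# (prefill t/s, generation t/s) as reported in the thread.
RATES = {
    "Apple M5, low prefill claim": (30.0, 30.0),
    "Apple M5, high prefill claim": (400.0, 30.0),
    "RTX Pro 6000, prefill lower bound": (100.0, 50.0),
}

# Hypothetical agentic turn: a large context with tool traces, short reply.
PROMPT_TOKENS = 50_000
OUTPUT_TOKENS = 500

for name, (prefill_tps, gen_tps) in RATES.items():
    total = turn_seconds(PROMPT_TOKENS, OUTPUT_TOKENS, prefill_tps, gen_tps)
    wait = prefill_seconds(PROMPT_TOKENS, prefill_tps)
    print(f"{name}: first token after {wait:.0f} s, full turn {total:.0f} s")
```

Under these assumptions the wait before the first token is dominated by prefill, which is why the disputed prefill numbers matter more for agent loops than the generation rate does.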
Quality, Use Cases, and Comparisons
- Multiple users report DS4 / DeepSeek V4 Flash as:
  - Very strong at coding and tool use.
  - Surprisingly good at long-context reasoning (100k+ tokens) without obvious degradation.
  - Competitive enough that some have replaced other “flash” or mid-tier frontier models with it for personal coding and learning.
- Tool-calling reliability and interleaved “thinking” traces are highlighted as strengths.
- Some OSS quantizations served by third-party backends (e.g., OpenRouter) appear buggy or poorly configured and produce syntax errors; DS4’s own imatrix Q2 quant is reported to work better.
- Comparisons:
  - DeepSeek V4 Pro sometimes beats popular frontier coding models in anecdotal tests but is slower; current promo pricing makes it very cheap per token.
  - Benchmarks and one agent framework show DeepSeek V4 Flash/Pro performing well but still behind top proprietary models on difficult coding/agent tasks.
  - Dense ~27–30B models (e.g., Qwen 3.6, Nemotron) at higher bit depths may offer better quality per unit of VRAM on some GPU setups; DS4’s 2-bit MoE instead spends its memory on a larger total parameter count at lower precision.
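The memory side of that dense-versus-MoE tradeoff can be sketched with simple arithmetic: weight storage is roughly parameter count times bits per weight. The Python below is a minimal sketch under stated assumptions. The bit depths are illustrative, the effective bits per weight of a "Q2-ish" quant is not given in the thread (real quant formats add scales and keep some tensors at higher precision), and the result is a rough budget, not the model's actual parameter count. KV cache and activations are ignored.

```python
def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8.0


def params_billion_that_fit(memory_gb: float, bits_per_weight: float) -> float:
    """Approximate parameter budget (in billions) for a given weight memory."""
    if bits_per_weight <= 0:
        raise ValueError("bits_per_weight must be positive")
    return memory_gb * 8.0 / bits_per_weight


# Dense models in the size range mentioned in the thread, at assumed bit depths.
for params in (27, 30):
    for bits in (4, 8):
        print(f"dense {params}B at {bits}-bit: ~{weights_gb(params, bits):.1f} GB")

# The ~80 GB footprint quoted for the Q2-ish quant, under two assumed
# effective bit depths (hypothetical values chosen for illustration).
for bits in (2.0, 2.5):
    budget = params_billion_that_fit(80, bits)
    print(f"80 GB at {bits} bits/weight: room for ~{budget:.0f}B parameters")
```

The point of the comparison is only the order of magnitude: a dense model of this size fits in a fraction of the memory even at 8 bits, while the same ~80 GB at around 2 bits holds several times more parameters, each stored far less precisely.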
Design Choices vs. llama.cpp and Other Runtimes
- Some ask why the project did not extend llama.cpp instead of writing a new engine.
- Arguments for a standalone C codebase:
  - Easier to aggressively specialize and iterate (including using LLMs to generate/optimize code guarded by tests/benchmarks).
  - Simpler, narrower code is easier to reason about than a mature, generic C++ stack.
  - llama.cpp maintainers avoid PRs primarily written by AI, which blocks straightforward upstreaming.
- UX and “batteries included” defaults (known-good quant, one model) are seen as a key differentiator vs. knob-heavy generic tools.
Local vs. Cloud and Future Trajectory
- Thread repeatedly contrasts:
  - Local benefits: privacy, lower marginal cost, offline use, control over stack.
  - Cloud benefits: faster prefill and throughput, larger and smarter models, no hardware spend.
- Some see DS4-style setups on ~$5–6k machines as evidence the “genie” won’t go back in the bottle even if cloud frontier models become more expensive or restricted.
- Ongoing debate about when “good enough” local intelligence for coding/agents will saturate:
  - One view: smaller/cheaper models, run longer or in ensembles, may cover most real-world tasks, reducing demand for frontier APIs.
  - Counterpoint: hardest problems will always reward more memory and compute, preserving a niche for large datacenter models.
Skepticism and Open Questions
- Concerns about:
  - Latency for serious agentic workflows on Mac-class hardware.
  - Fragmentation of developer effort across multiple specialized runtimes.
  - Over-enthusiasm and lack of rigorous personal benchmarking; arguments around what counts as “empirical.”
- Some reports of issues with other DeepSeek quantizations on vLLM (e.g., looping generations), but not clearly attributable to DS4 itself.
- Unclear how DS4 scales down on 32–48 GB machines or with heavy disk offload; several commenters want real data here.
Naming Confusion and Miscellany
- Many initially misread “DS4” as Dark Souls 4, DualShock 4, or a car model, which illustrates how niche LLM terminology still is outside specialized circles.