2026-05-14

A few words on DS4

What DS4 / DwarfStar4 Is

Small, model-specific inference runtime focused on running DeepSeek V4 Flash locally.
Optimized for Apple Metal and NVIDIA (esp. DGX Spark); ROCm support exists in a separate community-maintained branch.
Borrows ideas and some kernels from llama.cpp/GGML but aims to be a tightly scoped, vertically integrated implementation just for this model.
KV cache and long-context handling are first-class concerns; project is evolving quickly with many PRs and active filtering of low-quality contributions.

Hardware Requirements & Performance

Typical reported setup: 96–128 GB unified memory on recent Apple Silicon (M4/M5), or high-end NVIDIA GPUs (e.g., RTX 6000, 3090–5090 class).
Memory footprint for the Q2-ish quant is ~80 GB, which leaves room for KV cache and other apps on a 128 GB Mac.
Token speeds vary widely by hardware:
- Apple M5: generation ~30 t/s; prefill figures are contentious, with claims ranging from ~30 t/s (small prompt) up to ~400 t/s on more realistic prompts.
- RTX Pro 6000: prefill >100 t/s, generation ~50 t/s reported for a similar DeepSeek V4 quant.
Several comments warn that slow prefill makes agentic use (large contexts, tool traces) painful on slower setups.
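To make the prefill concern concrete, the sketch below turns the rates quoted above into wait times before the first output token. The 100,000-token prompt is a hypothetical agentic context, not a figure measured in the thread, and the calculation assumes the whole prompt is processed from scratch with no reuse of a cached KV prefix, so it is an upper bound for multi-turn sessions.

```python
def prefill_seconds(prompt_tokens: int, prefill_tps: float) -> float:
    """Seconds spent processing the prompt before generation starts."""
    if prefill_tps <= 0:
        raise ValueError("prefill_tps must be positive")
    return prompt_tokens / prefill_tps


def turn_seconds(prompt_tokens: int, output_tokens: int,
                 prefill_tps: float, gen_tps: float) -> float:
    """Total seconds for one turn: prefill plus token generation."""
    if gen_tps <= 0:
        raise ValueError("gen_tps must be positive")
    return prefill_seconds(prompt_tokens, prefill_tps) + output_tokens / gen_tps


# Prefill rates (tokens/second) as quoted in the thread.
RATES = {
    "Apple M5, low claim": 30.0,
    "Apple M5, high claim": 400.0,
    "RTX Pro 6000, lower bound": 100.0,
}

# Hypothetical prompt size for an agent session with long tool traces.
PROMPT_TOKENS = 100_000

if __name__ == "__main__":
    for label, tps in RATES.items():
        minutes = prefill_seconds(PROMPT_TOKENS, tps) / 60
        print(f"{label}: {minutes:.1f} min to first token")
```

Under these assumptions the gap between the low and high M5 claims is the difference between waiting most of an hour and waiting a few minutes, which is why the disputed prefill numbers matter so much for agentic use.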
Running on sub-96 GB machines may be technically possible via disk offload but is expected to be “way slower.”

Quality, Use Cases, and Comparisons

Multiple users report DS4 / DeepSeek V4 Flash as:
- Very strong at coding and tool use.
- Surprisingly good long-context reasoning (100k+ tokens) without obvious degradation.
- Competitive enough that some have replaced other “flash” or mid-tier frontier models for personal coding and learning.
Tool-calling reliability and interleaved “thinking” traces are highlighted as strengths.
Some OSS quantizations on third-party backends (e.g., OpenRouter) appear buggy or poorly configured, causing syntax errors; DS4’s own imatrix Q2 quant is reported as better.
Comparisons:
- DeepSeek V4 Pro sometimes beats popular frontier coding models in anecdotal tests but is slower; current promo pricing makes it very cheap per token.
- Benchmarks and one agent framework show DeepSeek V4 Flash/Pro performing well but still behind top proprietary models in difficult coding/agent tasks.
- Dense ~27–30B models (e.g., Qwen 3.6, Nemotron) at higher bit depths may offer better quality per unit of VRAM for some GPU setups; DS4’s 2-bit MoE instead spends its memory on a larger total parameter count at lower per-weight precision.
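The bit-depth tradeoff in the last comparison comes down to simple arithmetic, sketched below. The function names and the 24 GB budget are illustrative assumptions, not values from the thread, and the estimate covers weights only: it ignores quantization metadata (scales, mixed-precision layers), KV cache, and activations, so real footprints are larger.

```python
def weight_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes)."""
    return params_billions * bits_per_weight / 8


def params_that_fit(budget_gb: float, bits_per_weight: float) -> float:
    """Approximate parameter count (in billions) that fits in a weight budget."""
    return budget_gb * 8 / bits_per_weight


if __name__ == "__main__":
    # Dense models in the size range mentioned above, at higher bit depths.
    for params, bits in [(27, 8), (30, 4)]:
        print(f"{params}B dense at {bits}-bit: ~{weight_gb(params, bits):.0f} GB of weights")

    # Hypothetical 24 GB weight budget: fewer precise weights or more coarse ones.
    for bits in (8, 4, 2):
        print(f"24 GB at {bits}-bit: ~{params_that_fit(24, bits):.0f}B parameters")
```

The same memory holds four times as many parameters at 2-bit as at 8-bit; whether that buys better output than a smaller, more precise dense model is exactly what the commenters disagree on.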

Design Choices vs. llama.cpp and Other Runtimes

Some question why the project did not extend llama.cpp instead of building a new engine.
Arguments for a standalone C codebase:
- Easier to aggressively specialize and iterate (including using LLMs to generate/optimize code guarded by tests/benchmarks).
- Simpler, narrower code is easier to reason about than a mature, generic C++ stack.
- llama.cpp maintainers avoid PRs primarily written by AI, which blocks straightforward upstreaming.
- UX and “batteries included” defaults (known-good quant, one model) are seen as a key differentiator vs. knob-heavy generic tools.

Local vs. Cloud and Future Trajectory

Thread repeatedly contrasts:
- Local benefits: privacy, lower marginal cost, offline use, control over stack.
- Cloud benefits: faster prefill and throughput, larger and smarter models, no hardware spend.
Some see DS4-style setups on ~$5–6k machines as evidence the “genie” won’t go back in the bottle even if cloud frontier models become more expensive or restricted.
Ongoing debate about when “good enough” local intelligence for coding/agents will saturate:
- One view: smaller/cheaper models, run longer or in ensembles, may cover most real-world tasks, reducing demand for frontier APIs.
- Counterpoint: hardest problems will always reward more memory and compute, preserving a niche for large datacenter models.

Skepticism and Open Questions

Concerns about:
- Latency for serious agentic workflows on Mac-class hardware.
- Fragmentation of developer effort across multiple specialized runtimes.
- Over-enthusiasm and lack of rigorous personal benchmarking; arguments around what counts as “empirical.”
Some reports of issues with other DeepSeek quantizations on vLLM (e.g., looping generations), but these are not clearly attributable to DS4 itself.
Unclear how DS4 scales down on 32–48 GB machines or with heavy disk offload; several commenters want real data here.

Naming Confusion and Miscellany

Many initially misread “DS4” as Dark Souls 4, DualShock 4, or a car model, which illustrates how niche LLM terminology still is outside specialized circles.
