Devstral

Performance & first impressions

  • Several users report Devstral is “impressive” for local coding assistance, handling tricky language-specific tasks (e.g., Ruby/RSpec) and large-context editing through tools like aider or Cline.
  • Others find it underwhelming or “atrocious” for file reading and tool calling when wired into generic agent frameworks, suggesting quality is highly setup-dependent.

Local deployment & hardware

  • Runs on a range of hardware: RTX 4090/4080 and 3090 GPUs, an AMD RX 6800 XT with 64GB of system RAM, and Apple Silicon (M2/M4 Air/Max, 24–128GB unified memory).
  • On underpowered setups (e.g., 8–12GB GPU, 16GB RAM Mac), it may technically run but be very slow or cause swapping/freezes.
  • Ollama’s ~14GB model download is used as a rough proxy for RAM needs; the rule of thumb is model size plus a few GB for context (see the sketch after this list). Keeping the model under ~20GB tends to coexist better with other apps on macOS.
  • First-token latency can be around a minute on high-end Macs with a large context; generation is much faster after that.
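
A back-of-the-envelope version of that rule of thumb, as a small Python sketch; the 14GB figure, the context overhead, and the OS reserve are rough assumptions from the discussion, not measured values:

    # Rough memory estimate for running a local model: file size plus
    # a few GB of headroom for KV cache and runtime overhead.
    # All figures are rough assumptions, not measurements.
    MODEL_SIZE_GB = 14.0       # e.g., the Ollama Devstral download
    CONTEXT_OVERHEAD_GB = 4.0  # KV cache + runtime; grows with context length

    def estimated_ram_gb(model_size_gb: float, overhead_gb: float) -> float:
        """Rule of thumb: model size plus a few GB for context."""
        return model_size_gb + overhead_gb

    def fits_comfortably(total_ram_gb: float, needed_gb: float,
                         os_reserve_gb: float = 6.0) -> bool:
        """Leave headroom for the OS and other apps."""
        return needed_gb + os_reserve_gb <= total_ram_gb

    needed = estimated_ram_gb(MODEL_SIZE_GB, CONTEXT_OVERHEAD_GB)
    print(f"Estimated need: {needed:.0f} GB")         # ~18GB, under the ~20GB mark
    print("24GB Mac:", fits_comfortably(24, needed))  # True
    print("16GB Mac:", fits_comfortably(16, needed))  # False: expect swapping/freezes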

Tool use and agent workflows

  • Devstral appears strongly tuned for a specific agent harness (OpenHands and similar scaffolds, with flows like search_repo, read_file, run_tests), excelling when used as part of that stack.
  • Multiple reports say generic tool-calling “hello world” tests fail: the model doesn’t reliably call arbitrary tools or use their results (a minimal reproduction sketch follows this list).
  • Some users report good agentic behavior in Cline and OpenHands; others cannot get tools to trigger at all in their own systems. This mismatch is a major point of confusion.
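
For reference, a generic tool-calling “hello world” of the kind those reports describe might look like the Python sketch below. It assumes an OpenAI-compatible local endpoint (here Ollama’s /v1 API) and a devstral model tag; both are assumptions, and the hello tool is purely illustrative:

    # Minimal tool-calling smoke test against a local OpenAI-compatible
    # endpoint. The base_url, model tag, and "hello" tool are assumptions.
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:11434/v1", api_key="unused")

    tools = [{
        "type": "function",
        "function": {
            "name": "hello",
            "description": "Return a greeting for the given name.",
            "parameters": {
                "type": "object",
                "properties": {"name": {"type": "string"}},
                "required": ["name"],
            },
        },
    }]

    resp = client.chat.completions.create(
        model="devstral",  # assumed model tag
        messages=[{"role": "user", "content": "Greet Ada using the hello tool."}],
        tools=tools,
    )

    msg = resp.choices[0].message
    if msg.tool_calls:
        for call in msg.tool_calls:
            print("tool:", call.function.name, "args:", call.function.arguments)
    else:
        # The failure mode users describe: prose instead of a tool call.
        print("No tool call emitted; got:", msg.content)

Whether the tool_calls branch fires at all is precisely what the conflicting reports disagree on.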

Benchmarks and trust

  • SWE-bench Verified results are described as extraordinarily high for an open model of this size, even rivaling or beating some Claude-based agent setups.
  • Several commenters are skeptical, suspecting heavy optimization for that benchmark or for a specific toolchain, and note that single benchmark numbers increasingly diverge from their real-world experience.
  • One user finds Devstral clearly worse than qwen3:30b on nontrivial Clojure tasks; others emphasize it’s not optimized for “write a function that does X” but for multi-step agent flows.

Model comparisons & use cases

  • Compared against Claude 3.7 and other hosted LLMs, many see Devstral as a “different class”: weaker raw capability but attractive for privacy, offline use, cost, and “doing the thinking yourself.”
  • Users mention Qwen, Gemma 3, GLM4, and various Q4 quantizations as alternatives; no consensus “best local” model, and performance often seems language-/task-dependent.

Licensing, openness, and strategy

  • Apache 2.0 licensing is widely praised versus restrictive “open weight” or Llama-style licenses. Some note Mistral has a strong open-weight track record, though not all of its models (e.g., Codestral) are open.
  • There is support for EU/public funding of Apache/MIT-licensed models as a strategic counterweight to big US/Chinese providers; Mistral is viewed by some as a promising “independent European alternative.”
  • A broader argument is that smaller model vendors should lean into open-source tooling (Aider, OpenHands, etc.) rather than build closed, fully autonomous agents, which many still see as premature and unreliable compared with assisted coding flows.