Devstral

Performance & first impressions

  • Several users report Devstral is “impressive” for local coding assistance, handling tricky language-specific tasks (e.g., Ruby/RSpec) and large-context editing through tools like aider or Cline.
  • Others find it underwhelming or “atrocious” for file reading and tool calling when wired into generic agent frameworks, suggesting quality is highly setup-dependent.

Local deployment & hardware

  • Runs on a range of hardware: RTX 4090/4080 and 3090 GPUs, an AMD RX 6800 XT with 64GB of system RAM, and Apple Silicon (M2/M4 Air/Max, 24–128GB unified memory).
  • On underpowered setups (e.g., 8–12GB GPU, 16GB RAM Mac), it may technically run but be very slow or cause swapping/freezes.
  • Ollama’s ~14GB model download is used as a rough proxy for RAM needs; the rule of thumb is model size plus a few GB for context (see the sketch after this list). Keeping the model under ~20GB tends to coexist better with other apps on macOS.
  • First-token latency can be around a minute on high-end Macs with a large context; generation is much faster after that.
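
A back-of-the-envelope version of that rule of thumb, as a small Python sketch; the 14GB figure, the context overhead, and the OS reserve are rough assumptions from the discussion, not measured values:

    # Rough memory estimate for running a local model: file size plus
    # a few GB of headroom for KV cache and runtime overhead.
    # All figures are rough assumptions, not measurements.
    MODEL_SIZE_GB = 14.0       # e.g., the Ollama Devstral download
    CONTEXT_OVERHEAD_GB = 4.0  # KV cache + runtime; grows with context length

    def estimated_ram_gb(model_size_gb: float, overhead_gb: float) -> float:
        """Rule of thumb: model size plus a few GB for context."""
        return model_size_gb + overhead_gb

    def fits_comfortably(total_ram_gb: float, needed_gb: float,
                         os_reserve_gb: float = 6.0) -> bool:
        """Leave headroom for the OS and other apps."""
        return needed_gb + os_reserve_gb <= total_ram_gb

    needed = estimated_ram_gb(MODEL_SIZE_GB, CONTEXT_OVERHEAD_GB)
    print(f"Estimated need: {needed:.0f} GB")         # ~18GB, under the ~20GB mark
    print("24GB Mac:", fits_comfortably(24, needed))  # True
    print("16GB Mac:", fits_comfortably(16, needed))  # False: expect swapping/freezes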

Tool use and agent workflows

  • Devstral appears strongly tuned for a specific agent harness (OpenHands and similar scaffolds, with flows like search_repo, read_file, run_tests), excelling when used as part of that stack.
  • Multiple reports say generic tool-calling “hello world” tests fail: the model doesn’t reliably call arbitrary tools or use their results (a minimal reproduction sketch follows this list).
  • Some users report good agentic behavior in Cline and OpenHands; others cannot get tools to trigger at all in their own systems. This mismatch is a major point of confusion.
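
For reference, a generic tool-calling “hello world” of the kind those reports describe might look like the Python sketch below. It assumes an OpenAI-compatible local endpoint (here Ollama’s /v1 API) and a devstral model tag; both are assumptions, and the hello tool is purely illustrative:

    # Minimal tool-calling smoke test against a local OpenAI-compatible
    # endpoint. The base_url, model tag, and "hello" tool are assumptions.
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:11434/v1", api_key="unused")

    tools = [{
        "type": "function",
        "function": {
            "name": "hello",
            "description": "Return a greeting for the given name.",
            "parameters": {
                "type": "object",
                "properties": {"name": {"type": "string"}},
                "required": ["name"],
            },
        },
    }]

    resp = client.chat.completions.create(
        model="devstral",  # assumed model tag
        messages=[{"role": "user", "content": "Greet Ada using the hello tool."}],
        tools=tools,
    )

    msg = resp.choices[0].message
    if msg.tool_calls:
        for call in msg.tool_calls:
            print("tool:", call.function.name, "args:", call.function.arguments)
    else:
        # The failure mode users describe: prose instead of a tool call.
        print("No tool call emitted; got:", msg.content)

Whether the tool_calls branch fires at all is precisely what the conflicting reports disagree on.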

Benchmarks and trust

  • SWE-bench Verified results are described as extraordinarily high for an open model of this size, even rivaling or beating some Claude-based agent setups.
  • Several commenters are skeptical, suspecting heavy optimization for that benchmark or for a specific toolchain, and note that single benchmark numbers increasingly diverge from their real-world experience.
  • One user finds Devstral clearly worse than qwen3:30b on nontrivial Clojure tasks; others emphasize it’s not optimized for “write a function that does X” but for multi-step agent flows.

Model comparisons & use cases

  • Compared against Claude 3.7 and other hosted LLMs, many see Devstral as a “different class”: weaker raw capability but attractive for privacy, offline use, cost, and “doing the thinking yourself.”
  • Users mention Qwen, Gemma 3, GLM4, and various Q4 quantizations as alternatives; no consensus “best local” model, and performance often seems language-/task-dependent.

Licensing, openness, and strategy

  • Apache 2.0 licensing is widely praised versus restrictive “open weight” or Llama-style licenses. Some note Mistral has a strong open-weight track record, though not all of its models (e.g., Codestral) are open.
  • There is support for EU/public funding of Apache/MIT-licensed models as a strategic counterweight to big US/Chinese providers; Mistral is viewed by some as a promising “independent European alternative.”
  • A broader argument is that smaller model vendors should lean into open-source tooling (Aider, OpenHands, etc.) rather than build closed, fully autonomous agents, which many still see as premature and unreliable compared with assisted coding flows.