Magistral — the first reasoning model by Mistral AI

Model performance, size, and benchmarks

  • Magistral Small (24B) is seen as very parameter-efficient relative to DeepSeek V3 (671B total / 37B active), with strong math/logic scores, especially under majority voting (sketched after this list).
  • Magistral Medium’s parameter count isn’t disclosed; some speculate it’s ~70B based on past leaks, but this is unconfirmed.
  • Many commenters note Magistral loses to DeepSeek-R1 on one‑shot benchmarks, and that Mistral compares against older R1 numbers rather than the stronger R1‑0528 release; this is viewed by some as selective or “outdated on release”.
  • Several people wish Magistral had been compared to Qwen3 (especially Qwen3‑30B‑A3B) and o3/o4‑mini, arguing those are the current reasoning SOTA in the same compute band.
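
“Majority voting” here is the usual self-consistency trick: sample several completions and keep the most frequent final answer. A minimal sketch, assuming a placeholder sample_answer callable that runs one generation and extracts its final answer:

```python
from collections import Counter

def majority_vote(question: str, sample_answer, k: int = 16) -> str:
    """maj@k / self-consistency: sample k independent answers at nonzero
    temperature and return the most common one.

    sample_answer is a placeholder callable that runs the model once on
    `question` and returns its extracted final answer as a string.
    """
    answers = [sample_answer(question) for _ in range(k)]
    winner, _count = Counter(answers).most_common(1)[0]
    return winner
```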

Training method and RL details

  • Discussion dives into the Magistral paper, which describes a GRPO variant with (see the sketch after this list):
    • KL penalty effectively removed (β=0),
    • length normalization of rewards,
    • minibatch advantage normalization,
    • relaxed trust region.
  • Some see dropping KL as a current “trend” without strong justification; others argue a KL penalty overly constrains how far the policy can move from the base checkpoint.
  • Questions are raised about the theoretical motivation and real benefit of minibatch advantage normalization; answers in-thread remain inconclusive.
  • Magistral uses SFT + RL; commenters note this often outperforms pure-RL models.
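
For concreteness, here is a minimal NumPy sketch of a GRPO-style update with those tweaks: group-relative advantages, an extra normalization over the minibatch, and a clipped objective with no KL term and a length-normalized loss. The clipping bounds and other details are illustrative assumptions, not a reproduction of the paper’s training code.

```python
import numpy as np

def grpo_advantages(rewards_per_prompt):
    """Standard GRPO: normalize each completion's reward by the mean/std of
    the completions sampled for the same prompt."""
    advs = []
    for rewards in rewards_per_prompt:            # one reward array per prompt
        r = np.asarray(rewards, dtype=np.float64)
        advs.append((r - r.mean()) / (r.std() + 1e-6))
    return advs

def minibatch_normalize(advantages):
    """Re-normalize advantages across the whole minibatch (the tweak whose
    theoretical benefit the thread left unresolved)."""
    flat = np.concatenate(advantages)
    mu, sigma = flat.mean(), flat.std() + 1e-6
    return [(a - mu) / sigma for a in advantages]

def clipped_policy_loss(ratios, advantages, gen_lengths,
                        eps_low=0.2, eps_high=0.3):
    """PPO-style clipped objective with no KL penalty (beta = 0), a relaxed
    upper clipping bound (eps_high > eps_low), and normalization by the total
    number of generated tokens (length normalization). Epsilon values are
    illustrative, not the paper's."""
    ratios = np.asarray(ratios)                   # per-token importance ratios
    adv = np.asarray(advantages)                  # per-token advantages
    unclipped = ratios * adv
    clipped = np.clip(ratios, 1.0 - eps_low, 1.0 + eps_high) * adv
    per_token = np.minimum(unclipped, clipped)
    return -per_token.sum() / np.sum(gen_lengths) # no KL term added back
```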

Local deployment and tools

  • Community GGUF builds are available and run on llama.cpp and Ollama; people share configs (quantization levels, Jinja chat templates, context sizes); a minimal loading sketch follows this list.
  • Magistral Small can run on a 4090 or a 32GB Mac after quantization; some run it on older GPUs (e.g., a 2080 Ti) and on CPUs, trading off speed and quality (heavier quantization tends to bring more hallucinations).
  • Tool calling is not yet wired up for the released Small GGUF; others point to Devstral (tool+code finetune) and ongoing work to add tools+thinking in Ollama.
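
As a starting point for local runs, here is a minimal sketch using the llama-cpp-python bindings. The GGUF filename, quantization level, context size, and system prompt are placeholders; take the actual chat template and recommended reasoning prompt from the model card of whichever community build you download.

```python
from llama_cpp import Llama

llm = Llama(
    model_path="Magistral-Small-Q4_K_M.gguf",  # placeholder community quant
    n_ctx=32768,       # reasoning traces run long, so leave room for them
    n_gpu_layers=-1,   # offload as much as fits (a Q4 quant fits on a 24GB 4090)
)

resp = llm.create_chat_completion(
    messages=[
        # Placeholder system prompt; use the one recommended in the model card.
        {"role": "system", "content": "Think step by step, then give a final answer."},
        {"role": "user", "content": "A train leaves at 14:10 and arrives at 16:45. How long is the trip?"},
    ],
    max_tokens=2048,
    temperature=0.7,
)
print(resp["choices"][0]["message"]["content"])
```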

Reasoning behavior and “thinking” debate

  • Some users find Magistral “overcooked”: heavy \boxed{} formatting, very long traces, and a tendency to skip the thinking phase unless the prescribed system prompt is used.
  • A “who was Hitler’s mother?” example shows the model “thinking” in an extremely repetitive loop over a trivial fact, seen as characteristic of reasoning RL pushed too far.
  • Large subthread debates whether LLM “thinking”/“reasoning” is real or just statistical token prediction:
    • One side insists anthropomorphic terms mislead laypeople and overclaim capability; cites recent “illusion of reasoning/thinking” papers.
    • Others argue “thinking” is a term of art for chain-of-thought; humans also fail, are inconsistent, and misreport their internal state, so these critiques don’t clearly separate humans from LLMs.
    • Meta‑point: terminology shapes expectations and downstream misuse.

Speed vs quality, and real-world use

  • Many praise Mistral’s latency: responses often arrive several times faster than major competitors on non‑web tasks; some view speed as Mistral’s real edge.
  • One team reports replacing o4‑mini with Magistral Medium in a JSON-heavy feature: latency drops from ~50–70 s to ~34–37 s, with slightly worse but acceptable quality (a call sketch follows this list).
  • Others counter that for deep research or coding, 4 tokens/s “reasoning” can be painful; speed matters most when long chains of thought or tool use are involved.
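
For illustration, a swap like the one described above can be as small as changing the base URL and model id in an OpenAI-style client; the endpoint, model id, and JSON-mode usage below are assumptions, not a description of that team’s setup.

```python
import json
from openai import OpenAI

# Mistral exposes an OpenAI-compatible chat endpoint; the model id is a placeholder.
client = OpenAI(base_url="https://api.mistral.ai/v1", api_key="...")

resp = client.chat.completions.create(
    model="magistral-medium-latest",          # placeholder model id
    response_format={"type": "json_object"},  # JSON mode for a JSON-heavy feature
    messages=[
        {"role": "system", "content": "Reply with a single JSON object."},
        {"role": "user", "content": 'Extract {"city": ..., "date": ...} from: "Meet me in Lyon on 2025-07-02."'},
    ],
)
print(json.loads(resp.choices[0].message.content))
```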

Comparisons to other open reasoning models

  • DeepSeek‑R1 (full and distills), Qwen3 reasoning variants, and Phi‑4 Reasoning are repeatedly cited as the main open-weight competitors.
  • Some see Qwen3‑30B‑A3B as the best “local” reasoning model today; Qwen3‑4B reportedly approaches, and sometimes beats, Magistral‑24B on shared benchmarks.
  • Several note Magistral’s advantage is being Apache‑licensed and small enough to run widely, even if raw reasoning scores lag Qwen/DeepSeek in some regimes.

Benchmarks, marketing, and transparency

  • Benchmark selection is criticized as narrow (mostly DeepSeek + Mistral baselines, few mainstream evals like MMLU‑Pro or LiveBench).
  • Some frame this as typical “marketing-driven” cherry-picking; others say small labs can’t afford to run every new baseline for every release.
  • Users appreciate fully visible reasoning traces and see them as valuable for auditability and business adoption, despite research showing trace correctness doesn’t always imply answer correctness; a trace-splitting sketch follows this list.
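
As a sketch of the auditability angle: the visible trace can be logged separately from the user-facing answer. The <think>…</think> delimiter below is an assumption (common in open reasoning models); check Magistral’s chat template for its actual markers.

```python
import re

# Assumes the trace is wrapped in <think>...</think>; verify the real markers
# in the model's chat template before relying on this in an audit pipeline.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)

def split_trace(completion: str) -> tuple[str, str]:
    """Return (reasoning_trace, final_answer): log the trace for review,
    show only the answer to end users."""
    m = THINK_RE.search(completion)
    trace = m.group(1).strip() if m else ""
    answer = THINK_RE.sub("", completion).strip()
    return trace, answer
```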

EU vs US/China ecosystem digression

  • Long meta‑thread uses Magistral vs DeepSeek as a springboard into:
    • EU regulation (cookies, privacy, AI rules), and whether it hinders innovation,
    • funding scarcity vs US megacorps and VC,
    • protectionism vs open markets (China as a counterexample),
    • quality of life vs “move fast & break things” economies.
  • Some argue Mistral is symbolically important for EU AI sovereignty even if it trails SOTA; others note its cap table is heavily non‑European.

Other observations and criticisms

  • Style: Mistral’s announcement overuses em‑dashes; some like the voice, others find it distracting or “LLM-ish.”
  • OCR: a previous Mistral OCR model badly disappointed at least one user compared with classic OCR tools, leading to skepticism about current marketing claims.
  • Ideological bias: one commenter reports Magistral sometimes gives more balanced answers on politically charged Wikipedia‑shaped topics than other models.
  • Tooling UX: Ollama’s defaults (distilled models, small contexts, naming) draw criticism; some recommend using llama.cpp directly for serious local experimentation.