Magistral — the first reasoning model by Mistral AI
Model performance, size, and benchmarks
- Magistral Small (24B) is seen as very efficient relative to DeepSeek V3 (671B total / 37B active), with strong math/logic scores, especially under majority voting.
- Medium’s parameter count isn’t disclosed; some speculate it’s ~70B based on past leaks, but this is unconfirmed.
- Many commenters note Magistral loses to DeepSeek-R1 on one‑shot benchmarks, and that Mistral compares against older R1 numbers rather than the stronger R1‑0528 release; this is viewed by some as selective or “outdated on release”.
- Several people wish Magistral had been compared to Qwen3 (especially Qwen3‑30B‑A3B) and o3/o4‑mini, arguing those are current reasoning SOTA in the same compute band.
Training method and RL details
- Discussion dives into the Magistral paper: a GRPO variant with the following modifications (a minimal sketch appears after this list):
- KL penalty effectively removed (β=0),
- length normalization of the loss,
- advantage normalization over the minibatch,
- relaxed trust region (asymmetric clipping bounds).
- Some see dropping the KL penalty as a current “trend” without strong justification; others say the KL term can overly constrain how far the model moves from the base checkpoint.
- Questions are raised about the theoretical motivation and real benefit of minibatch advantage normalization; answers in-thread remain inconclusive.
- Magistral uses SFT + RL; commenters note this often outperforms pure-RL models.
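A minimal sketch of the objective described above, treating the whole minibatch as a single group for simplicity: no KL term, advantages normalized over the minibatch, an asymmetric clipping range for the relaxed trust region, and the loss normalized by generated-token count. Tensor names, shapes, and the epsilon values are illustrative assumptions, not the paper's code.

```python
# Minimal sketch of the GRPO-style objective discussed above: no KL
# penalty, minibatch-normalized advantages, asymmetric clipping, and
# loss normalized by generated-token count. Illustrative only.
import torch

def grpo_variant_loss(logp_new, logp_old, rewards, mask,
                      eps_low=0.2, eps_high=0.3):
    """
    logp_new, logp_old: (batch, seq) per-token log-probs under the
                        current and behavior policies
    rewards:            (batch,) scalar reward per sampled completion
    mask:               (batch, seq) 1.0 for generated tokens, 0.0 for padding
    """
    # Group-relative baseline (here the whole minibatch is one group),
    # then normalize advantages over the minibatch; this is the step
    # whose benefit the thread questioned.
    adv = rewards - rewards.mean()
    adv = adv / (adv.std() + 1e-8)
    adv = adv.unsqueeze(-1)                       # broadcast over tokens

    ratio = torch.exp(logp_new - logp_old)        # importance ratio
    # "Relaxed trust region": clip less aggressively on the upside
    # (eps_high > eps_low) so rare tokens can still gain probability.
    clipped = torch.clamp(ratio, 1.0 - eps_low, 1.0 + eps_high)
    per_token = -torch.min(ratio * adv, clipped * adv)

    # Length normalization: divide by the total number of generated
    # tokens rather than averaging per sequence, to avoid length bias.
    loss = (per_token * mask).sum() / mask.sum()
    # No KL(pi || pi_ref) term is added (beta = 0).
    return loss
```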
Local deployment and tools
- Community GGUF builds are available and run on llama.cpp and Ollama; people share configs (quantization levels, Jinja chat templates, context sizes), and a minimal local-run sketch appears after this list.
- Magistral Small can run on a 4090 or a 32 GB Mac after quantization; some run it on older GPUs (e.g., a 2080 Ti) or on CPU, accepting slower generation and, at heavier quantization, more hallucinations.
- Tool calling is not yet wired up for the released Small GGUF; others point to Devstral (tool+code finetune) and ongoing work to add tools+thinking in Ollama.
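For local experimentation, a rough sketch of loading a quantized build through the llama-cpp-python bindings: the GGUF filename, context size, and sampling settings are placeholder assumptions, and the system prompt should be replaced with the official one from the model card.

```python
# Sketch: running a community Magistral Small GGUF via llama-cpp-python.
# Filename, context size, and sampling settings are placeholders.
from llama_cpp import Llama

llm = Llama(
    model_path="Magistral-Small-Q4_K_M.gguf",  # hypothetical community build
    n_ctx=32768,       # generous context for long reasoning traces
    n_gpu_layers=-1,   # offload all layers (e.g., a 4090); set 0 for CPU-only
)

out = llm.create_chat_completion(
    messages=[
        # Replace with the official Magistral reasoning system prompt
        # from the model card; without it the model may skip thinking.
        {"role": "system", "content": "<official Magistral system prompt>"},
        {"role": "user", "content": "Is 2^31 - 1 prime? Explain briefly."},
    ],
    temperature=0.7,
    max_tokens=4096,
)
print(out["choices"][0]["message"]["content"])
```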
Reasoning behavior and “thinking” debate
- Some users find Magistral “overcooked”: heavy \boxed{} formatting, very long traces, and it may forget to think without the prescribed system prompt.
- An example asking for a trivial fact about Hitler’s mother shows the model “thinking” in an extremely repetitive loop, seen as characteristic of reasoning RL pushed too far.
- Large subthread debates whether LLM “thinking”/“reasoning” is real or just statistical token prediction:
- One side insists anthropomorphic terms mislead laypeople and overclaim capability, citing recent “illusion of reasoning/thinking” papers.
- Others argue “thinking” is a term of art for chain-of-thought; humans also fail, are inconsistent, and misreport their internal state, so these critiques don’t clearly separate humans from LLMs.
- Meta‑point: terminology shapes expectations and downstream misuse.
Speed vs quality, and real-world use
- Many praise Mistral’s latency: responses often arrive several times faster than those from major competitors on non‑web tasks; some view speed as Mistral’s real edge.
- One team reports swapping o4‑mini for Magistral Medium in a JSON-heavy feature (a sketch of such a swap follows this list): latency drops from ~50–70s to ~34–37s with slightly worse but acceptable quality.
- Others counter that for deep research or coding, 4 tokens/s “reasoning” can be painful; speed matters most when long chains of thought or tool use are involved.
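As a rough illustration of the swap reported above, assuming Mistral's chat-completions endpoint is used through an OpenAI-style client: the base URL, model id, and JSON-mode flag below are assumptions to verify against Mistral's current documentation.

```python
# Sketch of swapping an OpenAI model for Magistral Medium behind an
# OpenAI-style chat-completions client. Base URL, model id, and
# JSON-mode support are assumptions; check Mistral's docs.
import os
from openai import OpenAI

# client = OpenAI()                      # previously: o4-mini via OpenAI
client = OpenAI(
    base_url="https://api.mistral.ai/v1",
    api_key=os.environ["MISTRAL_API_KEY"],
)

resp = client.chat.completions.create(
    model="magistral-medium-latest",     # assumed model id; verify on La Plateforme
    messages=[
        {"role": "system", "content": "Reply with a single JSON object."},
        {"role": "user", "content": "Extract {name, amount, currency} from: 'Refund Bob 30 EUR'."},
    ],
    response_format={"type": "json_object"},   # JSON mode, if supported
    temperature=0.2,
)
print(resp.choices[0].message.content)
```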
Comparisons to other open reasoning models
- DeepSeek‑R1 (full and distills), Qwen3 reasoning variants, and Phi‑4 Reasoning are repeatedly cited as the main open-weight competitors.
- Some see Qwen3‑30B‑A3B as the best “local” reasoning model today; Qwen3‑4B reportedly approaches, and sometimes beats, Magistral‑24B on shared benchmarks.
- Several note Magistral’s advantage is being Apache‑licensed and small enough to run widely, even if raw reasoning scores lag Qwen/DeepSeek in some regimes.
Benchmarks, marketing, and transparency
- Benchmark selection is criticized as narrow (mostly DeepSeek + Mistral baselines, few mainstream evals like MMLU‑Pro or LiveBench).
- Some frame this as typical “marketing-driven” cherry-picking; others say small labs can’t afford to run every new baseline for every release.
- Users appreciate fully visible reasoning traces and see them as valuable for auditability and business adoption—despite research showing trace correctness doesn’t always imply answer correctness.
EU vs US/China ecosystem digression
- A long meta‑thread uses Magistral vs DeepSeek as a springboard into:
- EU regulation (cookies, privacy, AI rules), and whether it hinders innovation,
- funding scarcity vs US megacorps and VC,
- protectionism vs open markets (China as a counterexample),
- quality of life vs “move fast & break things” economies.
- Some argue Mistral is symbolically important for EU AI sovereignty even if it trails SOTA; others note its cap table is heavily non‑European.
Other observations and criticisms
- Style: Mistral’s announcement overuses em‑dashes; some like the voice, others find it distracting or “LLM-ish.”
- OCR: a previous Mistral OCR model badly disappointed at least one user vs classic tools, leading to skepticism about current marketing claims.
- Ideological bias: one commenter reports Magistral sometimes gives more balanced answers on politically charged Wikipedia‑shaped topics than other models.
- Tooling UX: Ollama’s defaults (distilled models, small contexts, naming) draw criticism; some recommend using llama.cpp directly for serious local experimentation.