Magistral — the first reasoning model by Mistral AI

Model performance, size, and benchmarks

  • Magistral Small (24B) is seen as very parameter-efficient relative to DeepSeek V3 (671B total / 37B active), with strong math/logic scores, especially under majority voting (sketched after this list).
  • Magistral Medium’s parameter count isn’t disclosed; some speculate it’s ~70B based on past leaks, but this is unconfirmed.
  • Many commenters note Magistral loses to DeepSeek-R1 on one‑shot benchmarks, and that Mistral compares against older R1 numbers rather than the stronger R1‑0528 release; this is viewed by some as selective or “outdated on release”.
  • Several people wish Magistral had been compared to Qwen3 (especially Qwen3‑30B‑A3B) and o3/o4‑mini, arguing those are the current reasoning SOTA in the same compute band.
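
“Majority voting” here is the usual self-consistency trick: sample several completions and keep the most frequent final answer. A minimal sketch, assuming a placeholder sample_answer callable that runs one generation and extracts its final answer:

```python
from collections import Counter

def majority_vote(question: str, sample_answer, k: int = 16) -> str:
    """maj@k / self-consistency: sample k independent answers at nonzero
    temperature and return the most common one.

    sample_answer is a placeholder callable that runs the model once on
    `question` and returns its extracted final answer as a string.
    """
    answers = [sample_answer(question) for _ in range(k)]
    winner, _count = Counter(answers).most_common(1)[0]
    return winner
```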

Training method and RL details

  • Discussion dives into the Magistral paper, which describes a GRPO variant with (see the sketch after this list):
    • KL penalty effectively removed (β=0),
    • length normalization of rewards,
    • minibatch advantage normalization,
    • relaxed trust region.
  • Some see dropping KL as a current “trend” without strong justification; others argue a KL penalty overly constrains how far the policy can move from the base checkpoint.
  • Questions are raised about the theoretical motivation and real benefit of minibatch advantage normalization; answers in-thread remain inconclusive.
  • Magistral uses SFT + RL; commenters note this often outperforms pure-RL models.
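
For concreteness, here is a minimal NumPy sketch of a GRPO-style update with those tweaks: group-relative advantages, an extra normalization over the minibatch, and a clipped objective with no KL term and a length-normalized loss. The clipping bounds and other details are illustrative assumptions, not a reproduction of the paper’s training code.

```python
import numpy as np

def grpo_advantages(rewards_per_prompt):
    """Standard GRPO: normalize each completion's reward by the mean/std of
    the completions sampled for the same prompt."""
    advs = []
    for rewards in rewards_per_prompt:            # one reward array per prompt
        r = np.asarray(rewards, dtype=np.float64)
        advs.append((r - r.mean()) / (r.std() + 1e-6))
    return advs

def minibatch_normalize(advantages):
    """Re-normalize advantages across the whole minibatch (the tweak whose
    theoretical benefit the thread left unresolved)."""
    flat = np.concatenate(advantages)
    mu, sigma = flat.mean(), flat.std() + 1e-6
    return [(a - mu) / sigma for a in advantages]

def clipped_policy_loss(ratios, advantages, gen_lengths,
                        eps_low=0.2, eps_high=0.3):
    """PPO-style clipped objective with no KL penalty (beta = 0), a relaxed
    upper clipping bound (eps_high > eps_low), and normalization by the total
    number of generated tokens (length normalization). Epsilon values are
    illustrative, not the paper's."""
    ratios = np.asarray(ratios)                   # per-token importance ratios
    adv = np.asarray(advantages)                  # per-token advantages
    unclipped = ratios * adv
    clipped = np.clip(ratios, 1.0 - eps_low, 1.0 + eps_high) * adv
    per_token = np.minimum(unclipped, clipped)
    return -per_token.sum() / np.sum(gen_lengths) # no KL term added back
```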

Local deployment and tools

  • Community GGUF builds are available and run on llama.cpp and Ollama; people share configs (quantization levels, Jinja chat templates, context sizes); a minimal loading sketch follows this list.
  • Magistral Small can run on a 4090 or a 32GB Mac after quantization; some run it on older GPUs (e.g., a 2080 Ti) and on CPUs, trading off speed and quality (heavier quantization tends to bring more hallucinations).
  • Tool calling is not yet wired up for the released Small GGUF; others point to Devstral (tool+code finetune) and ongoing work to add tools+thinking in Ollama.
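
As a starting point for local runs, here is a minimal sketch using the llama-cpp-python bindings. The GGUF filename, quantization level, context size, and system prompt are placeholders; take the actual chat template and recommended reasoning prompt from the model card of whichever community build you download.

```python
from llama_cpp import Llama

llm = Llama(
    model_path="Magistral-Small-Q4_K_M.gguf",  # placeholder community quant
    n_ctx=32768,       # reasoning traces run long, so leave room for them
    n_gpu_layers=-1,   # offload as much as fits (a Q4 quant fits on a 24GB 4090)
)

resp = llm.create_chat_completion(
    messages=[
        # Placeholder system prompt; use the one recommended in the model card.
        {"role": "system", "content": "Think step by step, then give a final answer."},
        {"role": "user", "content": "A train leaves at 14:10 and arrives at 16:45. How long is the trip?"},
    ],
    max_tokens=2048,
    temperature=0.7,
)
print(resp["choices"][0]["message"]["content"])
```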

Reasoning behavior and “thinking” debate

  • Some users find Magistral “overcooked”: heavy \boxed{} formatting, very long traces, and a tendency to skip the thinking phase unless the prescribed system prompt is used.
  • A “who was Hitler’s mother?” example shows the model “thinking” in an extremely repetitive loop over a trivial fact, seen as characteristic of reasoning RL pushed too far.
  • Large subthread debates whether LLM “thinking”/“reasoning” is real or just statistical token prediction:
    • One side insists anthropomorphic terms mislead laypeople and overclaim capability; cites recent “illusion of reasoning/thinking” papers.
    • Others argue “thinking” is a term of art for chain-of-thought; humans also fail, are inconsistent, and misreport their internal state, so these critiques don’t clearly separate humans from LLMs.
    • Meta‑point: terminology shapes expectations and downstream misuse.

Speed vs quality, and real-world use

  • Many praise Mistral’s latency: responses often arrive several times faster than major competitors on non‑web tasks; some view speed as Mistral’s real edge.
  • One team reports replacing o4‑mini with Magistral Medium in a JSON-heavy feature: latency drops from ~50–70 s to ~34–37 s, with slightly worse but acceptable quality (a call sketch follows this list).
  • Others counter that for deep research or coding, 4 tokens/s “reasoning” can be painful; speed matters most when long chains of thought or tool use are involved.
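
For illustration, a swap like the one described above can be as small as changing the base URL and model id in an OpenAI-style client; the endpoint, model id, and JSON-mode usage below are assumptions, not a description of that team’s setup.

```python
import json
from openai import OpenAI

# Mistral exposes an OpenAI-compatible chat endpoint; the model id is a placeholder.
client = OpenAI(base_url="https://api.mistral.ai/v1", api_key="...")

resp = client.chat.completions.create(
    model="magistral-medium-latest",          # placeholder model id
    response_format={"type": "json_object"},  # JSON mode for a JSON-heavy feature
    messages=[
        {"role": "system", "content": "Reply with a single JSON object."},
        {"role": "user", "content": 'Extract {"city": ..., "date": ...} from: "Meet me in Lyon on 2025-07-02."'},
    ],
)
print(json.loads(resp.choices[0].message.content))
```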

Comparisons to other open reasoning models

  • DeepSeek‑R1 (full and distills), Qwen3 reasoning variants, and Phi‑4 Reasoning are repeatedly cited as the main open-weight competitors.
  • Some see Qwen3‑30B‑A3B as the best “local” reasoning model today; Qwen3‑4B reportedly approaches, and sometimes beats, Magistral‑24B on shared benchmarks.
  • Several note Magistral’s advantage is being Apache‑licensed and small enough to run widely, even if raw reasoning scores lag Qwen/DeepSeek in some regimes.

Benchmarks, marketing, and transparency

  • Benchmark selection is criticized as narrow (mostly DeepSeek + Mistral baselines, few mainstream evals like MMLU‑Pro or LiveBench).
  • Some frame this as typical “marketing-driven” cherry-picking; others say small labs can’t afford to run every new baseline for every release.
  • Users appreciate fully visible reasoning traces and see them as valuable for auditability and business adoption, despite research showing trace correctness doesn’t always imply answer correctness; a trace-splitting sketch follows this list.
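
As a sketch of the auditability angle: the visible trace can be logged separately from the user-facing answer. The <think>…</think> delimiter below is an assumption (common in open reasoning models); check Magistral’s chat template for its actual markers.

```python
import re

# Assumes the trace is wrapped in <think>...</think>; verify the real markers
# in the model's chat template before relying on this in an audit pipeline.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)

def split_trace(completion: str) -> tuple[str, str]:
    """Return (reasoning_trace, final_answer): log the trace for review,
    show only the answer to end users."""
    m = THINK_RE.search(completion)
    trace = m.group(1).strip() if m else ""
    answer = THINK_RE.sub("", completion).strip()
    return trace, answer
```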

EU vs US/China ecosystem digression

  • Long meta‑thread uses Magistral vs DeepSeek as a springboard into:
    • EU regulation (cookies, privacy, AI rules), and whether it hinders innovation,
    • funding scarcity vs US megacorps and VC,
    • protectionism vs open markets (China as a counterexample),
    • quality of life vs “move fast & break things” economies.
  • Some argue Mistral is symbolically important for EU AI sovereignty even if it trails SOTA; others note its cap table is heavily non‑European.

Other observations and criticisms

  • Style: Mistral’s announcement overuses em‑dashes; some like the voice, others find it distracting or “LLM-ish.”
  • OCR: a previous Mistral OCR model badly disappointed at least one user compared with classic OCR tools, leading to skepticism about current marketing claims.
  • Ideological bias: one commenter reports Magistral sometimes gives more balanced answers on politically charged Wikipedia‑shaped topics than other models.
  • Tooling UX: Ollama’s defaults (distilled models, small contexts, naming) draw criticism; some recommend using llama.cpp directly for serious local experimentation.