Mistral Small 3

Position in the AI landscape

  • Seen as Mistral’s move to stay relevant against OpenAI, DeepSeek, Qwen, Llama, etc.; some commenters say Mistral’s earlier models had fallen behind Llama.
  • Several comments compare it to GPT‑4o‑mini; some say performance is “on par or better,” others dismiss that tier as only good for chatty “fun” use.
  • Google’s Gemini line is repeatedly brought up as a quiet but very strong competitor; some claim Gemini 2.0 / exp models are now leading, others report regressions on long-context comprehension.

Model size, performance & hardware

  • 24B parameters hits a “sweet spot” for local use: when quantized, it fits on 24GB cards such as the RTX 4090, as well as high‑RAM Macs.
  • Reported speeds (quantized): ~14 tok/s on M2 Max 64GB, ~16 tok/s on 4090 laptop, ~20 tok/s on 7900 XTX, lower on M1 Pro.
  • Debate over VRAM vs. system RAM: many users can’t fit larger models; some would trade slower inference for bigger models, while others argue memory bandwidth, not capacity, is the real bottleneck.
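The memory-bandwidth point can be made concrete with a rough back-of-envelope calculation: during decoding, each generated token streams roughly the whole model through memory once, so bandwidth divided by quantized model size gives an upper bound on tokens per second. A minimal sketch (the bandwidth figure is an illustrative assumption, not a measured spec):

```python
def max_tokens_per_sec(params_b: float, bits_per_weight: float, bandwidth_gb_s: float) -> float:
    """Rough upper bound on decode speed for a memory-bandwidth-bound LLM.

    Each generated token reads (approximately) every weight once, so
    throughput <= bandwidth / model_bytes. Ignores KV cache, activations,
    and compute limits -- real speeds are lower.
    """
    model_gb = params_b * bits_per_weight / 8  # params_b is in billions, so this is GB
    return bandwidth_gb_s / model_gb

# Illustrative: a 24B model at 4-bit quantization (~12 GB of weights)
# on hardware with an assumed ~400 GB/s of memory bandwidth:
print(round(max_tokens_per_sec(24, 4, 400), 1))  # → 33.3
```

This is why reported speeds cluster by memory bandwidth (M2 Max vs. 7900 XTX vs. M1 Pro) rather than by raw compute.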

Training choices & synthetic data

  • Mistral states the model was trained without RL and without synthetic data; some find the absence of synthetic data “strange,” while others note complaints that synthetic‑heavy models overfit to STEM and struggle with fuzzier tasks.
  • People speculate about later RL-style reasoning finetunes (à la DeepSeek) on top of this base.

Licensing, “open source” and copyright

  • Announcement that general‑purpose models are moving back to Apache 2.0 is welcomed as a big win for local and commercial use.
  • Thread stresses this applies to weights; training code and datasets remain closed.
  • Long debate over whether model weights are copyrightable, and whether calling such releases “open source” is misleading:
    • One side: weights-only releases are akin to binaries; should be called “open weights,” not FOSS.
    • Other side: open weights are already hugely valuable (self‑hosting, fine‑tuning, commercialization) even without full data pipelines.

Use cases for “small” models

  • Suggested uses: local assistants, automated workflows, RAG, classification/tagging, ETL entity extraction, sentiment/feedback analysis, fraud detection, triage, on‑device control, coding assistance, structured JSON/tool calling.
  • Several practitioners say recent instruction-following improvements make small LLMs viable for many classification and extraction tasks, often after prompt tuning and benchmarking vs traditional ML.

Benchmarks & evaluations

  • One external evaluation on the MATH (hard) benchmark reports ~45% accuracy with multi‑sampling.
  • Users informally compare it favorably against Qwen 2.5 32B and some earlier Mistral / local models, especially for code and local knowledge tasks.
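“Multi‑sampling” in evaluations like the MATH result above typically means drawing several answers per problem and taking a majority vote (self‑consistency). A minimal sketch of the voting step, with made‑up sample answers:

```python
from collections import Counter

def majority_vote(answers: list[str]) -> str:
    """Pick the most common final answer across sampled generations (self-consistency)."""
    return Counter(answers).most_common(1)[0][0]

# Five sampled answers to one problem (illustrative strings, not real model outputs):
samples = ["42", "42", "41", "42", "7"]
print(majority_vote(samples))  # → 42
```

Majority voting usually lifts accuracy over single-sample decoding on math tasks, at the cost of running inference several times per problem.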