Mistral Small 3

Position in the AI landscape

  • Seen as Mistral’s move to stay relevant against OpenAI, DeepSeek, Qwen, Llama, etc.; some commenters say Mistral’s earlier models had fallen behind Llama.
  • Several comments compare it to GPT‑4o‑mini; some say performance is “on par or better,” others dismiss that tier as only good for chatty “fun” use.
  • Google’s Gemini line is repeatedly brought up as a quiet but very strong competitor; some claim Gemini 2.0 / exp models are now leading, others report regressions on long-context comprehension.

Model size, performance & hardware

  • 24B parameters hits a “sweet spot” for local use: when quantized, it fits on 24GB cards such as the RTX 4090, as well as high‑RAM Macs.
  • Reported speeds (quantized): ~14 tok/s on M2 Max 64GB, ~16 tok/s on 4090 laptop, ~20 tok/s on 7900 XTX, lower on M1 Pro.
  • Debate over VRAM vs. system RAM: many users can’t fit larger models; some would trade slower inference for bigger models, while others argue memory bandwidth, not capacity, is the real bottleneck.
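The memory-bandwidth point can be made concrete with a rough back-of-envelope calculation: during decoding, each generated token streams roughly the whole model through memory once, so bandwidth divided by quantized model size gives an upper bound on tokens per second. A minimal sketch (the bandwidth figure is an illustrative assumption, not a measured spec):

```python
def max_tokens_per_sec(params_b: float, bits_per_weight: float, bandwidth_gb_s: float) -> float:
    """Rough upper bound on decode speed for a memory-bandwidth-bound LLM.

    Each generated token reads (approximately) every weight once, so
    throughput <= bandwidth / model_bytes. Ignores KV cache, activations,
    and compute limits -- real speeds are lower.
    """
    model_gb = params_b * bits_per_weight / 8  # params_b is in billions, so this is GB
    return bandwidth_gb_s / model_gb

# Illustrative: a 24B model at 4-bit quantization (~12 GB of weights)
# on hardware with an assumed ~400 GB/s of memory bandwidth:
print(round(max_tokens_per_sec(24, 4, 400), 1))  # → 33.3
```

This is why reported speeds cluster by memory bandwidth (M2 Max vs. 7900 XTX vs. M1 Pro) rather than by raw compute.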

Training choices & synthetic data

  • Mistral states the model was trained without RL and without synthetic data; some find the absence of synthetic data “strange,” while others note complaints that synthetic‑heavy models overfit to STEM and struggle with fuzzier tasks.
  • People speculate about later RL-style reasoning finetunes (à la DeepSeek) on top of this base.

Licensing, “open source” and copyright

  • Announcement that general‑purpose models are moving back to Apache 2.0 is welcomed as a big win for local and commercial use.
  • Thread stresses this applies to weights; training code and datasets remain closed.
  • Long debate over whether model weights are copyrightable, and whether calling such releases “open source” is misleading:
    • One side: weights-only releases are akin to binaries; should be called “open weights,” not FOSS.
    • Other side: open weights are already hugely valuable (self‑hosting, fine‑tuning, commercialization) even without full data pipelines.

Use cases for “small” models

  • Suggested uses: local assistants, automated workflows, RAG, classification/tagging, ETL entity extraction, sentiment/feedback analysis, fraud detection, triage, on‑device control, coding assistance, structured JSON/tool calling.
  • Several practitioners say recent instruction-following improvements make small LLMs viable for many classification and extraction tasks, often after prompt tuning and benchmarking vs traditional ML.

Benchmarks & evaluations

  • One external evaluation on the MATH (hard) benchmark reports ~45% accuracy with multi‑sampling.
  • Users informally compare it favorably against Qwen 2.5 32B and some earlier Mistral / local models, especially for code and local knowledge tasks.
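“Multi‑sampling” in evaluations like the MATH result above typically means drawing several answers per problem and taking a majority vote (self‑consistency). A minimal sketch of the voting step, with made‑up sample answers:

```python
from collections import Counter

def majority_vote(answers: list[str]) -> str:
    """Pick the most common final answer across sampled generations (self-consistency)."""
    return Counter(answers).most_common(1)[0][0]

# Five sampled answers to one problem (illustrative strings, not real model outputs):
samples = ["42", "42", "41", "42", "7"]
print(majority_vote(samples))  # → 42
```

Majority voting usually lifts accuracy over single-sample decoding on math tasks, at the cost of running inference several times per problem.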