Mistral NeMo
Model overview & positioning
- Mistral NeMo is a 12B-parameter model co-developed with Nvidia, Apache 2.0–licensed, with a 128k-token context window and FP8-aware training for efficient inference.
- It’s marketed as state-of-the-art in its size class, outperforming Llama 3 8B and similar open models on vendor benchmarks, though some commenters note it has ~50% more parameters than 8B competitors (“parameter creep”).
- Some see 12B as a “sweet spot” for local use: more capable than 7–8B, still feasible on consumer hardware.
VRAM, quantization, and local use
- Rough rules of thumb discussed:
  - Weights need ~1 GB of VRAM per billion parameters at 8-bit and ~2 GB per billion at 16-bit, plus 20–40% overhead.
  - FP8-aware training aims for good quality at ~1 byte/parameter, but you still need headroom for the KV cache, especially with a 128k context.
- Reports:
  - 8 GB of VRAM is insufficient; 3060 Ti users hit out-of-memory (OOM) errors.
  - A 4090 (24 GB) can load the full model but may OOM on multi-turn chats in some setups.
  - Quantized (4–8 bit) variants should run on 12–16 GB GPUs; some users target MacBooks with large unified memory, or free Colab T4s for 4-bit QLoRA fine-tuning.
- Expect support via llama.cpp/Ollama/LM Studio, but tokenizer differences (Tekken vs. SentencePiece) mean extra work; it is not fully plug-and-play yet.
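The rules of thumb above can be turned into a quick back-of-the-envelope calculator. This is only a sketch of the arithmetic quoted in the thread, not a measurement. The function names are made up for illustration, and the layer count, KV-head count, and head dimension passed to `kv_cache_gb` are placeholder values assumed for the example, not figures confirmed in the discussion.

```python
def estimate_weights_gb(params_billion, bits_per_param):
    """Weights-only memory in GB: ~1 GB per billion params at 8-bit."""
    return params_billion * bits_per_param / 8


def estimate_vram_range_gb(params_billion, bits_per_param, overhead=(0.20, 0.40)):
    """Apply the 20-40% overhead rule of thumb to the weights-only figure."""
    weights = estimate_weights_gb(params_billion, bits_per_param)
    low, high = overhead
    return weights * (1 + low), weights * (1 + high)


def kv_cache_gb(n_layers, n_kv_heads, head_dim, context_tokens, bytes_per_value=2):
    """KV cache size in GB: keys and values (the factor 2) for every layer and token."""
    total_bytes = 2 * n_layers * n_kv_heads * head_dim * context_tokens * bytes_per_value
    return total_bytes / 1e9


# Weights plus overhead for a 12B model at common precisions.
for bits in (4, 8, 16):
    low, high = estimate_vram_range_gb(12, bits)
    print(f"12B at {bits}-bit: roughly {low:.1f} to {high:.1f} GB")

# ASSUMPTION: 40 layers, 8 KV heads, head dimension 128 are illustrative values only.
cache = kv_cache_gb(n_layers=40, n_kv_heads=8, head_dim=128, context_tokens=128_000)
print(f"KV cache at 128k tokens (16-bit values): roughly {cache:.1f} GB")
```

Under these assumed values, a full 128k-token cache is of the same order as the weights themselves, which is consistent with reports of a model loading fine and then running out of memory as a chat grows.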
Tokenizer & multilingual behavior
- NeMo introduces “Tekken,” a tiktoken/BPE-based tokenizer trained on 100+ languages, claimed to compress text better than previous SentencePiece setups.
- Discussion clarifies SentencePiece is a library that can also use BPE; the switch likely concerns engineering traits (latency, implementation) rather than fundamental compression gains.
- Several comments discuss multilingual “cross-over”: evidence and anecdotes suggest models can transfer facts across languages, but with quirks like the “reversal curse.”
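For readers unfamiliar with BPE, the algorithm underlying the tokenizer comparison above, here is a toy character-level sketch. It is not Tekken, tiktoken, or SentencePiece: production tokenizers typically operate on bytes, apply pre-tokenization rules, and train on far larger corpora. The function names `train_bpe`, `merge_pair`, and `encode` are hypothetical and exist only for this illustration.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace each adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    merged = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)


def train_bpe(text, num_merges):
    """Learn up to `num_merges` merges from a whitespace-split corpus."""
    # Each distinct word starts as a tuple of single characters, weighted by frequency.
    words = Counter(tuple(word) for word in text.split())
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols, freq in words.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        # Pick the most frequent adjacent pair (ties go to the first one seen).
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        updated = Counter()
        for symbols, freq in words.items():
            updated[merge_pair(symbols, best)] += freq
        words = updated
    return merges


def encode(word, merges):
    """Tokenize one word by replaying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


merges = train_bpe("low low low lower lowest", num_merges=2)
print(merges)
print(encode("lowest", merges))
```

Fewer tokens for the same text means better compression. The merge table depends entirely on the training corpus, which is one reason a tokenizer trained on more languages can compress multilingual text better regardless of which library implements it.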
Use cases, alignment, and business questions
- Coding remains the most emphasized specialized use case; other domains like legal/finance are seen as slower to adopt and more liability-sensitive.
- Some criticize alignment/safety layers on “open” models; others argue they’re necessary for reputational and legal reasons.
- Debate over open release: expensive training on thousands of H100s vs giving away weights under Apache 2.0. Many argue most users still pay for hosted convenience; others fear hyperscalers will repackage and outcompete.
- Benchmarks: some early user tests (e.g., the NYT Connections game) show NeMo trailing Gemma 2 and GPT‑4o mini; several commenters want LMSYS Chatbot Arena results before judging.