2024-07-18

Mistral NeMo

Model overview & positioning

Mistral NeMo is a 12B-parameter model co-developed with Nvidia, Apache 2.0–licensed, with a 128k-token context window and FP8-aware training for efficient inference.
It’s marketed as state-of-the-art in its size class, outperforming Llama 3 8B and similar open models on vendor benchmarks, though some commenters note that it has ~50% more parameters than its 8B competitors (“parameter creep”).
Some see 12B as a “sweet spot” for local use: more capable than 7–8B, still feasible on consumer hardware.

VRAM, quantization, and local use

Rough rules of thumb discussed:
- ~1 GB VRAM per billion parameters at 8-bit and ~2 GB per billion at 16-bit, plus 20–40% overhead.
- FP8-aware training aims for good quality at ~1 byte per parameter; the KV cache still needs headroom on top of that, especially at 128k context.
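The rules of thumb above amount to simple arithmetic, sketched below. The function names are illustrative only, 1 GB is taken as 10^9 bytes, and applying the same 20–40% overhead to 4-bit weights is an assumption (the thread only states it for 8-bit and 16-bit).

```python
def weight_memory_gb(params_billions: float, bits_per_param: int) -> float:
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes).

    (params_billions * 1e9 params) * (bits / 8 bytes) / 1e9 simplifies to this.
    """
    return params_billions * bits_per_param / 8


def vram_range_gb(params_billions: float, bits_per_param: int,
                  overhead=(0.20, 0.40)):
    """Weights plus the 20-40% overhead band quoted in the discussion."""
    base = weight_memory_gb(params_billions, bits_per_param)
    low, high = overhead
    return base * (1 + low), base * (1 + high)


if __name__ == "__main__":
    # 12 (billion parameters) is the model size discussed in this thread.
    for bits in (16, 8, 4):
        low, high = vram_range_gb(12, bits)
        weights = weight_memory_gb(12, bits)
        print(f"{bits:>2}-bit: {weights:.1f} GB weights, "
              f"{low:.1f} to {high:.1f} GB with overhead")
```

For a 12B model this gives roughly 12 GB of weights at 8-bit and 24 GB at 16-bit before overhead. These figures do not include the KV cache.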
Reports:
- 8 GB VRAM is insufficient; 3060 Ti users hit OOM.
- 4090 (24 GB) can load the full model but may OOM on multi-turn chats in some setups.
Quantized (4–8 bit) variants should run on 12–16 GB GPUs; some users target MacBooks with large unified memory or free Colab T4s with 4-bit QLoRA fine-tuning.
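To see why long chats can exhaust memory even when the weights fit, here is a rough KV-cache estimate. The architecture values (40 layers, 8 key/value heads, head dimension 128) and the 16-bit cache are assumptions not stated in the discussion, so check them against the model's published config before relying on the numbers. "128k" is taken as 131,072 tokens.

```python
GIB = 1024 ** 3


def kv_cache_bytes(seq_len: int,
                   n_layers: int = 40,      # assumed, not from the thread
                   n_kv_heads: int = 8,     # assumed (grouped-query attention)
                   head_dim: int = 128,     # assumed
                   bytes_per_elem: int = 2,  # assumed 16-bit cache entries
                   batch: int = 1) -> int:
    """Bytes needed to cache keys and values for seq_len tokens.

    The leading factor of 2 accounts for one K and one V tensor per layer.
    """
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return per_token * seq_len * batch


if __name__ == "__main__":
    for tokens in (8_192, 32_768, 131_072):
        size = kv_cache_bytes(tokens) / GIB
        print(f"{tokens:>7} tokens: {size:.2f} GiB of KV cache")
```

Under these assumed values the cache costs 160 KiB per token, which is about 20 GiB at a full 131,072-token context. That would be consistent with the reports of a 24 GB card loading the model but running out of memory as a conversation grows.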
Expect support via llama.cpp, Ollama, and LM Studio, but the tokenizer difference (Tekken vs SentencePiece) means extra integration work; it is not fully plug-and-play yet.

Tokenizer & multilingual behavior

NeMo introduces “Tekken,” a tiktoken-based BPE tokenizer trained on 100+ languages and claimed to compress text better than previous SentencePiece setups.
Discussion clarifies SentencePiece is a library that can also use BPE; the switch likely concerns engineering traits (latency, implementation) rather than fundamental compression gains.
Several comments discuss multilingual “cross-over”: evidence and anecdotes suggest models can transfer facts across languages, but with quirks like the “reversal curse.”

Use cases, alignment, and business questions

Coding remains the most emphasized specialized use case; other domains like legal/finance are seen as slower to adopt and more liability-sensitive.
Some criticize alignment/safety layers on “open” models; others argue they’re necessary for reputational and legal reasons.
There is debate over the open release: why pay for expensive training on thousands of H100s and then give the weights away under Apache 2.0? Many argue most users will still pay for hosted convenience; others fear hyperscalers will repackage the model and outcompete Mistral.
Benchmarks: some early user tests (e.g., NYT Connections game) show NeMo trailing Gemma 2 and GPT‑4o mini; several commenters want LMSYS/arena results before judging.

Related topics