Gemma 2: Improving Open Language Models at a Practical Size [pdf]
Release & Availability
- Gemma 2 comes in 2.6B, 9B, and 27B variants; 9B/27B are already on Ollama and Hugging Face, with GGUF quantizations from community members.
- Support has been added to gemma.cpp, and the 27B IT model is available in the Google AI Studio playground; API integration is described as coming “soon” and is only partial so far.
Architecture, Tokens & Training
- Uses explicit special tokens (<bos>, <eos>, <start_of_turn>, <end_of_turn>).
- Discussion emphasizes their role in training: packing multiple short sequences into fixed-length batches to reduce padding and avoid cross-example leakage.
- Some argue masking could replace them; others note BOS/EOS make large-scale data packing simpler and safer.
- Distillation is a central technique: student models learn from teacher logits, effectively “learning a whole distribution” per step, which is argued to be like training on many more tokens.
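
The packing and distillation points above can be made concrete with a small sketch. It assumes string tokens, a hypothetical `pack_sequences` helper that wraps each example in <bos>/<eos> and records segment ids for masking, and a hypothetical `distillation_loss` that scores the student against the teacher's full next-token distribution. None of these names come from the paper or the thread, and a real pipeline would work on tensors of token ids.

```python
import math

BOS, EOS = "<bos>", "<eos>"


def pack_sequences(examples, block_len):
    """Concatenate examples (each wrapped in BOS/EOS) and cut into fixed-length blocks.

    Returns (blocks, segments). segments[b][t] is the index of the example that
    token t of block b came from, so attention can be masked across examples.
    Tokens that do not fill a final block are dropped in this sketch.
    """
    stream, segment_ids = [], []
    for seg_id, tokens in enumerate(examples):
        wrapped = [BOS, *tokens, EOS]
        stream.extend(wrapped)
        segment_ids.extend([seg_id] * len(wrapped))
    n_full = len(stream) // block_len
    blocks = [stream[i * block_len:(i + 1) * block_len] for i in range(n_full)]
    segments = [segment_ids[i * block_len:(i + 1) * block_len] for i in range(n_full)]
    return blocks, segments


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def distillation_loss(teacher_logits, student_logits):
    """Cross-entropy of the student against the teacher's whole distribution."""
    p_teacher = [math.exp(lp) for lp in log_softmax(teacher_logits)]
    log_p_student = log_softmax(student_logits)
    return -sum(pt * lps for pt, lps in zip(p_teacher, log_p_student))


blocks, segments = pack_sequences([["a", "b"], ["c"], ["d", "e", "f"]], block_len=6)
for block, seg in zip(blocks, segments):
    print(block, seg)
```

Because the teacher supplies a probability for every vocabulary entry rather than a single target token, each position carries a denser training signal, which is the intuition behind the "many more tokens" remark.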
Comparisons to Phi‑3, Llama 3, Mistral
- Multiple commenters feel Gemma 2 (especially 2.6B/9B) scores worse than Microsoft’s Phi‑3 on standard benchmarks.
- Others counter that Phi‑3 appears overfit to benchmarks and underperforms in interactive settings (LMSYS Elo ratings, subjective tests).
- Gemma 2–27B reportedly ranks near Llama‑3‑70B, GPT‑4, and Claude Sonnet on Chatbot Arena; some personal tests disagree and find Llama‑3‑70B clearly stronger.
- Some accuse Gemma 2 of “parameter creep” (9B vs 7–8B peers), and views are mixed on whether the comparisons are fair.
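
To read "ranks near" in Arena terms, it can help to convert a rating gap into an expected head-to-head win rate. The sketch below uses the standard Elo expected-score formula. The ratings are made-up placeholders, not leaderboard values, and Chatbot Arena's published scores may be computed with a different statistical fit, so treat this as a rough guide only.

```python
def elo_expected_score(rating_a, rating_b):
    """Expected score (win probability, counting draws as half) of A against B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


# Hypothetical ratings, chosen only to illustrate small vs large gaps.
for gap in (10, 50, 100):
    p = elo_expected_score(1200 + gap, 1200)
    print(f"gap of {gap:>3} points: expected score {p:.3f}")
```

Under this formula a gap of a few points is close to a coin flip in pairwise votes, which is consistent with personal tests disagreeing about which model is stronger.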
Context Window & Inference
- 8K context with a 4K sliding window is seen as a speed–quality tradeoff; some criticize it as too small for serious RAG, others say “effective” 8K is preferable to nominal 32K with degradation.
- Ollama currently halves context when hitting the limit, which appears to destabilize Gemma 2–27B in some setups; maintainers plan to change this to hard-limit behavior.
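
The speed–quality tradeoff of a sliding window is easier to see as an attention mask. This is a minimal sketch, assuming a causal mask in which each position sees only the most recent `window` positions, itself included. The function names are hypothetical, and the sketch does not model how Gemma 2 combines this with other layer types.

```python
def sliding_window_mask(seq_len, window):
    """mask[i][j] is True when query position i may attend to key position j.

    A key is visible if it is not in the future (j <= i) and lies within
    the last `window` positions (i - j < window).
    """
    return [[(j <= i) and (i - j < window) for j in range(seq_len)]
            for i in range(seq_len)]


def attention_span(i, window):
    """Key positions visible from query position i, as a range."""
    return range(max(0, i - window + 1), i + 1)


# Small mask for inspection: 6 tokens, window of 3.
for row in sliding_window_mask(6, 3):
    print("".join("x" if visible else "." for visible in row))

# With an 8K context and a 4K window, the last token sees half the sequence.
print(len(attention_span(8191, 4096)))
```

Per-token attention cost is bounded by the window rather than the full context, which is the speed side of the tradeoff. Information from earlier tokens has to reach later positions indirectly, which is the quality side.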
Quantization & Local Use
- 4-bit GGUF versions are popular for local deployment; people debate how much quantization actually hurts quality, with papers cited suggesting subtle regressions, especially on factual tasks.
- Subjective reports range from “indistinguishable from full precision” to visible degradation in edge cases.
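
For a feel of where quantization error comes from, here is a simplified round trip through 4-bit integers. It assumes block-wise symmetric quantization with one scale per block. The 4-bit GGUF formats use more elaborate layouts, so this is an illustration of the general idea, not what any particular runtime implements; the function names and block size are assumptions.

```python
def quantize_4bit(weights, block_size=32):
    """Quantize floats to signed 4-bit integers in [-8, 7], one scale per block."""
    blocks = []
    for start in range(0, len(weights), block_size):
        block = weights[start:start + block_size]
        peak = max(abs(w) for w in block)
        scale = peak / 7 if peak > 0 else 1.0  # map the largest magnitude to 7
        quants = [max(-8, min(7, round(w / scale))) for w in block]
        blocks.append((scale, quants))
    return blocks


def dequantize(blocks):
    """Reconstruct approximate floats from (scale, quants) blocks."""
    return [scale * q for scale, quants in blocks for q in quants]


weights = [0.0, 0.3, -1.0, 0.9, 0.05, -0.6]
restored = dequantize(quantize_4bit(weights, block_size=3))
worst = max(abs(a - b) for a, b in zip(weights, restored))
print(f"worst absolute error: {worst:.4f}")
```

Each weight can be off by up to half a quantization step, and the step grows with the largest value in its block. Whether that shows up as a visible regression depends on the model and the task, which matches the range of reports above.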
Safety, Alignment & Capabilities
- Self‑proliferation abilities (e.g., autonomously setting up remote LLMs) reportedly score 0/10 in internal tests; some argue this reflects incapability more than true alignment.
- Training data is filtered for “unsafe” content; some see this as necessary liability management, others question value given the same information is easy to find via web search.
Benchmarks vs Real Use & Task Fit
- Benchmarks show Gemma 2 9B lagging Phi‑3 Small on many academic tasks; users report Gemma 2 may be better as a general assistant and conversational model.
- Mixed experiences on coding: some find Gemma 2 structured and pleasant (no verbose preamble), others report severe nonsense on long code outputs or heavy context.
- Noted strength: 27B appears unusually strong at multilingual translation, including less common languages, though the paper does not emphasize this.
APIs, Licensing & Cloud UX
- Licensing matches Gemma 1; terms are proprietary but permit broad use.
- Some praise AI Studio’s simplicity vs Google Cloud / Vertex; many complain GCP billing, region restrictions, and documentation are confusing compared to OpenAI/Mistral-style APIs.
- There is a preview OpenAI-compatible endpoint for Gemini on Vertex, but its token/auth model makes drop‑in use harder.