Gemma 2: Improving Open Language Models at a Practical Size [pdf]

Release & Availability

  • Gemma 2 comes in 2.6B, 9B, and 27B variants; 9B/27B are already on Ollama and Hugging Face, with GGUF quantizations from community members.
  • Support has been added to gemma.cpp, and the 27B IT (instruction-tuned) model is available in the Google AI Studio playground; API integration is described as partial or coming “soon”.

Architecture, Tokens & Training

  • Uses explicit special tokens (<bos>, <eos>, <start_of_turn>, <end_of_turn>).
  • Discussion emphasizes their role in training: when multiple short sequences are packed into fixed-length batches to reduce padding, the tokens mark example boundaries and help avoid cross-example leakage.
  • Some argue masking could replace them; others note BOS/EOS make large-scale data packing simpler and safer.
  • Distillation is a central technique: student models learn from teacher logits, effectively “learning a whole distribution” per step, which is argued to be like training on many more tokens.
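
The packing argument in the thread can be sketched roughly as follows. This is a minimal illustration, not Gemma's actual data pipeline: the token ids, `pack_sequences`, and `can_attend` are hypothetical names chosen for the example. Each example is wrapped in BOS/EOS, examples are packed greedily into fixed-length blocks, and a per-position segment id records which example a token came from, so a mask can block attention across examples.

```python
BOS, EOS, PAD = 1, 2, 0  # hypothetical token ids, not Gemma's real vocabulary


def pack_sequences(examples, block_size):
    """Greedily pack tokenized examples into fixed-length blocks.

    Returns (blocks, segments). segments[b][i] is the 1-based index of the
    example that position i of block b belongs to, or 0 for padding.
    """
    blocks, segments = [], []
    cur, seg, seg_id = [], [], 0

    def flush():
        pad = block_size - len(cur)
        blocks.append(cur + [PAD] * pad)
        segments.append(seg + [0] * pad)

    for ex in examples:
        wrapped = [BOS] + list(ex) + [EOS]  # explicit boundaries per example
        if len(wrapped) > block_size:
            raise ValueError("example longer than block_size")
        if len(cur) + len(wrapped) > block_size:
            flush()
            cur, seg, seg_id = [], [], 0
        seg_id += 1
        cur = cur + wrapped
        seg = seg + [seg_id] * len(wrapped)
    if cur:
        flush()
    return blocks, segments


def can_attend(segment_ids, i, j):
    """Causal attention restricted to positions from the same packed example."""
    return j <= i and segment_ids[i] != 0 and segment_ids[i] == segment_ids[j]


blocks, segments = pack_sequences([[10, 11], [12], [13, 14, 15]], block_size=8)
```

Here the first two examples share one block with a single padding slot, and `can_attend` stops the second example from looking back into the first. This is the "masking" position from the thread; the BOS/EOS markers are what make the boundaries recoverable from the token stream alone.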
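
To make the "learning a whole distribution" point concrete, here is a minimal sketch that contrasts the usual hard-label loss with a distillation loss for a single token position. It assumes the distillation objective is the cross-entropy of the student against the teacher's softmax over the vocabulary; the function names and the three-entry "vocabulary" are illustrative only, and the paper's exact training setup may differ.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def hard_label_loss(student_logits, target_id):
    """Standard next-token loss: only the observed token carries signal."""
    return -math.log(softmax(student_logits)[target_id])


def distillation_loss(student_logits, teacher_logits):
    """Cross-entropy of the student against the teacher's full distribution.

    Every vocabulary entry contributes, weighted by the teacher's probability.
    """
    p_teacher = softmax(teacher_logits)
    p_student = softmax(student_logits)
    return -sum(t * math.log(s) for t, s in zip(p_teacher, p_student))


teacher = [2.0, 1.0, 0.1]   # toy teacher logits over a 3-token vocabulary
student = [0.5, 0.5, 0.5]   # an untrained, uniform student
loss_hard = hard_label_loss(student, target_id=0)
loss_soft = distillation_loss(student, teacher)
loss_floor = distillation_loss(teacher, teacher)  # teacher entropy: best case
```

The hard-label loss tells the student only that token 0 was correct. The distillation loss also tells it how plausible tokens 1 and 2 were, which is the sense in which commenters describe each step as carrying more information.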

Comparisons to Phi‑3, Llama 3, Mistral

  • Multiple commenters feel Gemma 2 (especially 2.6B/9B) scores worse than Microsoft’s Phi‑3 in standard benchmarks.
  • Others counter that Phi‑3 appears overfit to benchmarks and underperforms in interactive settings (LMSYS Elo ratings, subjective tests).
  • Gemma 2–27B reportedly ranks near Llama‑3‑70B, GPT‑4, and Claude Sonnet on Chatbot Arena; some personal tests disagree and find Llama‑3‑70B clearly stronger.
  • Some accuse Gemma 2 of “parameter creep” (9B vs 7–8B peers), and views are mixed on whether the comparisons are fair.
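
For readers unfamiliar with Elo-style leaderboards, the sketch below shows the classic online Elo formulas. It only illustrates how a rating gap maps to an expected win rate; the K-factor of 32 and the function names are assumptions, and Chatbot Arena's actual rating procedure may differ from this textbook update.

```python
def expected_score(rating_a, rating_b):
    """Probability that A beats B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def elo_update(rating_a, rating_b, score_a, k=32):
    """Update both ratings after one comparison.

    score_a is 1.0 if A won, 0.0 if B won, 0.5 for a tie.
    k=32 is an assumed K-factor, not a value taken from any leaderboard.
    """
    e_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - e_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - e_a))
    return new_a, new_b


even_match = expected_score(1200, 1200)
after_win = elo_update(1200, 1200, score_a=1.0)
```

Under this model, two models whose ratings sit close together are expected to split human preferences almost evenly, which is one reason personal tests can disagree with a near-tie on the leaderboard.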

Context Window & Inference

  • 8K context with a 4K sliding window is seen as a speed–quality tradeoff; some criticize it as too small for serious RAG, others say “effective” 8K is preferable to nominal 32K with degradation.
  • Ollama currently halves context when hitting the limit, which appears to destabilize Gemma 2–27B in some setups; maintainers plan to change this to hard-limit behavior.
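
The speed side of the sliding-window tradeoff can be illustrated with a toy attention mask. This is a sketch under assumptions: it models only a local causal layer in which each token sees at most the last `window` positions, and the helper names are made up for the example. The paper describes alternating local and global layers, which this sketch does not model.

```python
def sliding_window_mask(seq_len, window):
    """mask[i][j] is True when query position i may attend to key position j.

    Causal (j <= i) and local (at most `window` positions back, inclusive of i).
    """
    return [
        [(j <= i) and (i - j < window) for j in range(seq_len)]
        for i in range(seq_len)
    ]


def attended_pairs(seq_len, window):
    """Number of (query, key) pairs a sliding-window layer computes."""
    return sum(min(i + 1, window) for i in range(seq_len))


mask = sliding_window_mask(seq_len=6, window=3)
local_cost = attended_pairs(6, 3)
global_cost = attended_pairs(6, 6)  # window == seq_len is plain causal attention

# Same ratio idea at the sizes discussed in the thread (8K context, 4K window).
ratio_8k = attended_pairs(8192, 4096) / attended_pairs(8192, 8192)
```

In the toy case the last token can no longer see the first three positions directly. That is the quality side of the tradeoff, and it is why information from early context has to reach later tokens through intermediate positions or through global layers.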

Quantization & Local Use

  • 4-bit GGUF versions are popular for local deployment; people debate how much quantization actually hurts quality, with papers cited suggesting subtle regressions, especially on factual tasks.
  • Subjective reports range from “indistinguishable from full precision” to visible degradation in edge cases.
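
As a rough intuition for where quantization error comes from, here is a simplified symmetric 4-bit round trip for one block of weights. It is not the actual GGUF layout (those formats use their own block structures and scale encodings); the function names and the sample weights are illustrative assumptions.

```python
def quantize_block(values, bits=4):
    """Symmetric quantization of one block to signed `bits`-bit integers.

    Returns (scale, q), where each original value is approximated by scale * q.
    """
    qmax = 2 ** (bits - 1) - 1            # 7 for 4-bit
    scale = max(abs(v) for v in values) / qmax or 1.0  # avoid a zero scale
    q = [max(-qmax - 1, min(qmax, round(v / scale))) for v in values]
    return scale, q


def dequantize_block(scale, q):
    """Map quantized integers back to approximate floating-point weights."""
    return [scale * x for x in q]


weights = [0.9, -0.31, 0.12, 0.0, -0.88, 0.45]  # made-up sample block
scale, q = quantize_block(weights)
restored = dequantize_block(scale, q)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
```

Each weight lands on one of at most 16 levels, so the per-weight error is bounded by half the step size. Whether those small errors add up to a visible quality drop depends on the model and the task, which is consistent with the mixed reports in the thread.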

Safety, Alignment & Capabilities

  • Self‑proliferation abilities (e.g., autonomously setting up remote LLMs) reportedly score 0/10 in internal tests; some argue this reflects incapability more than true alignment.
  • Training data is filtered for “unsafe” content; some see this as necessary liability management, others question value given the same information is easy to find via web search.

Benchmarks vs Real Use & Task Fit

  • Benchmarks show Gemma 2 9B lagging Phi‑3 Small on many academic tasks; users report Gemma 2 may be better as a general assistant and conversational model.
  • Mixed experiences on coding: some find Gemma 2 structured and pleasant (no verbose preamble), others report severe nonsense on long code outputs or heavy context.
  • Noted strength: 27B appears unusually strong at multilingual translation, including less common languages, though the paper does not emphasize this.

APIs, Licensing & Cloud UX

  • Licensing matches Gemma 1; terms are proprietary but permit broad use.
  • Some praise AI Studio’s simplicity vs Google Cloud / Vertex; many complain GCP billing, region restrictions, and documentation are confusing compared to OpenAI/Mistral-style APIs.
  • There is a preview OpenAI-compatible endpoint for Gemini on Vertex, but the token/auth model makes drop‑in use harder.