2024-06-27

Gemma 2: Improving Open Language Models at a Practical Size [pdf]

Release & Availability

Gemma 2 comes in 2.6B, 9B, and 27B variants; 9B/27B are already on Ollama and Hugging Face, with GGUF quantizations from community members.
Support has been added to gemma.cpp, and the 27B IT (instruction-tuned) model is available in the Google AI Studio playground; API integration is described as coming “soon” and is only partial for now.

Architecture, Tokens & Training

Gemma 2 uses explicit special tokens (<bos>, <eos>, <start_of_turn>, <end_of_turn>).
The discussion emphasizes their role in training: they mark example boundaries when multiple short sequences are packed into fixed-length batches, which reduces padding while helping to avoid cross-example leakage.
Some argue masking could replace them; others note BOS/EOS make large-scale data packing simpler and safer.
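The packing idea from the discussion can be sketched as follows. This is a minimal illustration, not Gemma's actual data pipeline: the token ids, the greedy packing strategy, and the helper names (pack_sequences, may_attend) are all assumptions made for the example. It shows both positions in the debate: BOS/EOS mark the boundaries inside a packed block, and a segment-id mask is the alternative (or complement) that blocks attention across examples.

```python
from typing import Dict, List

BOS, EOS, PAD = 1, 2, 0  # hypothetical token ids, not Gemma's real vocabulary


def _finish(tokens: List[int], segments: List[int], block_size: int) -> Dict[str, List[int]]:
    pad = block_size - len(tokens)
    # Padding gets segment id 0 so it never matches a real example.
    return {"tokens": tokens + [PAD] * pad, "segments": segments + [0] * pad}


def pack_sequences(examples: List[List[int]], block_size: int) -> List[Dict[str, List[int]]]:
    """Greedily pack tokenized examples into fixed-length blocks.

    Each example is wrapped as BOS + tokens + EOS. A parallel list of
    segment ids records which example each position came from.
    """
    blocks: List[Dict[str, List[int]]] = []
    tokens: List[int] = []
    segments: List[int] = []
    seg_id = 0
    for ex in examples:
        wrapped = [BOS] + ex + [EOS]
        if len(wrapped) > block_size:
            raise ValueError("example longer than block_size; split it first")
        # Start a new block when the wrapped example no longer fits.
        if len(tokens) + len(wrapped) > block_size:
            blocks.append(_finish(tokens, segments, block_size))
            tokens, segments, seg_id = [], [], 0
        seg_id += 1
        tokens = tokens + wrapped
        segments = segments + [seg_id] * len(wrapped)
    if tokens:
        blocks.append(_finish(tokens, segments, block_size))
    return blocks


def may_attend(segments: List[int], q: int, k: int) -> bool:
    """Causal attention restricted to the same non-padding segment."""
    return k <= q and segments[q] == segments[k] and segments[q] != 0


blocks = pack_sequences([[10, 11], [20, 21, 22], [30]], block_size=10)
for block in blocks:
    print(block["tokens"], block["segments"])
```

With only BOS/EOS and no segment mask, the model still sees earlier examples in the block and has to learn to ignore them; with the mask, that separation is enforced rather than learned.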
Distillation is a central technique: student models learn from the teacher's logits, effectively “learning a whole distribution” at each step, which commenters argue is comparable to training on many more tokens.
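To make the “whole distribution per step” point concrete, here is a small sketch comparing an ordinary hard-label loss with a distillation loss at a single token position. It uses KL divergence between teacher and student softmax outputs, which is one common formulation and not necessarily the exact objective used for Gemma 2; the logits and the 4-token vocabulary are made up for illustration.

```python
import math
from typing import List


def softmax(logits: List[float], temperature: float = 1.0) -> List[float]:
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def hard_label_loss(student_logits: List[float], target: int) -> float:
    """Next-token cross-entropy against a single observed token."""
    return -math.log(softmax(student_logits)[target])


def distillation_loss(
    student_logits: List[float],
    teacher_logits: List[float],
    temperature: float = 1.0,
) -> float:
    """KL(teacher || student) over the whole vocabulary at one position."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


teacher = [2.0, 1.5, -1.0, -3.0]  # toy logits over a 4-token vocabulary
student = [0.5, 0.2, 0.1, 0.0]

print("hard-label loss:  ", hard_label_loss(student, target=0))
print("distillation loss:", distillation_loss(student, teacher))
```

The hard-label loss only says “token 0 was correct here”. The distillation loss also tells the student that token 1 was nearly as plausible and that tokens 2 and 3 were not, which is the extra signal per step that commenters refer to.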

Comparisons to Phi‑3, Llama 3, Mistral

Multiple commenters feel Gemma 2 (especially 2.6B/9B) scores worse than Microsoft’s Phi‑3 on standard benchmarks.
Others counter that Phi‑3 appears overfit to benchmarks and underperforms in interactive settings (LMSYS Elo, subjective tests).
Gemma 2 27B reportedly ranks near Llama‑3‑70B, GPT‑4, and Claude Sonnet on Chatbot Arena; some personal tests disagree and find Llama‑3‑70B clearly stronger.
There are accusations of “parameter creep” (a 9B model compared against 7–8B peers), and views are mixed on whether the comparisons are fair.
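For readers weighing “ranks near” against “clearly stronger in my tests”, the standard Elo expected-score formula helps put rating gaps in perspective. The ratings below are hypothetical and are not taken from any leaderboard, and Chatbot Arena's exact rating procedure may differ from this textbook form.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A is preferred over B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


# Hypothetical rating gaps, for illustration only.
for gap in (0, 20, 100, 400):
    print(f"gap {gap:>3}: A preferred {expected_score(1200 + gap, 1200):.3f} of the time")
```

A 20-point gap corresponds to roughly a 53/47 split in pairwise preferences, so two models that sit close together on the leaderboard can easily look different in a handful of personal tests.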

Context Window & Inference

8K context with a 4K sliding window is seen as a speed–quality tradeoff; some criticize it as too small for serious RAG, others say “effective” 8K is preferable to nominal 32K with degradation.
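The tradeoff is easier to see with a toy attention mask. The sketch below scales the sizes down (8 positions with a window of 4, standing in for 8K and 4K) and compares full causal attention with sliding-window attention; the function names are illustrative. As the paper describes it, Gemma 2 alternates local sliding-window layers with global layers, so information from outside the window can still reach a token through the global layers.

```python
from typing import List, Optional


def causal_mask(seq_len: int, window: Optional[int] = None) -> List[List[bool]]:
    """mask[q][k] is True when query position q may attend to key position k."""
    mask = []
    for q in range(seq_len):
        row = []
        for k in range(seq_len):
            visible = k <= q  # causal: never look ahead
            if window is not None:
                visible = visible and (q - k) < window  # local: only the last `window` keys
            row.append(visible)
        mask.append(row)
    return mask


def render(mask: List[List[bool]]) -> str:
    return "\n".join("".join("x" if v else "." for v in row) for row in mask)


global_mask = causal_mask(8)           # stands in for the full 8K span
local_mask = causal_mask(8, window=4)  # stands in for the 4K sliding window

print("global:\n" + render(global_mask))
print("local:\n" + render(local_mask))
```

Each row of the local mask has at most `window` visible keys, so per-token attention cost in those layers stops growing once the sequence is longer than the window. That is the speed side of the tradeoff; the quality side is that local layers cannot see older tokens directly.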
Ollama currently halves the context when the limit is hit, which appears to destabilize Gemma 2 27B in some setups; maintainers plan to switch to hard-limit behavior.
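The difference between the two overflow policies can be illustrated with a short sketch. This is not Ollama's code: the function names, the `keep` parameter, and the choice to raise an error in the hard-limit case are all assumptions made to show the general idea.

```python
from typing import List


def halve_on_overflow(history: List[int], new_tokens: List[int], limit: int, keep: int = 0) -> List[int]:
    """Hypothetical policy: on overflow, keep the first `keep` tokens and
    drop the older half of everything after them.

    Halves once, so it assumes the overflow is small relative to the limit.
    """
    tokens = history + new_tokens
    if len(tokens) <= limit:
        return tokens
    head, tail = tokens[:keep], tokens[keep:]
    return head + tail[len(tail) // 2:]


def hard_limit(history: List[int], new_tokens: List[int], limit: int) -> List[int]:
    """Hypothetical policy: refuse to go past the limit instead of dropping context."""
    tokens = history + new_tokens
    if len(tokens) > limit:
        raise ValueError(f"context limit of {limit} tokens exceeded")
    return tokens


history = list(range(8))  # token ids 0..7 already in context
print(halve_on_overflow(history, [8], limit=8, keep=2))  # -> [0, 1, 5, 6, 7, 8]
```

In the halving case a single extra token silently removes tokens 2 through 4 from the middle of the conversation, which is the kind of abrupt context change that could plausibly destabilize generation. The hard-limit variant fails loudly instead.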

Quantization & Local Use

4-bit GGUF versions are popular for local deployment; people debate how much quantization actually hurts quality, with papers cited suggesting subtle regressions, especially on factual tasks.
Subjective reports range from “indistinguishable from full precision” to visible degradation in edge cases.
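Why quantization error is small but not zero can be shown with a simple round trip. The sketch below uses symmetric absmax quantization on one block of made-up weights; it is a simplification and not the actual GGUF storage layout, which has several block-wise variants.

```python
import math
from typing import List, Tuple


def quantize_block(weights: List[float], bits: int = 4) -> Tuple[List[int], float]:
    """Symmetric absmax quantization of one block to signed integers."""
    qmax = 2 ** (bits - 1) - 1  # 7 for 4-bit, 127 for 8-bit
    absmax = max(abs(w) for w in weights)
    scale = absmax / qmax if absmax > 0 else 1.0
    q = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return q, scale


def dequantize_block(q: List[int], scale: float) -> List[float]:
    return [v * scale for v in q]


def rms_error(weights: List[float], bits: int) -> float:
    """Root-mean-square difference between original and round-tripped weights."""
    q, scale = quantize_block(weights, bits)
    restored = dequantize_block(q, scale)
    return math.sqrt(sum((w - r) ** 2 for w, r in zip(weights, restored)) / len(weights))


block = [0.12, -0.03, 0.50, -0.41, 0.07, 0.00, -0.22, 0.33]  # made-up weights

for bits in (8, 4):
    print(f"{bits}-bit RMS error: {rms_error(block, bits):.5f}")
```

Each weight moves by at most half a quantization step, so most outputs barely change, but the errors accumulate across billions of weights. That is consistent with both kinds of report in the thread: usually indistinguishable, with occasional visible regressions.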

Safety, Alignment & Capabilities

Self‑proliferation abilities (e.g., autonomously setting up remote LLMs) reportedly score 0/10 in internal tests; some argue this reflects incapability more than true alignment.
Training data is filtered for “unsafe” content; some see this as necessary liability management, others question its value given that the same information is easy to find via web search.

Benchmarks vs Real Use & Task Fit

Benchmarks show Gemma 2 9B lagging Phi‑3 Small on many academic tasks; users report Gemma 2 may be better as a general assistant and conversational model.
Mixed experiences on coding: some find Gemma 2 structured and pleasant (no verbose preamble), others report severe nonsense on long code outputs or heavy context.
Noted strength: 27B appears unusually strong at multilingual translation, including less common languages, though the paper does not emphasize this.

APIs, Licensing & Cloud UX

Licensing matches Gemma 1; terms are proprietary but permit broad use.
Some praise AI Studio’s simplicity compared with Google Cloud / Vertex; many complain that GCP billing, region restrictions, and documentation are confusing compared to OpenAI/Mistral-style APIs.
There is a preview OpenAI-compatible endpoint for Gemini on Vertex, but its token/auth model makes drop‑in use harder.
