Gemma 3 – The strongest current model that fits on a single GPU
Benchmarking, Charts & “Strongest Model” Claims
- Several commenters see the promo bar chart as misleading: it omits higher-ranked models in the LMSYS arena, making Gemma 3 appear #2 overall.
- Critics argue that comparing “best open and closed models” while excluding top entries is disingenuous, especially since some excluded models (e.g. DeepSeek V3) feel clearly stronger in practice.
- This fuels broader distrust in vendor benchmarks and marketing; some say a few minutes of hands-on use shows how “broken” current benchmark culture is.
- Others note leaderboards like Hugging Face’s open-llm-leaderboard use narrow metrics that can be gamed and may not reflect general-purpose quality.
Real-World Performance & Use Cases
- Mixed experiences with previous Gemma versions; some found them underwhelming, especially for coding and tool calling, while others say Gemma 2 was good for writing.
- Early testers of Gemma 3 (especially 27B) report it can be extremely strong on Google AI Studio, including nontrivial coding tasks and structured reasoning.
- However, multiple users report significantly worse behavior via Ollama/Open WebUI (syntax errors, poor prompt adherence, generic explanations, weird language output).
- There’s interest in how Gemma 3 does on tool/function calling; some couldn’t get it working in Ollama at all.
Ollama, Templates, Quantization & Settings
- Several people warn against using Ollama for serious evaluation: it can silently truncate long prompts to its default context window instead of failing clearly.
- Others counter that the truncation does show up as warnings in the logs, and that explicitly setting the context length avoids the problem.
- Discrepancies between AI Studio and local runs are blamed on: quantization sensitivity, sampling parameters, chat templates, tokenizer quirks, and possibly bugs.
- Recommended Gemma 3 settings (from Unsloth / Gemma team) are around temperature 0.95, top_p 0.95, top_k 64, but some find much lower temperatures (e.g. 0.1 in Ollama) work better.
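One way to avoid the defaults-related discrepancies described above is to pin both the sampling parameters and the context length explicitly in the Ollama request. A minimal sketch, assuming Ollama's `/api/chat` endpoint and `options` fields; the model tag `gemma3:27b` and the `num_ctx` value are illustrative assumptions:

```python
import json

# Hypothetical helper: build an Ollama /api/chat payload with the
# Unsloth/Gemma-team suggested sampling settings and an explicit context
# length, so a local run doesn't silently fall back to defaults.
def build_ollama_request(model: str, prompt: str, num_ctx: int = 8192) -> dict:
    return {
        "model": model,  # e.g. "gemma3:27b" -- tag name is an assumption
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
        "options": {
            "temperature": 0.95,  # suggested range; some report 0.1 works better
            "top_p": 0.95,
            "top_k": 64,
            "num_ctx": num_ctx,   # explicit, to avoid silent context truncation
        },
    }

payload = build_ollama_request("gemma3:27b", "Explain top-k sampling in one sentence.")
print(json.dumps(payload, indent=2))
```

The same `options` dict can be POSTed to a running Ollama instance; pinning it per-request also makes AI Studio vs. local comparisons more apples-to-apples.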
System Prompts & Steerability
- Confusion over whether Gemma 3 supports system prompts: the official format doesn’t clearly expose them, and AI Studio doesn’t show a system field.
- GGUF templates appear to simply prepend the instructions to the first user message; this leads some to argue the system/user split is mostly a convention, while others insist distinct system messages matter for consistent behavior, tool calling, and prompt-injection resistance.
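The folding behavior described above can be sketched as a template renderer. This is an illustration of what the GGUF templates appear to do, not Google's reference implementation; the turn markers follow Gemma's documented chat format, which has user/model turns but no separate system role:

```python
# Sketch of the apparent GGUF-template behavior: Gemma's chat format has no
# system role, so any "system" text is folded into the first user turn.
def render_gemma_prompt(system: str, user: str) -> str:
    merged = f"{system}\n\n{user}" if system else user
    return (
        "<start_of_turn>user\n"
        f"{merged}<end_of_turn>\n"
        "<start_of_turn>model\n"
    )

prompt = render_gemma_prompt(
    system="You are a terse assistant.",
    user="Summarize Gemma 3 in one line.",
)
print(prompt)
```

Since the system text ends up as ordinary user-turn content, it carries no special token-level weight, which is exactly why some commenters doubt it resists prompt injection any better than instructions placed anywhere else in the message.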
Small vs Large, Multimodal vs Specialized
- Some prefer small, fast local models for specific tasks (summarization, simple coding, translation, D&D/roleplay, show-note generation), listing many strong 3–14B models (Qwen2.5, Mistral Small, Phi-4, various fine-tunes).
- Others find sub-7B models still “useless” for their needs.
- One line of criticism: multimodal “do everything” models waste parameters/VRAM when a user only needs text/code; specialized text-only models are seen as more practical for single-GPU setups.
Quantization & Inference Quality
- Debate over 4-bit vs 8/16-bit: some say 4-bit is “generally good” and 8-bit overkill; others argue that for modern, stronger models, heavy quantization severely hurts reliability, multilingual ability, and fine-tune knowledge.
- Newer training schemes (FP8, QAT) may change this, but real-world behavior is still being explored.
- There’s some praise for frameworks (llama.cpp, gemma.cpp, etc.) rapidly supporting Gemma 3, but also frustration with past Google-specific formats and fragmented tooling.
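The 4-bit vs. 8/16-bit trade-off above is easiest to see as back-of-envelope weight memory. A minimal sketch; it counts weights only and ignores KV cache, activations, and per-tensor quantization overhead (real GGUF files carry scales and zero-points, so actual sizes run somewhat higher):

```python
# Approximate weight memory for a model at various quantization widths.
def weight_gb(n_params: float, bits: int) -> float:
    return n_params * bits / 8 / 1e9

for bits in (16, 8, 4):
    print(f"27B @ {bits}-bit: ~{weight_gb(27e9, bits):.1f} GB")
# 16-bit needs ~54 GB and 8-bit ~27 GB, so only the ~13.5 GB 4-bit variant
# fits comfortably on a single 24 GB consumer GPU -- hence the debate:
# 4-bit is what makes "single GPU" true, but it is also where quality
# degradation is most contested.
```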
Hardware & Future of Local Models
- Discussion about what “fits on a single GPU” really means: people report running 27B models on consumer GPUs (A4000, 3090) at workable speeds, sometimes CPU-only at lower tok/s.
- One commenter predicts discrete GPUs are “finished” for AI in favor of high-RAM APUs (Apple M-series, AMD Strix Halo); others call this unrealistic given gaming, CUDA dominance, and cost.
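The GPU-vs-APU argument largely comes down to memory bandwidth, since single-stream decoding is memory-bound: each generated token streams the full weight set through memory, so bandwidth over weight size gives a rough speed ceiling. A sketch under that assumption; the bandwidth figures are approximate public specs, not measurements:

```python
# Rough upper bound on decode speed for a memory-bound model:
# tok/s <= memory bandwidth / bytes of weights read per token.
def max_tok_per_s(weight_gb: float, bandwidth_gb_s: float) -> float:
    return bandwidth_gb_s / weight_gb

weights = 13.5  # 27B at 4-bit, weights only
for name, bw in [
    ("RTX 3090, ~936 GB/s GDDR6X", 936.0),          # approximate spec
    ("dual-channel DDR5 CPU, ~80 GB/s", 80.0),      # approximate spec
]:
    print(f"{name}: <= {max_tok_per_s(weights, bw):.0f} tok/s")
```

This ceiling is why a 27B model is "workable" on a 3090 but crawls CPU-only, and also why high-bandwidth unified-memory APUs are plausible contenders despite the CUDA-ecosystem objections.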
Motivations for Open & Local Models
- Drivers for local/open models: privacy (personal documents, email, browser automation), PII handling, avoiding censorship, and avoiding API lock-in or deprecation.
- Even if most can’t self-host giant models like DeepSeek R1 today, having weights available is seen as strategic insurance for businesses.
- Some contrast this with big closed providers (OpenAI, Anthropic) that rarely release weights, though older releases like Whisper are acknowledged.