Ask HN: What is the best LLM for consumer grade hardware?
No Single “Best” Model
- Commenters stress there is no universally best local LLM; quality varies heavily by task (chat, coding, math, RP, RAG, etc.).
- Strong advice: download several current models, build your own private benchmarks around your actual use cases, and choose empirically.
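  A minimal sketch of such a private benchmark harness, assuming an OpenAI-compatible local server (here Ollama's assumed default port 11434) and placeholder model tags and test cases; swap in your own prompts and pass/fail checks:

```python
# Minimal private-benchmark sketch. Assumptions: an OpenAI-compatible local
# server (e.g., Ollama or llama.cpp's llama-server) at BASE_URL, and the model
# tags in MODELS already pulled; prompts/checks are placeholders for your tasks.
import requests

BASE_URL = "http://localhost:11434/v1"   # assumed Ollama default; adjust for your stack
MODELS = ["qwen3:8b", "gemma3:12b"]      # hypothetical tags; use whatever you have locally

# Each case: (prompt, check) where check returns True if the answer is acceptable.
CASES = [
    ("What is 17 * 23? Answer with the number only.", lambda out: "391" in out),
    ("Write a Python one-liner that reverses a string s.", lambda out: "[::-1]" in out),
]

def ask(model: str, prompt: str) -> str:
    resp = requests.post(
        f"{BASE_URL}/chat/completions",
        json={"model": model, "messages": [{"role": "user", "content": prompt}]},
        timeout=300,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]

for model in MODELS:
    passed = sum(1 for prompt, check in CASES if check(ask(model, prompt)))
    print(f"{model}: {passed}/{len(CASES)} cases passed")
```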
Popular Local Models Mentioned
- Qwen3 family:
  - Qwen3-8B and the DeepSeek-R1-0528-Qwen3-8B distill praised for strong reasoning at the 8B size.
  - Qwen3-14B recommended as a good “main” model for 16GB VRAM (Q4 or FP8).
  - Qwen3-30B-A3B (MoE) cited as very strong yet usable on constrained VRAM via offload.
- Gemma3:
  - Gemma3-12B often cited as a good conversationalist, but with more hallucination and stronger safety filters.
- Mistral:
  - Mistral Small / Nemo / Devstral mentioned for coding, routing, and relatively uncensored behavior.
- Others:
  - Qwen2.5-Coder 14B for coding.
  - SmolVLM-500M for tiny setups.
  - LLaMA 3.x, Phi-4, and various “uncensored”/“abliterated” fine-tunes for people wanting fewer refusals.
- Live leaderboards (e.g., LiveBench for coding) suggested for up‑to‑date rankings.
Quantization, VRAM, and Context
- Core tradeoff: parameters vs quantization vs context length vs speed:
  - Rule of thumb: with 8GB VRAM, aim for roughly 7–8B parameters at Q4–Q6; with 16GB, a 14B dense model or a 30B MoE at Q4.
  - Very low-bit quantization (roughly 3–4 bits and below) can work if done carefully, but naive low-bit quants often produce repetition and instability.
- Context is expensive: each token’s key/value vectors are cached per layer, so large contexts quickly consume VRAM (see the estimator sketch after this list).
- CPU/RAM offload works but is much slower; some report offloading specific tensors or “hot” parts of the model as a promising optimization.
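  A back-of-the-envelope estimator for the weights-plus-KV-cache math above; the model shape and bits-per-weight figures are illustrative assumptions, and real runtimes add overhead on top:

```python
# Rough VRAM estimate: quantized weights plus KV cache.
# All numbers are approximate; real usage adds runtime overhead, activation
# buffers, and varies with the quant format and whether the KV cache is quantized.
def weights_gb(n_params_b: float, bits_per_weight: float) -> float:
    return n_params_b * 1e9 * bits_per_weight / 8 / 1e9

def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                context_len: int, bytes_per_elem: int = 2) -> float:
    # 2x for keys and values, stored per token per layer.
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem / 1e9

# Illustrative 8B-class dense model (hypothetical shape: 36 layers, 8 KV heads, head_dim 128).
w = weights_gb(8, 4.5)                            # ~Q4-style quant, ~4.5 bits/weight
kv = kv_cache_gb(36, 8, 128, context_len=32_768)  # 32k-token context
print(f"weights ~{w:.1f} GB, KV cache @32k ctx ~{kv:.1f} GB, total ~{w + kv:.1f} GB")
```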
Runtimes, Frontends, and Communities
- Common stacks: llama.cpp (and variants like KoboldCPP), vLLM, Ollama, LM Studio, OpenWebUI, GPT4All, Jan.ai, AnythingLLM, SillyTavern.
- LM Studio and OpenWebUI highlighted for ease of use; concerns raised about both being closed/proprietary now.
- Ollama praised as an easy model server that plays well with many UIs (a minimal client sketch follows this list); some prefer raw llama.cpp for transparency and faster model support.
- r/LocalLLaMA widely recommended for discovery and practices, but multiple comments warn about misinformation and upvote‑driven groupthink.
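  Most of these servers speak an OpenAI-compatible HTTP API, so one client script can target any of them; a minimal sketch, assuming Ollama on its default port with a hypothetical qwen3:14b tag pulled (llama.cpp's llama-server and LM Studio expose the same style of endpoint on their own ports):

```python
# Sketch: point the standard OpenAI Python client (v1.x) at a local server.
# Assumptions: Ollama at :11434 with "qwen3:14b" available; the api_key value
# is ignored by local servers but required by the client constructor.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="not-needed-locally")

reply = client.chat.completions.create(
    model="qwen3:14b",
    messages=[{"role": "user", "content": "Summarize the tradeoffs of Q4 vs Q8 quantization."}],
)
print(reply.choices[0].message.content)
```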
Why Run Locally vs Cloud
- Pro-local:
  - Privacy (personal notes, family data, schedules, proprietary corp data).
  - Uncensored behavior and fewer refusals.
  - Cost predictability and offline capability.
  - Learning, experimentation, and building custom agents / RAG systems.
- Pro-cloud:
  - Top proprietary models (Claude/Gemini/GPT‑4‑class) are still markedly better and cheap per query.
  - Local models can require many iterations, making them slower in “time to acceptable answer.”
Hardware Notes
- 8GB VRAM: 7–8B models at Q4–Q6; larger models with heavy offload if you accept slow speeds.
- 16GB VRAM: comfortable with Qwen3‑14B or similar at Q4–FP8; 30B MoE possible with offload (rough sizing sketch below).
- Many suggest a used 24GB card (e.g., 3090) if you’re serious; others argue cloud GPUs or APIs are more rational than buying high‑end GPUs.
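  A rough sizing sketch for these VRAM tiers, using approximate bits-per-weight for common quant levels and an assumed ~2 GB of headroom for KV cache and overhead; note that the 30B MoE still needs all 30B weights resident even though only ~3B are active per token, which is why offload is more tolerable for it:

```python
# Rough check of which quantized weight files fit in a given VRAM budget.
# Bits-per-weight values approximate common GGUF quant levels; the 2 GB of
# headroom for KV cache and runtime overhead is an assumption.
QUANT_BITS = {"Q4": 4.5, "Q6": 6.6, "Q8": 8.5}
MODELS_B = {"Qwen3-8B": 8, "Qwen3-14B": 14, "Qwen3-30B-A3B": 30}   # total params, billions

def size_gb(params_b: float, bits: float) -> float:
    return params_b * 1e9 * bits / 8 / 1e9

for budget in (8, 16, 24):
    fits = [f"{m} {q}" for m, p in MODELS_B.items() for q, b in QUANT_BITS.items()
            if size_gb(p, b) <= budget - 2]
    print(f"{budget} GB VRAM: {', '.join(fits) or 'nothing without offload'}")
```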