Ask HN: What is the best LLM for consumer grade hardware?
No Single “Best” Model
- Commenters stress there is no universally best local LLM; quality varies heavily by task (chat, coding, math, RP, RAG, etc.).
- Strong advice: download several current models, build your own private benchmarks around your actual use cases, and choose empirically.
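  A minimal sketch of such a private benchmark harness, assuming an OpenAI-compatible local server (here Ollama's assumed default port 11434) and placeholder model tags and test cases; swap in your own prompts and pass/fail checks:

```python
# Minimal private-benchmark sketch. Assumptions: an OpenAI-compatible local
# server (e.g., Ollama or llama.cpp's llama-server) at BASE_URL, and the model
# tags in MODELS already pulled; prompts/checks are placeholders for your tasks.
import requests

BASE_URL = "http://localhost:11434/v1"   # assumed Ollama default; adjust for your stack
MODELS = ["qwen3:8b", "gemma3:12b"]      # hypothetical tags; use whatever you have locally

# Each case: (prompt, check) where check returns True if the answer is acceptable.
CASES = [
    ("What is 17 * 23? Answer with the number only.", lambda out: "391" in out),
    ("Write a Python one-liner that reverses a string s.", lambda out: "[::-1]" in out),
]

def ask(model: str, prompt: str) -> str:
    resp = requests.post(
        f"{BASE_URL}/chat/completions",
        json={"model": model, "messages": [{"role": "user", "content": prompt}]},
        timeout=300,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]

for model in MODELS:
    passed = sum(1 for prompt, check in CASES if check(ask(model, prompt)))
    print(f"{model}: {passed}/{len(CASES)} cases passed")
```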
Popular Local Models Mentioned
- Qwen3 family:
  - Qwen3-8B and the DeepSeek-R1-0528-Qwen3-8B distill praised for strong reasoning at the 8B size.
  - Qwen3-14B recommended as a good “main” model for 16GB VRAM (Q4 or FP8).
  - Qwen3-30B-A3B (MoE) cited as very strong yet usable on constrained VRAM via offload.
- Gemma3:
  - Gemma3-12B often cited as a good conversationalist, but with more hallucination and stronger safety filters.
- Mistral:
  - Mistral Small / Nemo / Devstral mentioned for coding, routing, and relatively uncensored behavior.
- Others:
  - Qwen2.5-Coder 14B for coding.
  - SmolVLM-500M for tiny setups.
  - LLaMA 3.x, Phi-4, and various “uncensored”/“abliterated” fine-tunes for people wanting fewer refusals.
- Live leaderboards (e.g., LiveBench for coding) suggested for up‑to‑date rankings.
Quantization, VRAM, and Context
- Core tradeoff: parameters vs quantization vs context length vs speed:
  - Rule of thumb: with 8GB VRAM, aim for roughly 7–8B parameters at Q4–Q6; with 16GB, a 14B dense model or a 30B MoE at Q4.
  - Very low-bit quantization (roughly 3–4 bits and below) can work if done carefully, but naive low-bit quants often produce repetition and instability.
- Context is expensive: each token’s key/value vectors are cached per layer, so large contexts quickly consume VRAM (see the estimator sketch after this list).
- CPU/RAM offload works but is much slower; some report offloading specific tensors or “hot” parts of the model as a promising optimization.
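  A back-of-the-envelope estimator for the weights-plus-KV-cache math above; the model shape and bits-per-weight figures are illustrative assumptions, and real runtimes add overhead on top:

```python
# Rough VRAM estimate: quantized weights plus KV cache.
# All numbers are approximate; real usage adds runtime overhead, activation
# buffers, and varies with the quant format and whether the KV cache is quantized.
def weights_gb(n_params_b: float, bits_per_weight: float) -> float:
    return n_params_b * 1e9 * bits_per_weight / 8 / 1e9

def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                context_len: int, bytes_per_elem: int = 2) -> float:
    # 2x for keys and values, stored per token per layer.
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem / 1e9

# Illustrative 8B-class dense model (hypothetical shape: 36 layers, 8 KV heads, head_dim 128).
w = weights_gb(8, 4.5)                            # ~Q4-style quant, ~4.5 bits/weight
kv = kv_cache_gb(36, 8, 128, context_len=32_768)  # 32k-token context
print(f"weights ~{w:.1f} GB, KV cache @32k ctx ~{kv:.1f} GB, total ~{w + kv:.1f} GB")
```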
Runtimes, Frontends, and Communities
- Common stacks: llama.cpp (and variants like KoboldCPP), vLLM, Ollama, LM Studio, OpenWebUI, GPT4All, Jan.ai, AnythingLLM, SillyTavern.
- LM Studio and OpenWebUI highlighted for ease of use; concerns raised about both being closed/proprietary now.
- Ollama praised as an easy model server that plays well with many UIs (a minimal client sketch follows this list); some prefer raw llama.cpp for transparency and faster model support.
- r/LocalLLaMA widely recommended for discovery and practices, but multiple comments warn about misinformation and upvote‑driven groupthink.
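  Most of these servers speak an OpenAI-compatible HTTP API, so one client script can target any of them; a minimal sketch, assuming Ollama on its default port with a hypothetical qwen3:14b tag pulled (llama.cpp's llama-server and LM Studio expose the same style of endpoint on their own ports):

```python
# Sketch: point the standard OpenAI Python client (v1.x) at a local server.
# Assumptions: Ollama at :11434 with "qwen3:14b" available; the api_key value
# is ignored by local servers but required by the client constructor.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="not-needed-locally")

reply = client.chat.completions.create(
    model="qwen3:14b",
    messages=[{"role": "user", "content": "Summarize the tradeoffs of Q4 vs Q8 quantization."}],
)
print(reply.choices[0].message.content)
```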
Why Run Locally vs Cloud
- Pro-local:
  - Privacy (personal notes, family data, schedules, proprietary corp data).
  - Uncensored behavior and fewer refusals.
  - Cost predictability and offline capability.
  - Learning, experimentation, and building custom agents / RAG systems.
- Pro-cloud:
  - Top proprietary models (Claude/Gemini/GPT‑4‑class) are still markedly better and cheap per query.
  - Local models can require many iterations, making them slower in “time to acceptable answer.”
Hardware Notes
- 8GB VRAM: 7–8B models at Q4–Q6; larger models with heavy offload if you accept slow speeds.
- 16GB VRAM: comfortable with Qwen3‑14B or similar at Q4–FP8; 30B MoE possible with offload (rough sizing sketch below).
- Many suggest a used 24GB card (e.g., 3090) if you’re serious; others argue cloud GPUs or APIs are more rational than buying high‑end GPUs.
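  A rough sizing sketch for these VRAM tiers, using approximate bits-per-weight for common quant levels and an assumed ~2 GB of headroom for KV cache and overhead; note that the 30B MoE still needs all 30B weights resident even though only ~3B are active per token, which is why offload is more tolerable for it:

```python
# Rough check of which quantized weight files fit in a given VRAM budget.
# Bits-per-weight values approximate common GGUF quant levels; the 2 GB of
# headroom for KV cache and runtime overhead is an assumption.
QUANT_BITS = {"Q4": 4.5, "Q6": 6.6, "Q8": 8.5}
MODELS_B = {"Qwen3-8B": 8, "Qwen3-14B": 14, "Qwen3-30B-A3B": 30}   # total params, billions

def size_gb(params_b: float, bits: float) -> float:
    return params_b * 1e9 * bits / 8 / 1e9

for budget in (8, 16, 24):
    fits = [f"{m} {q}" for m, p in MODELS_B.items() for q, b in QUANT_BITS.items()
            if size_gb(p, b) <= budget - 2]
    print(f"{budget} GB VRAM: {', '.join(fits) or 'nothing without offload'}")
```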