Ask HN: What is the best LLM for consumer grade hardware?

No Single “Best” Model

  • Commenters stress there is no universally best local LLM; quality varies heavily by task (chat, coding, math, RP, RAG, etc.).
  • Strong advice: download several current models, build your own private benchmarks around your actual use cases, and choose empirically (see the harness sketch after this list).
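
A minimal harness for that kind of private benchmark might look like the sketch below. It assumes a local server exposing an OpenAI-compatible endpoint (llama.cpp's llama-server, Ollama, LM Studio, and vLLM all can do this); the base URL, model IDs, and prompts are placeholders to swap for your own setup and tasks.

```python
# Tiny private-benchmark harness: run the same prompts against several
# local models and score or eyeball the answers yourself.
# Assumes an OpenAI-compatible local server; the URL and model names
# below are placeholders for whatever you actually run.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

MODELS = ["qwen3-14b-q4", "gemma3-12b-q4"]   # hypothetical local model IDs
TASKS = [
    ("summarize", "Summarize in one sentence: ..."),  # fill in your own prompts
    ("code", "Write a Python function that reverses a linked list."),
]

for model in MODELS:
    for name, prompt in TASKS:
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            temperature=0.2,
            max_tokens=512,
        )
        answer = resp.choices[0].message.content
        print(f"--- {model} / {name} ---\n{answer}\n")
```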

Popular Local Models Mentioned

  • Qwen3 family:
    • Qwen3-8B and the DeepSeek-R1-0528-Qwen3-8B distill praised for strong reasoning at 8B.
    • Qwen3-14B recommended as a good “main” model for 16GB VRAM (Q4 or FP8).
    • Qwen3-30B-A3B (MoE) cited as very strong yet usable on constrained VRAM via offload.
  • Gemma3:
    • Gemma3-12B often cited as a good conversationalist, but with more hallucination and stronger safety filters.
  • Mistral:
    • Mistral Small / Nemo / Devstral mentioned for coding, routing, and relatively uncensored behavior.
  • Others:
    • Qwen2.5-Coder 14B for coding.
    • SmolVLM-500M for tiny setups.
    • LLaMA 3.x, Phi-4, various “uncensored”/“abliterated” fine-tunes for people wanting fewer refusals.
    • Live leaderboards (e.g., LiveBench for coding) suggested for up‑to‑date rankings.

Quantization, VRAM, and Context

  • Core tradeoff: parameters vs quantization vs context length vs speed:
    • Rule of thumb: with 8GB VRAM, aim around 7–8B params at Q4–Q6; with 16GB, 14B dense or 30B MoE at Q4.
    • Very low-bit quantization (roughly 3–4 bits and below) can work if done carefully, but naive low-bit quantization often causes repetition and instability.
  • Context is expensive: the KV cache stores key/value vectors for every token at every layer, so memory grows linearly with context length and huge contexts quickly consume VRAM (see the estimator sketch after this list).
  • CPU/RAM offload works but is much slower; some report offloading specific tensors or “hot” parts as a promising optimization.
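
As a rough worked example of the tradeoffs above, the sketch below estimates the two big memory costs: quantized weights and the KV cache. The formulas are the usual approximations (weights ≈ parameters × bits per weight; KV cache ≈ 2 × layers × KV heads × head dim × bytes × tokens), and the architecture numbers for the example 14B model are illustrative, not exact specs.

```python
# Back-of-the-envelope VRAM estimate: quantized weights + KV cache.
# The formulas are standard approximations; the example architecture
# numbers below are illustrative, not the exact specs of any model.

def weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Memory for the quantized weights, ignoring small overheads."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                context_tokens: int, bytes_per_elem: int = 2) -> float:
    """KV cache: keys + values, stored per token per layer (fp16 by default)."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return per_token * context_tokens / 1e9

# Example: a ~14B dense model at ~4.5 bits/weight (Q4_K_M-ish) with 32k context.
weights = weight_gb(14, 4.5)
kv = kv_cache_gb(n_layers=40, n_kv_heads=8, head_dim=128, context_tokens=32_768)
print(f"weights ≈ {weights:.1f} GB, KV ≈ {kv:.1f} GB, total ≈ {weights + kv:.1f} GB")
```

With these assumed numbers, a 14B model at ~4.5 bits plus a 32k-token fp16 KV cache lands around 13 GB, which is roughly why 16GB cards are the commonly cited comfort zone for 14B at Q4.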

Runtimes, Frontends, and Communities

  • Common stacks: llama.cpp (and variants like KoboldCPP), vLLM, Ollama, LM Studio, OpenWebUI, GPT4All, Jan.ai, AnythingLLM, SillyTavern.
  • LM Studio and OpenWebUI highlighted for ease of use; some raised concerns that both have drifted toward closed/proprietary licensing.
  • Ollama praised as an easy model server that plays well with many UIs (see the example after this list); some prefer raw llama.cpp for transparency and faster model support.
  • r/LocalLLaMA widely recommended for model discovery and best practices, but multiple comments warn about misinformation and upvote‑driven groupthink.
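
As one concrete example of the model-server pattern, Ollama listens on localhost and exposes a small HTTP API (plus an OpenAI-compatible /v1 endpoint), so any script or frontend can talk to whatever model it is serving. A minimal sketch, assuming the default port and that the example model tag has already been pulled:

```python
# Minimal client for a local Ollama server (default port 11434).
# The model tag below is an example; use whatever you have pulled locally.
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "qwen3:14b",          # example tag; replace with a pulled model
        "prompt": "Explain KV cache memory use in two sentences.",
        "stream": False,               # return one JSON object instead of a stream
    },
    timeout=300,
)
resp.raise_for_status()
print(resp.json()["response"])
```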

Why Run Locally vs Cloud

  • Pro-local:
    • Privacy (personal notes, family data, schedules, proprietary corp data).
    • Uncensored behavior and fewer refusals.
    • Cost predictability and offline capability.
    • Learning, experimentation, and building custom agents / RAG systems.
  • Pro-cloud:
    • Top proprietary models (Claude/Gemini/GPT‑4‑class) are still markedly better and cheap per query.
    • Local models can require many iterations, making them slower in “time to acceptable answer.”

Hardware Notes

  • 8GB VRAM: 7–8B models at Q4–Q6; larger models run with heavy CPU offload if you accept slow speeds (see the offload sketch below).
  • 16GB VRAM: comfortable with Qwen3‑14B or similar at Q4–FP8; 30B MoE possible with offload.
  • Many suggest a used 24GB card (e.g., 3090) if you’re serious; others argue cloud GPUs or APIs are more rational than buying high‑end GPUs.
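
To make the heavy-offload option concrete, here is a sketch using llama-cpp-python, which can keep only some transformer layers in VRAM and leave the rest in system RAM; the model path, layer count, and context size are placeholders to tune for your own card.

```python
# Split a GGUF model between GPU and CPU with llama-cpp-python.
# n_gpu_layers controls how many transformer layers live in VRAM;
# the rest stay in system RAM (slower, but lets bigger models fit).
from llama_cpp import Llama

llm = Llama(
    model_path="./models/qwen3-30b-a3b-q4_k_m.gguf",  # placeholder path
    n_gpu_layers=20,   # tune down until it fits in your VRAM
    n_ctx=8192,        # context length; larger costs more memory
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Give me three coding test prompts."}],
    max_tokens=256,
)
print(out["choices"][0]["message"]["content"])
```

Expect a large speed hit once a meaningful fraction of the layers lives in system RAM; the usual approach is to raise n_gpu_layers until you run out of VRAM, then back off.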