Building a personal, private AI computer on a budget

Quantization, precision & model behavior

  • Many argue the P40’s poor FP16 throughput isn’t critical because local setups usually run quantized models (Q4–Q8) to fit VRAM, often with negligible quality loss at Q6–Q8.
  • Quantizing the KV cache (context) can greatly expand context length and reduce memory, but quality impact is model- and task-dependent.
    • Some models (e.g., Command-R) handle KV quantization well; others (e.g., Qwen) can “go nuts,” especially on context‑sensitive tasks like translation or evaluation. Coding and creative use are more forgiving.
  • There’s confusion over the precision actually used at inference and over how “standard” KV quantization is; rough consensus: it’s widely supported but not universally safe to enable.
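As a rule of thumb, the VRAM a quantized model’s weights need is roughly parameters × bits‑per‑weight ÷ 8. A minimal sketch; the bits‑per‑weight figures are approximations (real GGUF files carry metadata overhead and mixed-precision layers):

```python
# Rough weight-memory estimate for a quantized model; rule of thumb only.
def model_gb(params_b: float, bits_per_weight: float) -> float:
    """Approximate weight memory in GB: params (billions) * bits / 8."""
    return params_b * bits_per_weight / 8

# A 70B model at common quant levels (effective bits-per-weight are
# approximate and vary by quant scheme):
for name, bpw in [("Q4", 4.5), ("Q6", 6.5), ("Q8", 8.5)]:
    print(f"70B at {name}: ~{model_gb(70, bpw):.0f} GB")
```

This is weights only; KV cache and runtime overhead come on top, which is why Q4–Q6 is often the practical ceiling for 24 GB cards.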

Performance, tokens/sec & usability

  • 4 tokens/sec on a 671B model is seen by some as “runs, but unusable”; others say it’s fine for async, deep, or overnight jobs, or for agentic workflows.
  • For interactive coding or long back‑and‑forth chats, many want ≥10–40 tok/s; sub‑10 tok/s feels sluggish, especially with large outputs.
  • Single-user home setups are typically batch=1; many cloud comparisons note that for $2k of API usage you can often buy billions of tokens at high speed, so local heavy models only “win” if you have sustained, high usage or strict privacy needs.
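The speed thresholds above translate directly into wait time. A quick sketch of how long a single answer takes at each decode speed (the 1,500-token output length is chosen for illustration):

```python
# Wall-clock wait for one response at a given decode speed -- shows why
# sub-10 tok/s feels sluggish interactively but is fine for overnight jobs.
def wait_seconds(output_tokens: int, tok_per_s: float) -> float:
    return output_tokens / tok_per_s

for speed in (4, 10, 40):
    print(f"{speed:>3} tok/s -> {wait_seconds(1500, speed):.0f} s per 1,500-token answer")
```

At 4 tok/s a long answer takes over six minutes; at 40 tok/s, under a minute, which matches the interactive-coding threshold cited above.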

Hardware tradeoffs: GPUs, Apple silicon, and “budget”

  • Used server GPUs (P40, M40, K80, P4, etc.) offer lots of VRAM cheaply but bring driver pain, missing CUDA/compute features, high power draw, and often poor perf per watt; some are effectively “toy slow.”
  • Consumer GPUs (3090, 3060, 1080 Ti, 4090) generally outperform P40‑style cards, but high‑VRAM models are expensive; multi‑GPU setups must watch PCIe bandwidth and sharding strategies.
  • eGPU over USB4/Thunderbolt can work surprisingly well if the whole model fits in VRAM (LLMs see negligible perf loss; ~10% for some PyTorch workloads).
  • Apple M‑series (especially Mac mini / Studio) are highlighted as compelling: unified memory, decent bandwidth, low power, very simple setup.
    • Counterpoints: weaker memory bandwidth vs high‑end NVIDIA, no CUDA so many research/codebases don’t “just work,” and some generative image models (e.g., Flux) are much slower.
  • Several commenters feel calling a dual‑P40, ~€1700 build “budget” is misleading; “real budget” is closer to a single mid‑range GPU or repurposed existing hardware.
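A rule of thumb behind these hardware comparisons: single-user (batch=1) decode is usually memory-bandwidth bound, since every generated token requires reading all weights once, so tokens/sec is capped at roughly bandwidth ÷ model size. A sketch using approximate published bandwidth specs (real throughput lands somewhat lower):

```python
# Bandwidth-bound decode ceiling: tok/s <= memory bandwidth / model bytes.
def max_tok_per_s(bandwidth_gb_s: float, model_gb: float) -> float:
    return bandwidth_gb_s / model_gb

model = 39  # e.g. a ~70B model at Q4 (~39 GB of weights)
# Approximate published memory-bandwidth specs (GB/s):
for device, bw in [("Tesla P40", 347), ("RTX 3090", 936), ("M2 Ultra", 800)]:
    print(f"{device}: ceiling ~{max_tok_per_s(bw, model):.0f} tok/s")
```

This is why Apple silicon with fast unified memory can be competitive with older server GPUs despite lacking CUDA, and why a 3090 handily beats a P40 even at equal VRAM.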

Cloud vs local: cost, privacy, and risk

  • Many see local LLMs as an enthusiast hobby with questionable ROI; renting GPUs or using APIs is usually cheaper per unit of useful work, especially as models and hardware become obsolete quickly.
  • Some mix approaches: a local smaller model handles private/tool-calling tasks, escalating non‑sensitive heavy work to cloud models.
  • Arguments for local:
    • Strong privacy (avoiding ToS changes, data leaks, opaque “shadow prompting” and provider-side guardrails).
    • Predictable spend vs cloud “tail risk” from misconfigured GPU instances.
  • Arguments for cloud:
    • Better models, higher speed, no hardware/driver headaches, easy to switch as new models appear.

Practical advice & ecosystem state

  • VRAM needs exceed raw model size due to KV cache and context; over‑provisioning is common practice. Quantized variants (Q4/Q5/Q6) on HuggingFace often list real RAM requirements.
  • Tiny 7–8B models on phones or low‑RAM laptops are often judged “tinkering only” for serious coding, though some report acceptable work use with careful model choice.
  • Tools discussed: Ollama, llama.cpp (including distributed/RPC mode), ktransformers for MoE offload, Intel’s IPEX‑LLM, plus various UIs (OpenWebUI, LibreChat).
  • Overall sentiment: homelab AI is fun and educational, but for most people it’s still a niche, fast‑moving, anecdote‑driven space rather than a clear economic win.
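The KV-cache point above (VRAM needs exceed raw model size) can be made concrete: cache size scales as 2 × layers × KV heads × head dim × context length × bytes per element. A sketch using Llama‑3‑70B‑style shapes (80 layers, 8 KV heads via GQA, head dim 128) as an illustrative assumption:

```python
# KV-cache memory grows linearly with context length -- this is the gap
# between raw model size and actual VRAM needed.
def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                context: int, bytes_per_elem: int) -> float:
    # K and V each store layers * kv_heads * head_dim values per token.
    return 2 * layers * kv_heads * head_dim * context * bytes_per_elem / 1024**3

print(f"fp16 KV, 32k ctx: {kv_cache_gb(80, 8, 128, 32768, 2):.1f} GB")
print(f"q8 KV,   32k ctx: {kv_cache_gb(80, 8, 128, 32768, 1):.1f} GB")
```

Quantizing the KV cache from fp16 to 8-bit halves this overhead, which is why the technique is attractive despite the model-dependent quality risks noted earlier.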