Building a personal, private AI computer on a budget
Quantization, precision & model behavior
- Many argue the P40’s poor FP16 throughput isn’t critical because local setups usually run quantized models (Q4–Q8) to fit VRAM, often with negligible quality loss at Q6–Q8.
- Quantizing the KV cache (context) can greatly expand context length and reduce memory, but quality impact is model- and task-dependent.
- Some models (e.g., Command-R) handle KV quantization well; others (e.g., Qwen) can “go nuts,” especially on context‑sensitive tasks like translation or evaluation, while coding and creative use are more forgiving.
- There’s confusion over which precision is actually used at inference and how “standard” KV quantization is; the rough consensus is that it’s widely supported but not universally safe to enable.
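The memory argument for KV quantization can be sketched with the usual transformer KV‑cache formula (2 tensors, K and V, per layer). The model dimensions below are illustrative assumptions loosely shaped like an 8B Llama‑style model with GQA, not figures from the discussion; in llama.cpp this is the trade the `-ctk` / `-ctv` (`--cache-type-k` / `--cache-type-v`) flags make.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, ctx_len, bytes_per_elt):
    # 2x accounts for the separate K and V tensors cached per layer
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elt

# Assumed dims: 32 layers, 8 KV heads, head_dim 128, 32k context
fp16 = kv_cache_bytes(32, 8, 128, 32_768, 2)  # fp16 = 2 bytes/element
q8   = kv_cache_bytes(32, 8, 128, 32_768, 1)  # ~q8_0 ≈ 1 byte/element
print(f"fp16 KV cache: {fp16 / 2**30:.1f} GiB")  # 4.0 GiB
print(f"q8   KV cache: {q8 / 2**30:.1f} GiB")    # 2.0 GiB
```

Halving the per-element size roughly doubles the context that fits in the same VRAM, which is exactly why people reach for it despite the quality caveats above.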
Performance, tokens/sec & usability
- 4 tokens/sec on a 671B model is seen by some as “runs but unusable,” others say it’s fine for async, deep or overnight jobs, or agentic workflows.
- For interactive coding or long back‑and‑forth chats, many want ≥10–40 tok/s; sub‑10 tok/s feels sluggish, especially with large outputs.
- Single-user home setups typically run at batch=1. Several cloud comparisons note that $2k of API spend often buys billions of tokens at high speed, so heavy local models only “win” with sustained high usage or strict privacy needs.
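The usability thresholds above are easy to make concrete with simple arithmetic; the 1,000‑token answer length is an assumed example, not a figure from the thread.

```python
def generation_seconds(n_tokens, tok_per_sec):
    """Wall-clock time to stream n_tokens at a steady decode rate."""
    return n_tokens / tok_per_sec

# A 1,000-token answer at the decode speeds discussed:
for tps in (4, 10, 40):
    print(f"{tps:>2} tok/s -> {generation_seconds(1000, tps):.0f} s")
# 4 tok/s -> 250 s (~4 min: fine overnight, painful interactively)
# 10 tok/s -> 100 s
# 40 tok/s ->  25 s
```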
Hardware tradeoffs: GPUs, Apple silicon, and “budget”
- Used server GPUs (P4, P40, M40, K80, etc.) offer lots of VRAM cheaply but bring driver pain, missing CUDA/compute features, high power draw, and often poor performance per watt; some are effectively “toy slow.”
- Consumer GPUs (3090, 3060, 1080 Ti, 4090) generally outperform P40‑style cards, but high‑VRAM models are expensive; multi‑GPU setups must watch PCIe bandwidth and sharding strategies.
- eGPU over USB4/Thunderbolt can work surprisingly well if the whole model fits in VRAM: performance loss is negligible for LLM inference, around 10% for some PyTorch workloads.
- Apple M‑series (especially Mac mini / Studio) are highlighted as compelling: unified memory, decent bandwidth, low power, very simple setup.
- Counterpoints: weaker memory bandwidth vs high‑end NVIDIA, no CUDA so many research/codebases don’t “just work,” and some generative image models (e.g., Flux) are much slower.
- Several commenters feel calling a dual‑P40, ~€1700 build “budget” is misleading; “real budget” is closer to a single mid‑range GPU or repurposed existing hardware.
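The power-draw complaint about old server cards is worth quantifying. All numbers here are assumptions for illustration: the P40’s 250 W TDP is its rated maximum, the 100 W host overhead and €0.30/kWh price are placeholders, and real draw varies with load.

```python
def annual_power_cost_eur(watts, hours_per_day, eur_per_kwh):
    """Yearly electricity cost for a rig drawing `watts` continuously
    for `hours_per_day` hours."""
    return watts / 1000 * hours_per_day * 365 * eur_per_kwh

# Assumed: two P40s at their 250 W TDP plus ~100 W for the host,
# 8 h/day at €0.30/kWh
cost = annual_power_cost_eur(2 * 250 + 100, 8, 0.30)
print(f"~€{cost:.0f}/year")  # ~€526/year
```

On those assumptions, running costs alone approach a third of the ~€1700 build price every year, which is part of why the “budget” label gets pushback.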
Cloud vs local: cost, privacy, and risk
- Many see local LLMs as an enthusiast hobby with questionable ROI; renting GPUs or using APIs is usually cheaper per unit of useful work, especially as models and hardware become obsolete quickly.
- Some mix approaches: a local smaller model handles private/tool-calling tasks, escalating non‑sensitive heavy work to cloud models.
- Arguments for local:
- Strong privacy (avoiding ToS changes, data leaks, opaque “shadow prompting” and provider-side guardrails).
- Predictable spend vs cloud “tail risk” from misconfigured GPU instances.
- Arguments for cloud:
- Better models, higher speed, no hardware/driver headaches, easy to switch as new models appear.
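The ROI argument reduces to a break-even count. The €1,700 figure is the dual‑P40 build from the thread; the €1 per million tokens API price is an assumed round number, and the sketch deliberately ignores electricity, depreciation, and the owner’s time.

```python
def breakeven_tokens(hardware_cost, api_price_per_mtok):
    """Tokens you'd need to consume before the local hardware's
    purchase price matches equivalent API spend."""
    return hardware_cost / api_price_per_mtok * 1_000_000

# Assumed: €1,700 build vs an API at €1 per million tokens
tokens = breakeven_tokens(1700, 1.0)
print(f"{tokens / 1e9:.1f}B tokens to break even")  # 1.7B tokens
```

At a few thousand tokens per interactive session, reaching billions of tokens takes years of heavy use, which is why the cloud side usually wins on cost alone and local setups lean on privacy instead.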
Practical advice & ecosystem state
- VRAM needs exceed raw model size due to KV cache and context; over‑provisioning is common practice. Quantized variants (Q4/Q5/Q6) on HuggingFace often list real RAM requirements.
- Tiny 7–8B models on phones or low‑RAM laptops are often judged “tinkering only” for serious coding, though some report acceptable work use with careful model choice.
- Tools discussed: Ollama, llama.cpp (including distributed/RPC mode), ktransformers for MoE offload, Intel’s IPEX‑LLM, plus various UIs (OpenWebUI, LibreChat).
- Overall sentiment: homelab AI is fun and educational, but for most people it’s still a niche, fast‑moving, anecdote‑driven space rather than a clear economic win.
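The rule of thumb that VRAM needs exceed the raw model size can be sketched for the weights alone; KV cache and activation buffers come on top, hence the over‑provisioning habit. The 70B size and ~4.5 effective bits/weight for a Q4‑class GGUF quant are assumed illustrative values, not figures from the discussion.

```python
def quant_weight_gib(n_params_b, bits_per_weight):
    """Approximate size in GiB of the quantized weights alone
    (n_params_b = parameter count in billions)."""
    return n_params_b * 1e9 * bits_per_weight / 8 / 2**30

# Assumed: a 70B model at ~4.5 effective bits/weight (Q4-class quant
# including metadata overhead) -- before KV cache and runtime buffers
print(f"{quant_weight_gib(70, 4.5):.0f} GiB of weights")
```

A result in the high‑30s of GiB already exceeds a single 24 GB card, which is why these builds gravitate toward multi‑GPU or unified‑memory machines.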