Building a personal, private AI computer on a budget
Quantization, precision & model behavior
- Many argue the P40’s poor FP16 throughput isn’t critical because local setups usually run quantized models (Q4–Q8) to fit VRAM, often with negligible quality loss at Q6–Q8.
- Quantizing the KV cache (context) can greatly expand context length and reduce memory, but quality impact is model- and task-dependent.
- Some models (e.g., Command-R) handle KV quantization well; others (e.g., Qwen) can “go nuts,” especially on context‑sensitive tasks like translation or evaluation, while coding and creative use are more forgiving.
- There’s confusion over which precision is actually used at inference and how “standard” KV quantization is; the rough consensus is that it’s widely supported but not universally safe to enable.
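The memory argument for KV quantization can be sketched with the usual transformer KV‑cache formula (2 tensors, K and V, per layer). The model dimensions below are illustrative assumptions loosely shaped like an 8B Llama‑style model with GQA, not figures from the discussion; in llama.cpp this is the trade the `-ctk` / `-ctv` (`--cache-type-k` / `--cache-type-v`) flags make.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, ctx_len, bytes_per_elt):
    # 2x accounts for the separate K and V tensors cached per layer
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elt

# Assumed dims: 32 layers, 8 KV heads, head_dim 128, 32k context
fp16 = kv_cache_bytes(32, 8, 128, 32_768, 2)  # fp16 = 2 bytes/element
q8   = kv_cache_bytes(32, 8, 128, 32_768, 1)  # ~q8_0 ≈ 1 byte/element
print(f"fp16 KV cache: {fp16 / 2**30:.1f} GiB")  # 4.0 GiB
print(f"q8   KV cache: {q8 / 2**30:.1f} GiB")    # 2.0 GiB
```

Halving the per-element size roughly doubles the context that fits in the same VRAM, which is exactly why people reach for it despite the quality caveats above.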
Performance, tokens/sec & usability
- 4 tokens/sec on a 671B model is seen by some as “runs but unusable,” others say it’s fine for async, deep or overnight jobs, or agentic workflows.
- For interactive coding or long back‑and‑forth chats, many want ≥10–40 tok/s; sub‑10 tok/s feels sluggish, especially with large outputs.
- Single-user home setups typically run at batch=1. Several cloud comparisons note that $2k of API spend often buys billions of tokens at high speed, so heavy local models only “win” with sustained high usage or strict privacy needs.
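The usability thresholds above are easy to make concrete with simple arithmetic; the 1,000‑token answer length is an assumed example, not a figure from the thread.

```python
def generation_seconds(n_tokens, tok_per_sec):
    """Wall-clock time to stream n_tokens at a steady decode rate."""
    return n_tokens / tok_per_sec

# A 1,000-token answer at the decode speeds discussed:
for tps in (4, 10, 40):
    print(f"{tps:>2} tok/s -> {generation_seconds(1000, tps):.0f} s")
# 4 tok/s -> 250 s (~4 min: fine overnight, painful interactively)
# 10 tok/s -> 100 s
# 40 tok/s ->  25 s
```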
Hardware tradeoffs: GPUs, Apple silicon, and “budget”
- Used server GPUs (P4, P40, M40, K80, etc.) offer lots of VRAM cheaply but bring driver pain, missing CUDA/compute features, high power draw, and often poor performance per watt; some are effectively “toy slow.”
- Consumer GPUs (3090, 3060, 1080 Ti, 4090) generally outperform P40‑style cards, but high‑VRAM models are expensive; multi‑GPU setups must watch PCIe bandwidth and sharding strategies.
- eGPU over USB4/Thunderbolt can work surprisingly well if the whole model fits in VRAM: performance loss is negligible for LLM inference, around 10% for some PyTorch workloads.
- Apple M‑series (especially Mac mini / Studio) are highlighted as compelling: unified memory, decent bandwidth, low power, very simple setup.
- Counterpoints: weaker memory bandwidth vs high‑end NVIDIA, no CUDA so many research/codebases don’t “just work,” and some generative image models (e.g., Flux) are much slower.
- Several commenters feel calling a dual‑P40, ~€1700 build “budget” is misleading; “real budget” is closer to a single mid‑range GPU or repurposed existing hardware.
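The power-draw complaint about old server cards is worth quantifying. All numbers here are assumptions for illustration: the P40’s 250 W TDP is its rated maximum, the 100 W host overhead and €0.30/kWh price are placeholders, and real draw varies with load.

```python
def annual_power_cost_eur(watts, hours_per_day, eur_per_kwh):
    """Yearly electricity cost for a rig drawing `watts` continuously
    for `hours_per_day` hours."""
    return watts / 1000 * hours_per_day * 365 * eur_per_kwh

# Assumed: two P40s at their 250 W TDP plus ~100 W for the host,
# 8 h/day at €0.30/kWh
cost = annual_power_cost_eur(2 * 250 + 100, 8, 0.30)
print(f"~€{cost:.0f}/year")  # ~€526/year
```

On those assumptions, running costs alone approach a third of the ~€1700 build price every year, which is part of why the “budget” label gets pushback.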
Cloud vs local: cost, privacy, and risk
- Many see local LLMs as an enthusiast hobby with questionable ROI; renting GPUs or using APIs is usually cheaper per unit of useful work, especially as models and hardware become obsolete quickly.
- Some mix approaches: a local smaller model handles private/tool-calling tasks, escalating non‑sensitive heavy work to cloud models.
- Arguments for local:
- Strong privacy (avoiding ToS changes, data leaks, opaque “shadow prompting” and provider-side guardrails).
- Predictable spend vs cloud “tail risk” from misconfigured GPU instances.
- Arguments for cloud:
- Better models, higher speed, no hardware/driver headaches, easy to switch as new models appear.
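The ROI argument reduces to a break-even count. The €1,700 figure is the dual‑P40 build from the thread; the €1 per million tokens API price is an assumed round number, and the sketch deliberately ignores electricity, depreciation, and the owner’s time.

```python
def breakeven_tokens(hardware_cost, api_price_per_mtok):
    """Tokens you'd need to consume before the local hardware's
    purchase price matches equivalent API spend."""
    return hardware_cost / api_price_per_mtok * 1_000_000

# Assumed: €1,700 build vs an API at €1 per million tokens
tokens = breakeven_tokens(1700, 1.0)
print(f"{tokens / 1e9:.1f}B tokens to break even")  # 1.7B tokens
```

At a few thousand tokens per interactive session, reaching billions of tokens takes years of heavy use, which is why the cloud side usually wins on cost alone and local setups lean on privacy instead.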
Practical advice & ecosystem state
- VRAM needs exceed raw model size due to KV cache and context; over‑provisioning is common practice. Quantized variants (Q4/Q5/Q6) on HuggingFace often list real RAM requirements.
- Tiny 7–8B models on phones or low‑RAM laptops are often judged “tinkering only” for serious coding, though some report acceptable work use with careful model choice.
- Tools discussed: Ollama, llama.cpp (including distributed/RPC mode), ktransformers for MoE offload, Intel’s IPEX‑LLM, plus various UIs (OpenWebUI, LibreChat).
- Overall sentiment: homelab AI is fun and educational, but for most people it’s still a niche, fast‑moving, anecdote‑driven space rather than a clear economic win.
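The rule of thumb that VRAM needs exceed the raw model size can be sketched for the weights alone; KV cache and activation buffers come on top, hence the over‑provisioning habit. The 70B size and ~4.5 effective bits/weight for a Q4‑class GGUF quant are assumed illustrative values, not figures from the discussion.

```python
def quant_weight_gib(n_params_b, bits_per_weight):
    """Approximate size in GiB of the quantized weights alone
    (n_params_b = parameter count in billions)."""
    return n_params_b * 1e9 * bits_per_weight / 8 / 2**30

# Assumed: a 70B model at ~4.5 effective bits/weight (Q4-class quant
# including metadata overhead) -- before KV cache and runtime buffers
print(f"{quant_weight_gib(70, 4.5):.0f} GiB of weights")
```

A result in the high‑30s of GiB already exceeds a single 24 GB card, which is why these builds gravitate toward multi‑GPU or unified‑memory machines.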