Building an AI server on a budget
GPU Choice, VRAM, and Bandwidth
- Many think a 12GB RTX 4070 is a poor long‑term choice for LLMs; 16–32GB+ VRAM is repeatedly cited as the practical minimum for “interesting” models.
- Several argue a used 3090 (24GB) or 4060 Ti 16GB gives better VRAM-per-dollar than a 4070, especially for at‑home inference.
- Others point to older server, workstation, and mining GPUs (Tesla M40, K80, A4000s, MI-series, etc.) as strong VRAM-per-dollar options, but note high power use, heat, and low raw speed.
- A substantial subthread emphasizes that memory bandwidth, not just VRAM size, largely determines token generation speed; low-bandwidth cards (e.g. the 4060 Ti) are criticized for LLM work (see the rough sketch after this list).
- Upcoming Intel workstation GPUs (e.g. B50/B60) excite some as possible cheap, VRAM-heavy inference cards that could reshape the home‑AI market.
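A back-of-envelope way to see the bandwidth argument: single-stream token generation is largely memory-bound, so an upper bound on tokens/s is memory bandwidth divided by the bytes streamed per token (roughly the size of the quantized weights). The sketch below uses published bandwidth specs and assumes a hypothetical ~7.5GB quantized model; real-world throughput will be lower.

```python
# Rough estimate, not a benchmark: for memory-bound decoding, each
# generated token requires streaming roughly the full set of weights
# from VRAM, so tokens/s <= bandwidth / model_size_in_bytes.

GPU_BANDWIDTH_GBPS = {   # published spec-sheet figures, rounded
    "RTX 4060 Ti": 288,
    "RTX 4070": 504,
    "RTX 3090": 936,
}

MODEL_BYTES = 7.5e9  # assumed ~13B-class model at ~4-bit quantization

for gpu, bw in GPU_BANDWIDTH_GBPS.items():
    est_tps = (bw * 1e9) / MODEL_BYTES
    print(f"{gpu}: ~{est_tps:.0f} tokens/s upper bound")
```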
System RAM and Overall Build
- Multiple commenters say 32GB system RAM is insufficient for serious experimentation; 64GB is framed as a practical minimum, 128GB+ ideal.
- Some are puzzled that people obsess over CPUs yet “cheap out” on RAM; a few share builds with 96GB+.
- ECC RAM is recommended by a few for reliability.
Cloud vs Local Economics
- Several argue owning hardware is rarely cheaper than APIs once electricity and datacenter efficiency are considered; local rigs are seen more as a hobby or a choice for privacy/control (a rough break-even sketch follows this list).
- Others note short‑term GPU rentals (RunPod, etc.) as a better use of a ~$1.3k budget if you’re mostly doing inference.
- For expensive frontier APIs (e.g. Claude Code) some wonder if 24/7 heavy use might justify local hardware, but consensus remains skeptical that home setups beat datacenters economically.
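As a rough illustration of the break-even argument (every number below is an assumption made for the sake of arithmetic, not a figure from the article or thread):

```python
# Hypothetical break-even sketch for the "local vs API" debate.
hardware_cost_usd = 1300          # the budget discussed in the article
power_draw_w = 350                # assumed average draw under load
electricity_usd_per_kwh = 0.15    # assumed local rate
api_usd_per_mtok = 10.0           # assumed blended API price per 1M tokens
local_tokens_per_s = 30           # assumed local throughput

# Electricity cost of generating one million tokens locally.
seconds_per_mtok = 1e6 / local_tokens_per_s
local_kwh_per_mtok = (power_draw_w / 1000) * seconds_per_mtok / 3600
local_usd_per_mtok = local_kwh_per_mtok * electricity_usd_per_kwh

# Tokens needed before the hardware pays for itself
# (ignores depreciation, resale value, and the value of privacy/control).
savings_per_mtok = api_usd_per_mtok - local_usd_per_mtok
breakeven_mtok = hardware_cost_usd / savings_per_mtok

print(f"Local electricity cost: ~${local_usd_per_mtok:.2f} per 1M tokens")
print(f"Break-even after ~{breakeven_mtok:.0f}M generated tokens")
```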
Alternate Architectures and Rigs
- Examples include:
- 7× RTX 3060 (12GB each) in a rack for 84GB VRAM, heavily power‑optimized but PCIe‑bandwidth limited.
- Old mining motherboards with multiple Teslas and cheap server PSUs.
- Huge‑RAM CPU‑only servers (1.5–2TB) running 671B‑parameter models, but at ~0.5 tokens/s and with NUMA bottlenecks.
- Unified-memory systems (Macs, Strix Halo, future DGX-style boxes) are discussed; they fit large models but often have lower memory bandwidth than discrete GPUs and thus slower token rates.
Practical Limits and Use Cases
- Many insist 12GB VRAM is too limiting for modern, high‑quality models; others ask what useful things people have actually done with such constraints.
- Reported home uses include:
- Moderate‑size LLMs for experimentation, function calling, and Home Assistant integration (a minimal local-API sketch follows this list).
- Image generation and classification (e.g. NSFW filtering on user content).
- Slow but workable local use on very old or low‑power hardware for curiosity.
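For the “what can you actually do with it” question, most of these integrations boil down to calling a locally hosted model over an OpenAI-compatible HTTP endpoint, which servers such as Ollama and llama.cpp’s llama-server expose. A minimal sketch, with the URL, port, and model tag as placeholder assumptions for whatever you actually run:

```python
import requests

# Query a locally hosted model over the OpenAI-compatible chat endpoint.
# The port below is Ollama's default; adjust URL and model tag to your setup.
resp = requests.post(
    "http://localhost:11434/v1/chat/completions",
    json={
        "model": "llama3.1:8b",  # placeholder model tag
        "messages": [
            {"role": "user", "content": "Turn on the living room lights."}
        ],
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```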
Software and Setup Issues
- Installing CUDA via distro repositories vs Nvidia’s installers is debated; newer toolkits can conflict with the CUDA versions ML libraries expect and are painful to manage (a quick sanity-check sketch follows this list).
- Some users struggle with CUDA/cuDNN setup enough to give up; others rely on LLMs to walk them through Linux, drivers, and BIOS issues.
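As a quick sanity check after wrestling with installers, a sketch like the following (assuming a CUDA-enabled PyTorch build is installed) prints whether the driver, toolkit, and cuDNN that PyTorch was built against are actually visible:

```python
import torch

# Verify that CUDA and cuDNN are usable from Python after setup.
print("CUDA available:", torch.cuda.is_available())
print("PyTorch built for CUDA:", torch.version.cuda)
print("cuDNN version:", torch.backends.cudnn.version())
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))
```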
Article Content and Audience
- A few readers dislike sections that feel LLM‑generated or rehash generic PC‑building advice; they lose trust when content looks autogenerated.
- Others defend the step‑by‑step build details as ideal for beginners (e.g. people who’ve never built a PC or used Linux), especially when methodology and AI assistance are disclosed.