Building an AI server on a budget

GPU Choice, VRAM, and Bandwidth

  • Many think a 12GB RTX 4070 is a poor long‑term choice for LLMs; 16–32GB+ VRAM is repeatedly cited as the practical minimum for “interesting” models.
  • Several argue a used 3090 (24GB) or 4060 Ti 16GB gives better VRAM-per-dollar than a 4070, especially for at‑home inference.
  • Others point to older server / mining GPUs (Tesla M40, K80, A4000s, MI-series, etc.) as strong VRAM-per-dollar options, but note high power use, heat, and low raw speed.
  • A substantial subthread emphasizes that memory bandwidth, not just VRAM size, heavily affects token generation speed; low-bandwidth cards (e.g. the 4060 Ti) are criticized for LLM work (see the rough estimate after this list).
  • Upcoming Intel workstation GPUs (e.g. B50/B60) excite some as possible cheap, VRAM-heavy inference cards that could reshape the home‑AI market.
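
The bandwidth point lends itself to a back-of-envelope check: during single-stream decoding, the model weights are streamed from memory for every generated token, so throughput is roughly bounded by bandwidth divided by the bytes read per token. The sketch below is an estimator under assumed numbers (4-bit weights for a ~13B model, published peak bandwidths, an arbitrary ~60% efficiency factor), not a benchmark.

    # Back-of-envelope decode speed: tokens/s is capped near bandwidth / bytes-read-per-token.
    # All figures are illustrative assumptions, not measurements.

    def est_tokens_per_s(model_gb: float, bandwidth_gbps: float, efficiency: float = 0.6) -> float:
        """Upper-bound single-stream decode rate for a memory-bound model."""
        return bandwidth_gbps * efficiency / model_gb

    cards = {
        "RTX 4060 Ti 16GB (~288 GB/s)": 288,
        "RTX 4070 12GB (~504 GB/s)": 504,
        "RTX 3090 24GB (~936 GB/s)": 936,
    }

    model_gb = 8.0  # assumed: a ~13B model at 4-bit is roughly 8 GB of weights
    for name, bw in cards.items():
        print(f"{name}: ~{est_tokens_per_s(model_gb, bw):.0f} tok/s ceiling")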

System RAM and Overall Build

  • Multiple commenters say 32GB of system RAM is insufficient for serious experimentation; 64GB is framed as a practical minimum and 128GB+ as ideal (a rough footprint estimate follows this list).
  • Some are puzzled that builders obsess over CPU choice yet “cheap out” on RAM; a few share builds with 96GB+.
  • ECC RAM is recommended by a few for reliability.
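
To make the RAM argument concrete, here is a rough footprint estimate under assumed bytes-per-parameter figures (with a flat allowance for OS, runtime, and context, and ignoring KV-cache growth with long contexts). It shows how quickly a 32GB budget runs out once models reach the 30–70B range.

    # Rough model footprint vs. an available RAM/VRAM budget; all figures are approximations.
    BYTES_PER_PARAM = {"fp16": 2.0, "q8": 1.05, "q4": 0.55}  # q4/q8 include some format overhead

    def fits(params_b: float, quant: str, budget_gb: float, overhead_gb: float = 6.0) -> str:
        """params_b is billions of parameters; overhead_gb covers OS, runtime, and context."""
        need = params_b * BYTES_PER_PARAM[quant] + overhead_gb
        verdict = "fits" if need <= budget_gb else "does not fit"
        return f"{params_b:g}B @ {quant}: ~{need:.0f} GB needed vs {budget_gb} GB -> {verdict}"

    for params in (13, 32, 70):
        print(fits(params, "q4", 32))
        print(fits(params, "q4", 64))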

Cloud vs Local Economics

  • Several argue owning hardware is rarely cheaper than APIs once electricity and datacenter efficiency are considered; local rigs are seen more as a hobby or a privacy/control choice (a rough cost model follows this list).
  • Others suggest short‑term GPU rentals (RunPod, etc.) as a better use of a ~$1.3k budget if you’re mostly doing inference.
  • For expensive frontier APIs (e.g. Claude Code) some wonder if 24/7 heavy use might justify local hardware, but consensus remains skeptical that home setups beat datacenters economically.
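
As a sanity check on the economics argument, here is a hedged cost model. Every input is an assumption to replace with your own numbers: hardware price, lifetime, utilization, power draw, electricity rate, and decode throughput; compare the result against the per-million-token output price of whichever API you would otherwise use.

    # Rough local cost per million generated tokens: amortized hardware plus electricity.
    # All inputs are assumptions; low utilization is what usually kills the economics.

    def local_cost_per_mtok(hw_usd: float, life_years: float, utilization: float,
                            watts: float, usd_per_kwh: float, tok_per_s: float) -> float:
        productive_hours = life_years * 365 * 24 * utilization
        amortization_per_hour = hw_usd / productive_hours
        electricity_per_hour = watts / 1000 * usd_per_kwh
        hours_per_mtok = 1e6 / tok_per_s / 3600
        return hours_per_mtok * (amortization_per_hour + electricity_per_hour)

    # ~$1.3k build, 3-year life, busy 25% of the time, 350W under load, $0.30/kWh, 40 tok/s
    print(f"~${local_cost_per_mtok(1300, 3, 0.25, 350, 0.30, 40):.2f} per million generated tokens")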

Alternate Architectures and Rigs

  • Examples include:
    • 7× RTX 3060 (12GB each) in a rack for 84GB VRAM, heavily power‑optimized but PCIe‑bandwidth limited.
    • Old mining motherboards with multiple Teslas and cheap server PSUs.
    • Huge‑RAM CPU‑only servers (1.5–2TB) running 671B‑parameter models, but at ~0.5 tokens/s and with NUMA bottlenecks.
  • Unified-memory systems (Macs, Strix Halo, future DGX-style boxes) are discussed; they can hold large models but often have relatively low bandwidth and thus slow token rates (the same bandwidth arithmetic is sketched after this list).
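
The same memory-bound estimate used above applies to these rigs: a dense model reads essentially all of its weights per generated token (an MoE only its active experts), so even multi-hundred-GB/s unified memory yields single-digit token rates on large dense models, and NUMA-hampered CPU servers can fall below 1 tok/s. The bandwidth figures below are approximate assumptions.

    # The same memory-bound estimate, applied to CPU-only and unified-memory boxes.
    # Bandwidth numbers and bytes-read-per-token are rough assumptions.

    def est_tokens_per_s(read_gb_per_token: float, bandwidth_gbps: float,
                         efficiency: float = 0.5) -> float:
        return bandwidth_gbps * efficiency / read_gb_per_token

    systems = {
        "dual-socket DDR4 server (~200 GB/s peak, less with NUMA misses)": 200,
        "Strix Halo-class unified memory (~256 GB/s)": 256,
        "Apple M-series Max (~400 GB/s)": 400,
    }
    read_gb = 40.0  # assumed: a ~70B dense model at 4-bit reads roughly 40 GB per token
    for name, bw in systems.items():
        print(f"{name}: ~{est_tokens_per_s(read_gb, bw):.1f} tok/s ceiling")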

Practical Limits and Use Cases

  • Many insist 12GB VRAM is too limiting for modern, high‑quality models; others ask what useful things people have actually done with such constraints.
  • Reported home uses include:
    • Moderate‑size LLMs for experimentation, function calling, and Home Assistant integration (a minimal local‑endpoint sketch follows this list).
    • Image generation and classification (e.g. NSFW filtering on user content).
    • Slow but workable local use on very old or low‑power hardware, mostly out of curiosity.
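
For reference, common local inference servers (llama.cpp's llama-server, Ollama, LM Studio, vLLM) expose an OpenAI-compatible HTTP endpoint, which is what simple function-calling experiments and Home Assistant-style integrations typically talk to. A minimal sketch, assuming a llama-server-style endpoint on localhost:8080 and a placeholder model name:

    # Minimal query against a local OpenAI-compatible chat endpoint.
    # URL, port, and model name are placeholders for whatever your local server advertises.
    import requests

    resp = requests.post(
        "http://localhost:8080/v1/chat/completions",   # assumption: default llama-server port
        json={
            "model": "local-model",                    # placeholder model identifier
            "messages": [
                {"role": "system", "content": "You are a terse home-automation assistant."},
                {"role": "user", "content": "Should I turn on the fan if it's 29C inside?"},
            ],
            "max_tokens": 128,
        },
        timeout=60,
    )
    print(resp.json()["choices"][0]["message"]["content"])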

Software & Setup Issues

  • Installing CUDA from distro repositories vs Nvidia’s own installers is debated; newer toolkit versions can conflict with what ML libraries expect and are painful to manage.
  • Some users struggle with CUDA/cuDNN setup enough to give up; others rely on LLMs to walk them through Linux, drivers, and BIOS issues (a quick sanity‑check snippet follows).
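
A quick way to tell whether a CUDA install actually took, before fighting the toolkit further, is to check that the driver, runtime, and framework build agree. A minimal sketch, assuming a CUDA-enabled PyTorch wheel is installed (nvidia-smi covers the driver side separately):

    # Sanity check that the driver, CUDA runtime, and PyTorch build agree.
    import torch

    print("torch version:      ", torch.__version__)
    print("built against CUDA: ", torch.version.cuda)
    print("CUDA available:     ", torch.cuda.is_available())
    if torch.cuda.is_available():
        print("device:             ", torch.cuda.get_device_name(0))
        print("device VRAM (GB):   ",
              round(torch.cuda.get_device_properties(0).total_memory / 1e9, 1))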

Article Content and Audience

  • A few readers dislike sections that feel LLM‑generated or rehash generic PC‑building advice; they lose trust when content looks autogenerated.
  • Others defend the step‑by‑step build details as ideal for beginners (e.g. people who’ve never built a PC or used Linux), especially when methodology and AI assistance are disclosed.