A 10 year old Xeon is all you need

Performance of Gemma 4 on Old Xeons

  • Reported generation speed was around 12 tokens/s under load, with the author estimating ~20+ tokens/s on an unloaded system; the example prompt was very short, so the benchmark is not fully representative.
  • Some commenters argue this is “reading-speed adequate” and usable for background or casual tasks; others say it’s far too slow for serious workloads, especially large prompts or bulk processing.
  • Comparisons: GPUs can reach hundreds to thousands of tokens/s; even older GPUs (e.g., MI50, 1070, 2060, 1080 Ti) are reported to massively outperform CPUs for similar models when VRAM allows.
  • Several note that prefill/prompt-processing speed and time-to-first-token are missing and crucial for judging usability.
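
To make the point about prefill concrete, here is a rough sketch of how prompt-processing speed and decode speed combine into a wait time. The 12 tokens/s decode figure comes from the thread; the 50 tokens/s prefill speed, the prompt sizes, and the 300-token reply are assumptions chosen only for illustration, since no prefill numbers were reported.

```python
def estimate_latency(prompt_tokens, output_tokens, prefill_tps, decode_tps):
    """Return (approx_time_to_first_token_s, total_s) for one request.

    Time to first token is approximated as prefill time only.
    """
    if prefill_tps <= 0 or decode_tps <= 0:
        raise ValueError("speeds must be positive")
    ttft = prompt_tokens / prefill_tps
    total = ttft + output_tokens / decode_tps
    return ttft, total


if __name__ == "__main__":
    DECODE_TPS = 12.0   # reported in the thread (system under load)
    PREFILL_TPS = 50.0  # ASSUMPTION: not measured anywhere in the thread
    for prompt_tokens in (50, 8000):  # short chat prompt vs. long document
        ttft, total = estimate_latency(prompt_tokens, 300, PREFILL_TPS, DECODE_TPS)
        print(f"{prompt_tokens:>5} prompt tokens: "
              f"first token after ~{ttft:.0f}s, done after ~{total:.0f}s")
```

Under these assumed numbers the long prompt is dominated by prefill rather than generation, which is why a decode-only figure on a tiny prompt says little about usability with large contexts.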

Hardware Details and DDR3/DDR4 Confusion

  • Multiple commenters point out that the specific Xeon model cited officially supports DDR4 only; the DDR3 mention is likely an error or refers to unusual boards/CPUs that support both.
  • Some link Intel ARK pages and mention rare OEM or AliExpress boards that run v3/v4 Xeons with DDR3, but not the exact CPU in the article.
  • Others share success running LLMs on a variety of old Xeons (E5 v1–v4, Westmere, Sandy/Ivy E3s) and mixed DDR3/DDR4 setups, often with large RAM capacities (128–768 GB).

Energy Use, Cost, and Practicality

  • Debate over whether reusing old servers is eco‑ and cost‑efficient:
    • Critics: old Xeons can idle at 150–250W+, are noisy in 1U/2U cases, and may cost more in electricity than a hosted LLM subscription.
    • Defenders: some chips in workstation cases draw closer to 80–90W under CPU-heavy load and are quiet, and the embodied energy of new hardware is nontrivial.
  • Several note that performance-per-watt is much better on modern consumer CPUs/GPUs, but old gear is cheap or already sunk cost.
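
The electricity argument is simple arithmetic, sketched below. The wattages are the ones quoted in the thread (150–250W idle for rack servers, 80–90W under load for some workstations); the price per kWh and the subscription price are hypothetical placeholders, so substitute local values before drawing conclusions.

```python
def monthly_energy_cost(watts, price_per_kwh, hours_per_day=24.0, days=30):
    """Cost of running a machine at a constant power draw for `days` days."""
    kwh = watts / 1000.0 * hours_per_day * days
    return kwh * price_per_kwh


if __name__ == "__main__":
    PRICE_PER_KWH = 0.30  # ASSUMPTION: varies widely by region
    SUBSCRIPTION = 20.0   # ASSUMPTION: hypothetical monthly hosted-LLM price
    for label, watts in [("rack server, idle (low)", 150),
                         ("rack server, idle (high)", 250),
                         ("workstation, under load", 85)]:
        cost = monthly_energy_cost(watts, PRICE_PER_KWH)
        verdict = "above" if cost > SUBSCRIPTION else "below"
        print(f"{label}: {cost:.2f}/month, {verdict} the assumed subscription")
```

The comparison assumes the machine runs around the clock at one fixed draw; a box that is powered off when unused, or one that is already running for other homelab duties, changes the picture considerably.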

Use Cases for Slow Local Models

  • Suggested good fits: offline or privacy‑sensitive workloads (medical/legal, internal business tools), overnight document/log analysis, PDF extraction, background agents.
  • Too slow for: real-time coding assistance, autocomplete, large-context interactive chat, or heavy image/vision tasks.
  • Some are experimenting with hybrid setups: small local models for most tasks, occasional calls to cloud frontier models when stuck.

Speculative Decoding, MoE, and Optimizations

  • Discussion of speculative decoding/MTP (multi-token prediction):
    • Clarifications that draft tokens are still verified by the main model; low acceptance thresholds affect speed, not correctness.
  • For MoE models, flags like --cpu-moe, along with memory layout and expert routing choices, aim to reduce cache and memory-bandwidth pressure.
  • Several suggest using llama.cpp’s llama-bench or a fork’s llama-sweep-bench to report prefill and decode speeds consistently.
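
The claim that draft tokens affect speed but not correctness can be illustrated with a toy greedy version of speculative decoding. This is a minimal sketch, not how llama.cpp implements it: the "models" are hypothetical deterministic functions over integer tokens, verification is done token by token rather than in one batched pass, and sampling-based acceptance (used with non-greedy decoding) is left out.

```python
from typing import Callable, List, Tuple

NextToken = Callable[[List[int]], int]  # maps a token sequence to the next token


def greedy_decode(model: NextToken, prompt: List[int], n_new: int) -> List[int]:
    """Plain greedy decoding with a single model."""
    seq = list(prompt)
    for _ in range(n_new):
        seq.append(model(seq))
    return seq


def speculative_decode(target: NextToken, draft: NextToken, prompt: List[int],
                       n_new: int, k: int = 4) -> Tuple[List[int], float]:
    """Greedy speculative decoding; returns (sequence, draft acceptance rate)."""
    seq = list(prompt)
    goal = len(prompt) + n_new
    accepted = proposed = 0
    while len(seq) < goal:
        # 1) The cheap draft model proposes up to k tokens.
        ctx = list(seq)
        proposal = []
        for _ in range(min(k, goal - len(seq))):
            tok = draft(ctx)
            proposal.append(tok)
            ctx.append(tok)
        proposed += len(proposal)
        # 2) The target model checks each proposed token in order.
        for tok in proposal:
            expected = target(seq)
            if tok == expected:
                seq.append(tok)       # accepted: a token the target agrees with
                accepted += 1
            else:
                seq.append(expected)  # rejected: use the target's own token
                break                 # discard the rest of the proposal
    return seq, (accepted / proposed if proposed else 0.0)


def toy_target(seq: List[int]) -> int:
    """Hypothetical 'main model': a fixed rule over the last token."""
    return (seq[-1] * 3 + 1) % 7


def bad_draft(seq: List[int]) -> int:
    """Hypothetical poor draft model: always guesses token 0."""
    return 0


if __name__ == "__main__":
    prompt = [1]
    reference = greedy_decode(toy_target, prompt, 10)
    good_out, good_rate = speculative_decode(toy_target, toy_target, prompt, 10)
    bad_out, bad_rate = speculative_decode(toy_target, bad_draft, prompt, 10)
    print("outputs identical:", good_out == reference == bad_out)
    print("acceptance rates:", good_rate, bad_rate)
```

Both runs produce the same sequence as the target model alone; the poor draft only lowers the acceptance rate, which in a real engine means fewer tokens gained per verification pass and therefore less speedup.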

Local vs Cloud AI and Market Trajectory

  • Many see rapidly improving local models as eroding the moat of centralized AI providers and expect “good enough” open models to run on consumer or homelab hardware.
  • Counterpoints:
    • Training and serving frontier models remain capital‑ and energy‑intensive.
    • Cloud still wins on sheer throughput and convenience for most users.
    • Some expect advertising-driven business models and “enshittification” in cloud LLMs, increasing incentives for local use.
  • Several predict a future with dedicated local AI boxes (akin to a NAS or home router) shared by a household or small business.

Other Observations

  • Multiple complaints about the blog’s layout, color scheme, and nonstandard scrolling.
  • Many anecdotes about cheap, powerful used workstations/servers and a revived interest in repurposing them for AI and homelab work.