A 10 year old Xeon is all you need
Performance of Gemma 4 on Old Xeons
- The reported generation speed was about 12 tokens/s under load, and the author estimates 20+ tokens/s on an unloaded system. The example prompt was very short, so the benchmark is not fully representative.
- Some commenters argue this is “reading-speed adequate” and usable for background or casual tasks; others say it’s far too slow for serious workloads, especially large prompts or bulk processing.
- Comparisons: GPUs can reach hundreds to thousands of tokens/s; even older GPUs (e.g., MI50, 1070, 2060, 1080 Ti) are reported to massively outperform CPUs for similar models when VRAM allows.
- Several note that prefill/prompt-processing speed and time-to-first-token are missing and crucial for judging usability.
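To see why the missing prefill figure matters, here is a rough back-of-the-envelope sketch. The 12 tokens/s decode speed is the figure reported in the discussion. The 50 tokens/s prefill speed, the prompt sizes, and the 300-token reply length are placeholder assumptions, since nobody in the thread measured them. The function name is hypothetical, and time-to-first-token is approximated as pure prompt-processing time.

```python
def response_time_s(prompt_tokens, output_tokens, prefill_tps, decode_tps):
    """Return (time_to_first_token, total_time) in seconds.

    Simplified model: the prompt is processed at prefill_tps, then the
    reply is generated at decode_tps. Real runs also pay model-load and
    sampling overhead, which are ignored here.
    """
    if prefill_tps <= 0 or decode_tps <= 0:
        raise ValueError("speeds must be positive")
    ttft = prompt_tokens / prefill_tps
    total = ttft + output_tokens / decode_tps
    return ttft, total


if __name__ == "__main__":
    DECODE_TPS = 12.0   # reported in the discussion (under load)
    PREFILL_TPS = 50.0  # ASSUMPTION: not measured anywhere in the thread
    for prompt_tokens in (50, 8000):  # short chat prompt vs. large document
        ttft, total = response_time_s(prompt_tokens, 300, PREFILL_TPS, DECODE_TPS)
        print(f"prompt={prompt_tokens:>5} tokens: "
              f"first token after {ttft:.0f}s, done after {total:.0f}s")
```

With a short prompt, decode speed dominates the wait. With a large prompt, the unknown prefill speed dominates, which is why commenters asked for it.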
Hardware Details and DDR3/DDR4 Confusion
- Multiple commenters point out that the specific Xeon model cited officially supports only DDR4; the DDR3 mention is likely an error or refers to unusual boards/CPUs that support both.
- Some link Intel ARK pages and mention rare OEM or AliExpress boards that run v3/v4 Xeons with DDR3, but not the exact CPU in the article.
- Others share success running LLMs on a variety of old Xeons (E5 v1–v4, Westmere, Sandy/Ivy E3s) and mixed DDR3/DDR4 setups, often with large RAM capacities (128–768 GB).
Energy Use, Cost, and Practicality
- Debate over whether reusing old servers is eco‑ and cost‑efficient:
  - Critics: old Xeons can idle at 150–250W+, are noisy in 1U/2U cases, and may cost more in electricity than a hosted LLM subscription.
  - Defenders: some chips in workstation cases draw closer to 80–90W under CPU-heavy load and are quiet, and the embodied energy of new hardware is nontrivial.
- Several note that performance-per-watt is much better on modern consumer CPUs/GPUs, but old gear is cheap or already sunk cost.
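The electricity argument comes down to simple arithmetic, sketched below. The wattages are the ones quoted by critics and defenders. The electricity price and the subscription price are illustrative assumptions, not figures from the discussion, so substitute your own rates. The function name is hypothetical.

```python
def monthly_electricity_cost(avg_watts, price_per_kwh, hours_per_day=24.0, days=30):
    """Cost of running a machine at a constant average draw for one month."""
    kwh = avg_watts / 1000.0 * hours_per_day * days
    return kwh * price_per_kwh


if __name__ == "__main__":
    PRICE_PER_KWH = 0.30         # ASSUMPTION: varies widely by region
    SUBSCRIPTION_PER_MONTH = 20  # ASSUMPTION: placeholder for a hosted plan

    # Wattages quoted in the discussion: 150-250 W idle for rack servers,
    # roughly 80-90 W under CPU-heavy load for some workstation builds.
    for label, watts in (("rack server, low", 150),
                         ("rack server, high", 250),
                         ("workstation", 85)):
        cost = monthly_electricity_cost(watts, PRICE_PER_KWH)
        verdict = "above" if cost > SUBSCRIPTION_PER_MONTH else "below"
        print(f"{label:>18}: {cost:6.2f}/month ({verdict} the assumed subscription)")
```

The result is sensitive to both assumed prices and to how many hours a day the machine actually runs, which is likely why the two camps reach opposite conclusions.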
Use Cases for Slow Local Models
- Suggested good fits: offline or privacy‑sensitive workloads (medical/legal, internal business tools), overnight document/log analysis, PDF extraction, background agents.
- Too slow for: real-time coding assistance, autocomplete, large-context interactive chat, or heavy image/vision tasks.
- Some are experimenting with hybrid setups: small local models for most tasks, occasional calls to cloud frontier models when stuck.
Speculative Decoding, MoE, and Optimizations
- Discussion of speculative decoding/MTP:
  - Clarifications that draft tokens are still verified by the main model; low acceptance thresholds affect speed, not correctness.
- For MoE models, flags like `--cpu-moe`, memory layout, and expert routing aim to reduce cache and memory bandwidth pressure.
- Several suggest using llama.cpp’s `llama-bench` or the fork’s `llama-sweep-bench` to report prefill and decode speeds consistently.
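The point about verification can be illustrated with a toy sketch of greedy speculative decoding. Both "models" below are made-up deterministic functions, not real LLMs, and the exact-match acceptance rule is the simplest variant. Real implementations verify all draft tokens in one batched pass of the main model and may use a probabilistic acceptance rule; none of the names here come from llama.cpp.

```python
from typing import Callable, List, Tuple

Model = Callable[[List[int]], int]  # maps a token context to the next token


def target_model(ctx: List[int]) -> int:
    """Toy stand-in for the slow main model (deterministic, greedy)."""
    return (sum(ctx) * 31 + len(ctx)) % 100


def draft_model(ctx: List[int]) -> int:
    """Toy stand-in for a cheap draft model that is wrong on some positions."""
    guess = target_model(ctx)
    return guess if len(ctx) % 3 else (guess + 1) % 100


def greedy_decode(model: Model, prompt: List[int], n: int) -> List[int]:
    out = list(prompt)
    for _ in range(n):
        out.append(model(out))
    return out[len(prompt):]


def speculative_decode(target: Model, draft: Model, prompt: List[int],
                       n: int, k: int = 4) -> Tuple[List[int], float]:
    """Return (generated tokens, fraction of draft tokens accepted)."""
    out = list(prompt)
    accepted = proposed = 0
    while len(out) - len(prompt) < n:
        # 1. The draft model proposes k tokens autoregressively.
        ctx, drafts = list(out), []
        for _ in range(k):
            tok = draft(ctx)
            drafts.append(tok)
            ctx.append(tok)
        proposed += k
        # 2. The main model checks each proposal in order.
        for tok in drafts:
            expected = target(out)
            if tok == expected:
                out.append(tok)
                accepted += 1
            else:
                out.append(expected)  # main model's token replaces the miss
                break
        else:
            out.append(target(out))  # all accepted: one extra token for free
    rate = accepted / proposed if proposed else 0.0
    return out[len(prompt):len(prompt) + n], rate


if __name__ == "__main__":
    prompt = [1, 2, 3]
    tokens, rate = speculative_decode(target_model, draft_model, prompt, 20)
    assert tokens == greedy_decode(target_model, prompt, 20)
    print(f"output identical to plain decoding; draft acceptance rate {rate:.0%}")
```

Every emitted token is one the main model would have produced itself, so a poor draft model lowers the acceptance rate and the speedup but leaves the output unchanged.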
Local vs Cloud AI and Market Trajectory
- Many see rapidly improving local models as eroding the moat of centralized AI providers; expect “good enough” open models to run on consumer or homelab hardware.
- Counterpoints:
  - Training and serving frontier models remain capital‑ and energy‑intensive.
  - Cloud still wins on sheer throughput and convenience for most users.
- Some expect advertising-driven business models and “enshittification” in cloud LLMs, increasing incentives for local use.
- Several predict a future with dedicated local AI boxes (akin to a NAS or home router) shared by a household or small business.
Other Observations
- Multiple complaints about the blog’s layout, color scheme, and nonstandard scrolling.
- Many anecdotes about cheap, powerful used workstations/servers and a revived interest in repurposing them for AI and homelab work.