Running local models is good now
Hardware & affordability
- Many commenters say “local is good now” holds only if you have serious hardware: 24–32 GB VRAM GPUs, Macs with 64–128 GB of RAM, or Strix Halo / DGX Spark–class boxes.
- Others push back: a $2,000–$3,000 machine is out of reach for a large part of the world, and for many devs a hosted model at $20–$200/month is cheaper and simpler.
- There’s debate over whether laptops (especially thermally constrained ones) are a sensible choice compared with desktops, workstations, or small on‑prem servers.
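The affordability argument is mostly arithmetic. A minimal sketch, assuming a one-off hardware price and a flat monthly subscription, and ignoring electricity, resale value, and any difference in model quality (none of which the thread quantifies):

```python
def break_even_months(hardware_usd: float, monthly_usd: float) -> float:
    """Months of subscription spend needed to equal a one-off hardware purchase."""
    if monthly_usd <= 0:
        raise ValueError("monthly_usd must be positive")
    return hardware_usd / monthly_usd


# Figures taken from the price ranges quoted in the thread.
for hardware in (2000, 3000):
    for monthly in (20, 200):
        months = break_even_months(hardware, monthly)
        print(f"${hardware} box vs ${monthly}/month: {months:.0f} months")
```

On these assumptions a $3,000 machine takes 150 months to match a $20/month plan but only 15 months to match a $200/month plan, which is roughly why heavy and light users land on opposite sides of the argument.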
Capabilities of current local models
- Qwen 3.6 27B (dense) and 35B MoE, Gemma 4 (26B MoE, 31B dense), GLM 5.x, GPT‑OSS, Granite 4.1, etc. are cited as “very capable,” roughly comparable to mid‑2025 frontier quality for many tasks.
- Users report 30–150+ tokens/s on commodity GPUs (e.g., 3090/4090/5090) and high‑RAM Macs, especially with MTP/speculative decoding and good quantization.
- Diffusion-style LLMs (e.g., DiffusionGemma) impress some for single‑prompt speed, but researchers in the thread say they likely don’t scale or match dense transformers in quality.
Coding & agent workflows
- Strong split in experience: some say Qwen 3.6 27B/35B are “good enough to daily‑drive” for coding; others find them far behind Claude/GPT for non‑toy codebases.
- Local models often struggle with:
  - Reliable tool calls and JSON output.
  - Large‑context workflows (hundreds of thousands of tokens).
  - Self‑directed, long‑horizon “vibe‑coding”; they get stuck, loop, or make subtle design errors.
- Many succeed by:
  - Using frontier models to plan and local models to execute smaller tasks.
  - Keeping prompts extremely specific and scoping work tightly.
  - Accepting them as “smart autocomplete / junior dev,” not autonomous architects.
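One common way to cope with unreliable tool calls and JSON output is to validate every reply and retry on failure. The sketch below is illustrative only: `generate` is a hypothetical callable wrapping whatever local model is in use, and the `{"tool": ..., "args": ...}` shape is an assumed convention, not one taken from the thread or from any specific harness.

```python
import json
from typing import Callable, Optional


def parse_tool_call(text: str) -> Optional[dict]:
    """Return the tool call if `text` is a JSON object with the expected keys, else None."""
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    # Assumed shape: a string tool name plus a dict of arguments.
    if not isinstance(data.get("tool"), str) or not isinstance(data.get("args"), dict):
        return None
    return data


def call_with_retries(generate: Callable[[str], str], prompt: str, max_attempts: int = 3) -> dict:
    """Ask the model until it emits a valid tool call or attempts run out."""
    current = prompt
    for _ in range(max_attempts):
        reply = generate(current)
        call = parse_tool_call(reply)
        if call is not None:
            return call
        # Feed the failure back so the next attempt can correct itself.
        current = (
            f"{prompt}\n\nYour previous reply was not a valid tool call:\n"
            f"{reply}\nReply with JSON only."
        )
    raise ValueError(f"no valid tool call after {max_attempts} attempts")


# Stub standing in for a model: the first reply is chatty, the second is valid JSON.
replies = iter(["Sure! Here you go", '{"tool": "read_file", "args": {"path": "main.py"}}'])
print(call_with_retries(lambda _prompt: next(replies), "Open main.py"))
```

Keeping the validator separate from the retry loop makes it easy to tighten the check (for example, against a per-tool schema) without touching the control flow.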
Harnesses, prompts & configuration
- Repeated theme: the harness matters as much as the model. Pi, OpenCode, Hermes, custom CLIs, and carefully tuned system prompts/AGENTS.md drastically change results.
- Quantization is contentious: 4‑bit can “lobotomize” models for tool use; some advocate 5–6 bit or Q8 where possible, especially for MoE.
- Many warn that local setups require significant tinkering with flags, quant schemes, context sizes, and templates; expectations should be set accordingly.
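Part of the quantization trade-off is simply whether the weights fit in memory. A rough sketch, assuming weight storage is about parameters × bits per weight / 8 bytes. It deliberately ignores the KV cache, activations, and runtime overhead, and real quant formats usually store somewhat more than their nominal bit width, so actual footprints are higher:

```python
def weight_memory_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (10^9 bytes), counting weights only."""
    total_bytes = params_billions * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


# Nominal bit widths discussed in the thread, applied to a 27B dense model.
for bits in (4, 5, 6, 8):
    print(f"27B dense at {bits}-bit: ~{weight_memory_gb(27, bits):.1f} GB")
```

By this estimate a 27B dense model needs about 13.5 GB for weights at 4-bit and 27 GB at 8-bit, so a 24 GB card can hold the former but not the latter before any context is allocated.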
Cloud vs local economics & control
- Pro‑local arguments:
  - No rate limits, rug‑pulls, silent “nerfs,” or model deprecations.
  - Better privacy and IP control; useful for regulated or paranoid environments.
  - Hardware may depreciate more slowly than people expect given current GPU scarcity.
- Pro‑cloud arguments:
  - Frontier models (Claude/GPT/DeepSeek) are still clearly smarter at complex coding and agentic work.
  - Hosted inference can be cheaper and faster for most users, with no hardware or maintenance burden.
  - Businesses often prefer to “outsource the headache,” even if on‑prem could be cheaper.
Future outlook
- Many expect open/local models to keep improving with better architectures, quantization‑aware training, and inference tricks, but also expect frontier vendors to stay ahead at scale.
- There’s interest in:
  - On‑prem GPU appliances for teams (e.g., “LLM in a closet”).
  - Hybrid patterns: local for routine/private/low‑latency work, cloud for heavy planning or huge contexts.
- Some see this as a pivotal moment: local is already “good enough” for a surprising share of everyday tasks, though not yet a full replacement for top cloud models.
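The hybrid pattern mentioned in the thread can be written down as a small routing rule. This is only a sketch: the `Task` fields, the 32,000-token threshold, and the `"local"` / `"cloud"` labels are all hypothetical choices for illustration, not something the thread specifies.

```python
from dataclasses import dataclass


@dataclass
class Task:
    prompt_tokens: int
    private: bool = False          # must not leave the machine
    needs_planning: bool = False   # open-ended design or multi-step planning


# Assumed limit for what the local model handles comfortably; tune per setup.
LOCAL_CONTEXT_LIMIT = 32_000


def route(task: Task) -> str:
    """Return "local" or "cloud" for a task, always keeping private work local."""
    if task.private:
        return "local"
    if task.needs_planning or task.prompt_tokens > LOCAL_CONTEXT_LIMIT:
        return "cloud"
    return "local"


print(route(Task(prompt_tokens=2_000)))
print(route(Task(prompt_tokens=200_000)))
print(route(Task(prompt_tokens=200_000, private=True)))
```

Checking privacy first means a private task stays local even when it is large or needs planning; a setup that prefers to reject such tasks instead would raise an error in that branch.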