2026-06-16

Running local models is good now

Hardware & affordability

Many commenters say “local is good now” holds only if you have serious hardware: GPUs with 24–32 GB of VRAM, Macs with 64–128 GB of RAM, or Strix Halo / DGX Spark–class boxes.
Others push back: a $2,000–$3,000 machine is out of reach for much of the world, and for many devs a hosted model at $20–$200/month is cheaper and simpler.
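The affordability argument comes down to break-even arithmetic. A minimal sketch, using only the price ranges quoted in the thread; the function name and the optional `monthly_power_cost` parameter are illustrative assumptions, and the calculation ignores resale value and any quality gap between local and hosted models:

```python
def break_even_months(hardware_cost: float,
                      monthly_subscription: float,
                      monthly_power_cost: float = 0.0) -> float:
    """Months until a one-off hardware purchase undercuts a subscription.

    monthly_power_cost is a hypothetical running cost for the local box.
    """
    monthly_saving = monthly_subscription - monthly_power_cost
    if monthly_saving <= 0:
        # Running the box costs at least as much as the plan: no break-even.
        return float("inf")
    return hardware_cost / monthly_saving


if __name__ == "__main__":
    # Price points taken from the thread: $2,000-$3,000 hardware,
    # $20-$200/month hosted plans.
    for price in (2000, 3000):
        for plan in (20, 200):
            months = break_even_months(price, plan)
            print(f"${price} machine vs ${plan}/month: {months:.0f} months")
```

Against a $20/month plan the hardware takes 100 to 150 months to pay for itself; against a $200/month plan it takes 10 to 15 months, which is roughly why both sides of the thread can claim the numbers favor them.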
There’s debate over whether laptops (especially thermally constrained ones) are a good idea compared with desktops, workstations, or small on‑prem servers.

Capabilities of current local models

Qwen 3.6 27B (dense) and 35B MoE, Gemma 4 (26B MoE, 31B dense), GLM 5.x, GPT‑OSS, Granite 4.1, etc. are cited as “very capable,” roughly comparable to mid‑2025 frontier quality for many tasks.
Users report 30–150+ tokens/s on commodity GPUs (e.g., 3090/4090/5090) and high‑RAM Macs, especially with MTP/speculative decoding and good quantization.
Diffusion-style LLMs (e.g., DiffusionGemma) impress some for single‑prompt speed, but researchers in the thread say they likely don’t scale or match dense transformers in quality.

Coding & agent workflows

Strong split in experience: some say Qwen 3.6 27B/35B are “good enough to daily‑drive” for coding; others find them far behind Claude/GPT for non‑toy codebases.
Local models often struggle with:
- Reliable tool calls and JSON output.
- Large‑context workflows (hundreds of thousands of tokens).
- Self‑directed, long‑horizon “vibe‑coding”; they get stuck, loop, or make subtle design errors.
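One common mitigation for unreliable tool calls is to validate the model's output and retry on failure. The sketch below is an assumption, not something described in the thread: the `tool`/`arguments` schema, the function names, and the `generate` callable (standing in for whatever local inference client is in use) are all hypothetical.

```python
import json
from typing import Callable, Optional

# Hypothetical schema: a tool call is a JSON object with these two keys.
REQUIRED_KEYS = {"tool", "arguments"}


def parse_tool_call(raw: str) -> Optional[dict]:
    """Return the parsed call if it matches the expected shape, else None."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(call, dict) or not REQUIRED_KEYS.issubset(call):
        return None
    if not isinstance(call["arguments"], dict):
        return None
    return call


def call_with_retries(generate: Callable[[str], str],
                      prompt: str,
                      max_attempts: int = 3) -> dict:
    """Ask the model for a tool call, re-prompting when the reply is malformed."""
    for _ in range(max_attempts):
        call = parse_tool_call(generate(prompt))
        if call is not None:
            return call
        # Append a correction so the next attempt sees what went wrong.
        prompt += ("\nYour last reply was not a JSON object with keys "
                   "'tool' and 'arguments'. Reply with JSON only.")
    raise ValueError(f"no valid tool call after {max_attempts} attempts")
```

Passing `generate` in as a plain callable keeps the retry logic independent of any particular harness or server.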
Many succeed by:
- Using frontier models to plan and local models to execute smaller tasks.
- Keeping prompts extremely specific and scoping work tightly.
- Accepting them as “smart autocomplete / junior dev,” not autonomous architects.

Harnesses, prompts & configuration

Repeated theme: harness matters as much as the model. Pi, OpenCode, Hermes, custom CLIs, and carefully tuned system prompts/AGENTS.md drastically change results.
Quantization is contentious: 4‑bit can “lobotomize” models for tool use; some advocate 5–6 bit or Q8 where possible, especially for MoE.
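Part of why the bit-width debate matters is simple memory arithmetic: weight storage scales linearly with bits per weight. A rough sketch, assuming nominal bit widths and counting weights only; real quantization formats usually spend somewhat more than their nominal bits per weight, and the KV cache and activations need additional memory on top:

```python
def weight_memory_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (1 GB = 10**9 bytes).

    Counts weights only; KV cache, activations, and per-format
    overhead are deliberately left out of this estimate.
    """
    # params (in billions) * bits / 8 bits-per-byte = gigabytes
    return params_billions * bits_per_weight / 8


if __name__ == "__main__":
    # 27B is the dense model size mentioned in the thread.
    for bits in (4, 5, 6, 8):
        print(f"27B at {bits}-bit: ~{weight_memory_gb(27, bits):.1f} GB")
```

By this estimate a 27B dense model needs about 13.5 GB at 4-bit and 27 GB at 8-bit for weights alone, so the higher-precision quants some commenters prefer leave little or no headroom for context on a 24 GB card.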
Many warn that local setups require significant tinkering with flags, quant schemes, context sizes, and templates; expectations should be set accordingly.

Cloud vs local economics & control

Pro‑local arguments:
- No rate limits, rug‑pulls, silent “nerfs,” or model deprecations.
- Better privacy and IP control; useful for regulated or paranoid environments.
- Hardware may depreciate more slowly than people expect given current GPU scarcity.
Pro‑cloud arguments:
- Frontier models (Claude/GPT/DeepSeek) still clearly smarter at complex coding and agentic work.
- Hosted inference can be cheaper and faster for most users, with no hardware or maintenance burden.
- Businesses often prefer to “outsource the headache,” even if on‑prem could be cheaper.

Future outlook

Many expect open/local models to keep improving with better architectures, quantization‑aware training, and inference tricks, but also expect frontier vendors to stay ahead at scale.
There’s interest in:
- On‑prem GPU appliances for teams (e.g., “LLM in a closet”).
- Hybrid patterns: local for routine/private/low‑latency work, cloud for heavy planning or huge contexts.
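The hybrid pattern can be sketched as a small routing rule. This is an illustration only: the `Task` fields, the `LOCAL_CONTEXT_LIMIT` threshold, and the choice to let privacy override every other consideration are assumptions rather than anything specified in the thread.

```python
from dataclasses import dataclass

# Hypothetical cutoff for what the local model handles comfortably.
LOCAL_CONTEXT_LIMIT = 32_000


@dataclass
class Task:
    prompt_tokens: int
    private: bool = False         # must not leave the machine
    needs_planning: bool = False  # heavy multi-step design work


def route(task: Task) -> str:
    """Pick "local" or "cloud" for a task."""
    if task.private:
        # Privacy wins even when the task is large; it may need splitting.
        return "local"
    if task.needs_planning or task.prompt_tokens > LOCAL_CONTEXT_LIMIT:
        return "cloud"
    # Routine, small, non-sensitive work stays local.
    return "local"
```

For example, `route(Task(prompt_tokens=200_000))` returns `"cloud"`, while the same task marked `private=True` stays `"local"`.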
Some see this as a pivotal moment: local is already “good enough” for a surprising share of everyday tasks, though not yet a full replacement for top cloud models.
