Running local models is good now

Hardware & affordability

  • Many commenters say “local is good now” holds only if you have serious hardware: GPUs with 24–32 GB of VRAM, Macs with 64–128 GB of RAM, or Strix Halo / DGX Spark–class boxes.
  • Others push back: a $2,000–$3,000 machine is out of reach for a large part of the world, and for many devs a hosted model at $20–$200/month is cheaper and simpler.
  • There’s debate over whether laptops (especially thermally constrained ones) are a sensible choice compared with desktops, workstations, or small on‑prem servers.
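
The affordability argument above comes down to break-even arithmetic. A minimal sketch, assuming a one-time hardware cost set against a flat monthly hosted plan: the function name and the electricity parameter are illustrative assumptions, not figures from the thread, and the sketch ignores resale value, maintenance time, and any quality gap between local and hosted models.

```python
def breakeven_months(hardware_usd: float,
                     hosted_usd_per_month: float,
                     power_usd_per_month: float = 0.0) -> float:
    """Months of hosted fees needed to equal the hardware purchase price.

    power_usd_per_month is a hypothetical running cost for the local box.
    Returns infinity when running locally saves nothing per month.
    """
    monthly_saving = hosted_usd_per_month - power_usd_per_month
    if monthly_saving <= 0:
        return float("inf")
    return hardware_usd / monthly_saving


# A $2,500 machine against a $100/month plan pays for itself in 25 months.
print(breakeven_months(2500, 100))      # 25.0
# Against a $20/month plan with $25/month of electricity, it never does.
print(breakeven_months(2500, 20, 25))   # inf
```

With the price ranges quoted in the thread, the answer swings from roughly a year to "never", which is consistent with commenters disagreeing on this point.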

Capabilities of current local models

  • Qwen 3.6 27B (dense) and 35B MoE, Gemma 4 (26B MoE, 31B dense), GLM 5.x, GPT‑OSS, Granite 4.1, etc. are cited as “very capable,” roughly comparable to mid‑2025 frontier quality for many tasks.
  • Users report 30–150+ tokens/s on consumer GPUs (e.g., 3090/4090/5090) and high‑RAM Macs, especially with MTP (multi‑token prediction) or speculative decoding, plus good quantization.
  • Diffusion-style LLMs (e.g., DiffusionGemma) impress some for single‑prompt speed, but researchers in the thread say they likely don’t scale or match dense transformers in quality.

Coding & agent workflows

  • Strong split in experience: some say Qwen 3.6 27B/35B are “good enough to daily‑drive” for coding; others find them far behind Claude/GPT for non‑toy codebases.
  • Local models often struggle with:
    • Reliable tool calls and JSON output.
    • Large‑context workflows (hundreds of thousands of tokens).
    • Self‑directed, long‑horizon “vibe‑coding”; they get stuck, loop, or make subtle design errors.
  • Many succeed by:
    • Using frontier models to plan and local models to execute smaller tasks.
    • Keeping prompts extremely specific and scoping work tightly.
    • Accepting them as “smart autocomplete / junior dev,” not autonomous architects.
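
One common mitigation for unreliable tool calls and JSON output is to validate the model's reply in the harness and retry on failure. A minimal sketch, assuming the model is asked to reply with a JSON object of the form {"tool": ..., "arguments": {...}}: that schema, the function names, and the `generate` callable are all hypothetical stand-ins, not the API of any harness named in the thread.

```python
import json
from typing import Callable, Optional


def parse_tool_call(raw: str, allowed_tools: set) -> Optional[dict]:
    """Return the parsed call if it is well-formed, otherwise None."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(call, dict):
        return None
    tool = call.get("tool")
    if not isinstance(tool, str) or tool not in allowed_tools:
        return None
    if not isinstance(call.get("arguments"), dict):
        return None
    return call


def call_with_retries(generate: Callable[[str], str],
                      prompt: str,
                      allowed_tools: set,
                      max_attempts: int = 3) -> Optional[dict]:
    """Ask the model up to max_attempts times for a valid tool call."""
    for _ in range(max_attempts):
        call = parse_tool_call(generate(prompt), allowed_tools)
        if call is not None:
            return call
    return None


# Stand-in for a local model that fails once, then answers correctly.
replies = iter(['Sure! I will read the file.',
                '{"tool": "read_file", "arguments": {"path": "main.py"}}'])
result = call_with_retries(lambda _prompt: next(replies),
                           "Open main.py", {"read_file", "write_file"})
print(result)
```

A check like this only catches malformed output; it does nothing for the subtle design errors described above, which still need human or frontier-model review.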

Harnesses, prompts & configuration

  • A repeated theme: the harness matters as much as the model. Pi, OpenCode, Hermes, custom CLIs, and carefully tuned system prompts/AGENTS.md drastically change results.
  • Quantization is contentious: 4‑bit quantization can “lobotomize” models for tool use; some advocate 5–6‑bit or 8‑bit (Q8) quantization where possible, especially for MoE.
  • Many warn that local setups require significant tinkering with flags, quant schemes, context sizes, and templates; expectations should be set accordingly.
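
The quantization trade-off is largely a memory budget question. A rough sketch of the arithmetic, assuming weight memory is simply parameter count times bits per weight: real quantization formats also store scales and other metadata, and the KV cache and runtime overhead come on top, so treat these numbers as lower bounds. The function name is an illustrative assumption.

```python
def weight_memory_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate memory for the weights alone, in GB (10^9 bytes)."""
    total_bytes = params_billions * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


# A 27B dense model at the bit widths discussed in the thread.
for bits in (4, 6, 8):
    print(f"27B at {bits}-bit: {weight_memory_gb(27, bits):.2f} GB")
# 27B at 4-bit: 13.50 GB
# 27B at 6-bit: 20.25 GB
# 27B at 8-bit: 27.00 GB
```

Under these assumptions, a 24 GB card holds a 27B model's weights at 4-bit with room for context, is tight at 6-bit, and cannot hold them at 8-bit, which is one reason the choice of quantization is contested.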

Cloud vs local economics & control

  • Pro‑local arguments:
    • No rate limits, rug‑pulls, silent “nerfs,” or model deprecations.
    • Better privacy and IP control; useful for regulated or paranoid environments.
    • Hardware may depreciate more slowly than people expect given current GPU scarcity.
  • Pro‑cloud arguments:
    • Frontier models (Claude/GPT/DeepSeek) are still clearly smarter at complex coding and agentic work.
    • Hosted inference can be cheaper and faster for most users, with no hardware or maintenance burden.
    • Businesses often prefer to “outsource the headache,” even if on‑prem could be cheaper.

Future outlook

  • Many expect open/local models to keep improving with better architectures, quantization‑aware training, and inference tricks, but also expect frontier vendors to stay ahead at scale.
  • There’s interest in:
    • On‑prem GPU appliances for teams (e.g., “LLM in a closet”).
    • Hybrid patterns: local for routine/private/low‑latency work, cloud for heavy planning or huge contexts.
  • Some see this as a pivotal moment: local is already “good enough” for a surprising share of everyday tasks, though not yet a full replacement for top cloud models.
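
The hybrid pattern mentioned above can be sketched as a small routing rule. This is an illustration under stated assumptions, not a design anyone in the thread proposed: the `Task` fields, the 32,000-token local context limit, and the choice to keep private work local even when it is large are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Task:
    prompt_tokens: int
    is_private: bool = False      # must not leave the machine
    needs_planning: bool = False  # heavy, long-horizon reasoning


def route(task: Task, local_context_limit: int = 32_000) -> str:
    """Return "local" or "cloud" for a task."""
    # Privacy wins over everything else in this sketch.
    if task.is_private:
        return "local"
    # Heavy planning or huge contexts go to the hosted model.
    if task.needs_planning or task.prompt_tokens > local_context_limit:
        return "cloud"
    # Routine, small, non-private work stays local.
    return "local"


print(route(Task(prompt_tokens=2_000)))                        # local
print(route(Task(prompt_tokens=200_000)))                      # cloud
print(route(Task(prompt_tokens=5_000, needs_planning=True)))   # cloud
print(route(Task(prompt_tokens=200_000, is_private=True)))     # local
```

The last case shows the cost of the privacy-first rule: a private task that exceeds the local context limit still stays local and would need to be split or summarized first.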