How to Run Qwen3.5 Locally

Capabilities and Use Cases

  • Many users find Qwen3.5 surprisingly strong for local coding: multi-file edits, building small Rust apps, HTML/CSS work, and “rubber duck” explanations of code and compile errors.
  • 9B and 4B/0.8B variants are used for OCR, text cleanup, simple coding, translation, and log triage, but are considered weak for complex coding or “agentic” multi-step tasks.
  • Qwen models are also used for structured text/image/video analysis (JSON output, NER, categorization) and financial data extraction from emails.
  • Several commenters note that Qwen3.5 is very confident and often sycophantic; system-prompt personas (e.g., “brief rude senior”, “emotionless Vulcan”) help tone it down.
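For the structured-output workflows mentioned above (JSON extraction, NER, categorization), local models often wrap the JSON in prose or a Markdown fence rather than returning it bare. A tolerant extractor is a common workaround; this is a minimal sketch, not part of any official Qwen tooling, and the helper names are illustrative:

```python
import json
import re

def _outer_braces(s: str):
    """Return the outermost {...} span of s, or None if there isn't one."""
    start, end = s.find("{"), s.rfind("}")
    return s[start:end + 1] if 0 <= start < end else None

def extract_json(text: str):
    """Pull the first JSON object out of a model reply that may wrap it
    in prose or a ```json fence. Returns None if nothing parses."""
    # Strip a Markdown code fence if present.
    fenced = re.search(r"```(?:json)?\s*(.*?)```", text, re.DOTALL)
    candidate = fenced.group(1) if fenced else text
    # Try the whole candidate first, then just the outermost braces.
    for attempt in (candidate, _outer_braces(candidate)):
        if attempt is None:
            continue
        try:
            return json.loads(attempt)
        except json.JSONDecodeError:
            pass
    return None
```

For example, `extract_json('Sure:\n```json\n{"vendor": "ACME", "amount": 120.5}\n```')` returns the parsed dict, and a reply with no JSON at all returns `None` instead of raising.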

Performance, Hardware, and Context

  • Reports span from tiny SBCs and phones (0.8B/2B CPU-only) to RTX 30/40/50-series, Apple M1–M5, and A100/H100.
  • 9B often reaches ~60–100 tok/s on mid/high-end GPUs; 35B-A3B can run at ~14–25 tok/s on consumer cards with partial offload.
  • 27B and 35B-A3B are widely seen as the sweet spot for “serious” local coding, with 27B sometimes stronger on benchmarks, 35B-A3B faster due to MoE.
  • Long contexts (100k–256k) are possible, but some report quality degradation and instruction drift over long sessions, attributed in part to sliding-window attention.
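The memory cost of those long contexts is dominated by the KV cache, which you can estimate as two tensors (K and V) per layer, each ctx_len × kv_heads × head_dim elements. A rough calculator, with the caveat that the architecture numbers below are a hypothetical 9B-class config for illustration, not Qwen3.5’s actual layout (and sliding-window layers cap their cache at the window size rather than the full context):

```python
def kv_cache_gib(n_layers: int, n_kv_heads: int, head_dim: int,
                 ctx_len: int, bytes_per_elem: float = 2.0) -> float:
    """Approximate KV-cache size in GiB: 2 tensors (K and V) per layer,
    each ctx_len x n_kv_heads x head_dim elements of bytes_per_elem."""
    total = 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elem
    return total / 2**30

# Hypothetical 9B-class config: 48 layers, 8 KV heads (GQA), head_dim 128.
# A 128k-token context at fp16 then needs about:
print(kv_cache_gib(48, 8, 128, 131072))  # → 24.0
```

Halving `bytes_per_elem` to 1.0 models the common trick of quantizing the KV cache to 8-bit, which is one way people fit 100k+ contexts on consumer cards.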

Quantization and Model Choice

  • Frequent discussion of Q4_K_M / UD-Q4_K_XL vs other 4-bit schemes; Q4_0 and Q4_1 are described as faster but notably less accurate.
  • Rule of thumb in the thread: more parameters at lower bit-depth usually beats fewer parameters at higher bit-depth, down to ~3–4 bits.
  • Users recommend experimenting: 27B at 4-bit or 35B-A3B at 3-bit for 16GB+ VRAM; 9B or 4B for very constrained setups.
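The params-versus-bits rule of thumb is easy to sanity-check with arithmetic: weight memory is roughly parameters × bits-per-weight / 8, plus some overhead. The effective bits-per-weight figures below (e.g. ~4.5 for a Q4_K-style quant) and the 10% overhead factor are rough assumptions; real GGUF files vary by scheme:

```python
def weight_gib(params_b: float, bits: float, overhead: float = 1.1) -> float:
    """Rough weight-memory estimate in GiB for a model of params_b billion
    parameters at `bits` effective bits per weight; `overhead` loosely
    covers embeddings and tensors kept at higher precision."""
    return params_b * 1e9 * (bits / 8) * overhead / 2**30

for name, p, b in [("27B @ ~4.5 bpw", 27, 4.5),
                   ("9B @ ~8.5 bpw", 9, 8.5),
                   ("35B-A3B @ ~3.5 bpw", 35, 3.5)]:
    print(f"{name}: ~{weight_gib(p, b):.1f} GiB")
```

This is why 27B at 4-bit (~15–16 GiB) lands near the 16GB-VRAM boundary, and why dropping a 35B to ~3.5 bits brings it into a similar range, at some accuracy cost.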

Thinking Mode and Reliability

  • “Thinking” mode can run indefinitely or add large latency; multiple users disable or heavily constrain it.
  • Some note improved reasoning and code review with thinking enabled; others find loops, repetition, or crashes in certain orchestrators.
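For orchestrators that choke on the reasoning output, a common pattern is to strip it client-side. This sketch assumes Qwen3.5 marks reasoning with `<think>...</think>` tags as Qwen3 does; it also handles the runaway case where generation is cut off before `</think>` arrives:

```python
import re

THINK_RE = re.compile(r"<think>.*?</think>\s*", re.DOTALL)

def strip_thinking(reply: str) -> str:
    """Drop <think>...</think> reasoning blocks, keeping only the final
    answer. If the model was cut off mid-thought (no closing tag),
    discard everything from <think> onward."""
    cleaned = THINK_RE.sub("", reply)
    if "<think>" in cleaned:
        cleaned = cleaned.split("<think>", 1)[0]
    return cleaned.strip()
```

So `strip_thinking("<think>2+2... carry the...</think>The answer is 4.")` yields just the answer, and an unterminated thinking block yields an empty string your pipeline can treat as a failed generation and retry, ideally with thinking disabled.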

Comparison to Frontier and Overall Sentiment

  • Enthusiasts claim 27B/35B approximate older Claude/GPT tiers or at least Haiku-level for many coding tasks, and can save substantial API costs.
  • Skeptics report that even 122B/397B lag behind top proprietary models (e.g., Claude Opus/Sonnet) on hard coding and mathematical/physical reasoning, and still hallucinate.
  • Consensus: Qwen3.5 is a major step for local models, excellent for many targeted workflows, but not yet a full replacement for state-of-the-art hosted models—especially for complex, long-horizon “agentic” coding.