Qwen3.5 122B and 35B models offer Sonnet 4.5 performance on local computers

Chinese vs non-Chinese models & trust

  • Some want to avoid Chinese models for geopolitical or regulatory reasons, especially when handling sensitive customer data, regardless of “open weights.”
  • Others argue that the provenance of weights matters less than where inference is hosted, and that openness makes Chinese models more trustworthy than US closed models.
  • There is concern that Chinese LLMs are aligned to government narratives on censored topics; others note US models also embed propaganda, just of a different flavor.
  • Several EU-based commenters say they trust China more than the US on foreign policy, highlighting how trust is highly contextual and political.

How close to Sonnet 4.5? Benchmarks vs real use

  • Many doubt the headline claim that Qwen3.5 (122B/35B) matches Claude Sonnet 4.5 overall.
  • Shared evals suggest performance roughly between Claude Haiku 4.5 and Sonnet 4.5, with some saying the title should have referenced Haiku instead.
  • Some report Qwen3.5-27B performing near Sonnet 4.0 on reasoning benchmarks; 397B variants are compared to older Opus versions, not current frontier models.
  • Multiple commenters argue benchmarks are heavily “benchmaxed” (benchmarks likely in training data), so real-world performance lags advertised scores.

Model behavior & reasoning quirks

  • Qwen3.5 often enters long, verbose “planning” or “thinking” loops (e.g., struggling with trivial “potato 100 times” requests) unless given strong system prompts and tuned sampling parameters.
  • Users note impressive persistence and tool-use capabilities for coding, but also brittle behavior and weird loops, especially under default settings or buggy runtimes.
  • Opinions on specific variants diverge: several praise the 27B dense model as the “best local-sized model”; some call the 35B A3B variant “fast but bad,” while others find it very effective.

Hardware, quantization & runtimes

  • Practical configurations reported include:
    • Single 24GB Nvidia cards (A5000/3090/4090/5090) running 27B/35B at Q4 with decent context and speed.
    • 96GB RTX 6000-class cards enabling larger models or longer context windows.
    • High-RAM Macs (M-series 32–128GB) using MLX/llama.cpp, though thermals and long tasks can cause severe slowdowns.
    • AMD GPUs via llama.cpp (HIP/Vulkan) and workstation Radeon AI PRO cards.
  • 4-bit quantization (especially Unsloth and other advanced schemes) is widely seen as the sweet spot for local use; Qwen3.5 is reported to be unusually tolerant of quantization.
  • Some note misleading marketing around “80GB VRAM is enough,” since full-precision GGUFs are enormous and require aggressive quantization.
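The quantization trade-off above comes down to simple arithmetic. The sketch below estimates weight sizes at common precisions; the bits-per-weight figures are rough assumptions (Q4 GGUF quants average roughly 4.5 bits/weight once scales are included), not exact numbers for any specific Qwen3.5 release:

```python
# Back-of-envelope estimate of LLM weight size at a given quantization.
# Bits-per-weight values are approximate assumptions, not official specs.

def weight_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in GB.

    Excludes KV cache, activations, and runtime overhead, which can
    add several more GB depending on context length.
    """
    bytes_total = params_billions * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9

for label, bits in [("FP16", 16.0), ("Q8", 8.5), ("Q4", 4.5)]:
    print(f"35B @ {label}: ~{weight_gb(35, bits):.0f} GB")
# FP16 ≈ 70 GB, Q8 ≈ 37 GB, Q4 ≈ 20 GB
```

This is why a 35B model at Q4 fits (barely) on a single 24GB card, while full-precision weights alone already blow past an 80GB budget before any context is allocated.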

Use cases: where local models work well vs not

  • Strongest use cases: narrow, well-specified coding tasks, tooling/agent backends, prompt expansion, translation, formatting, sentiment analysis, image captioning, and home/office automations.
  • Several report surprisingly good coding (e.g., full SPA calculators, custom PCA in Polars) on Qwen3.5 and related coder variants.
  • For deep research, ambiguous problem-solving, and complex agentic workflows, frontier cloud models (Claude Opus/Sonnet, Gemini, etc.) are still widely considered clearly superior.
  • Some teams must avoid cloud entirely; for them, rapid progress in open/self-hosted models is already practically valuable despite the gap to frontier models.

Tooling, runtimes & ecosystem issues

  • Popular stacks: llama.cpp, MLX, LM Studio, OpenCode, OpenWebUI, Swival, and various GGUF quants on Hugging Face and Unsloth.
  • Ollama’s Qwen3.5 integration is reported buggy (looping, mis-set parameters), so users are warned not to judge the model solely via Ollama.
  • Commenters emphasize inference is “knob-heavy”: temperature, top-p/k, min-p, penalties, templates, and runtime bugs can drastically affect apparent quality.
  • Several predict continued fast improvement; others insist that, today, no local/open model consistently matches the breadth and reliability of Sonnet 4.5 across varied tasks.
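The “knob-heavy” point above can be made concrete. Below is a sketch of an OpenAI-style chat request with the sampling parameters set explicitly, assuming a local llama.cpp `llama-server` or LM Studio endpoint; the model name and parameter values are illustrative starting points, not official Qwen3.5 recommendations:

```python
# Sketch: an OpenAI-compatible chat payload with explicit sampling knobs.
# Defaults vary by runtime, and leaving these unset is a common source
# of looping or degraded output. Values here are illustrative only.
import json

payload = {
    "model": "qwen3.5-35b-a3b",  # hypothetical local model name
    "messages": [
        {"role": "system", "content": "Answer directly; keep reasoning brief."},
        {"role": "user", "content": "Summarize this diff."},
    ],
    "temperature": 0.7,
    "top_p": 0.9,
    "top_k": 40,               # llama.cpp extension, ignored by some servers
    "min_p": 0.05,             # llama.cpp extension, ignored by some servers
    "repeat_penalty": 1.1,     # llama.cpp naming; OpenAI uses frequency_penalty
    "max_tokens": 512,
}

print(json.dumps(payload, indent=2))
```

Sent to a local endpoint (e.g., with `requests.post("http://localhost:8080/v1/chat/completions", json=payload)`), the same model can behave very differently depending on these values, which is why judging a model through one runtime's defaults is unreliable.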