Qwen3.5 122B and 35B models offer Sonnet 4.5 performance on local computers

Chinese vs non-Chinese models & trust

  • Some want to avoid Chinese models for geopolitical or regulatory reasons, especially when handling sensitive customer data, regardless of “open weights.”
  • Others argue that the provenance of weights matters less than where inference is hosted, and that openness makes Chinese models more trustworthy than US closed models.
  • There is concern that Chinese LLMs are aligned to government narratives on censored topics; others note US models also embed propaganda, just of a different flavor.
  • Several EU-based commenters say they trust China more than the US on foreign policy, highlighting how trust is highly contextual and political.

How close to Sonnet 4.5? Benchmarks vs real use

  • Many doubt the headline claim that Qwen3.5 (122B/35B) matches Claude Sonnet 4.5 overall.
  • Shared evals suggest performance roughly between Claude Haiku 4.5 and Sonnet 4.5, with some saying the title should have referenced Haiku instead.
  • Some report Qwen3.5-27B performing near Sonnet 4.0 on reasoning benchmarks; 397B variants are compared to older Opus versions, not current frontier models.
  • Multiple commenters argue benchmarks are heavily “benchmaxed” (benchmarks likely in training data), so real-world performance lags advertised scores.

Model behavior & reasoning quirks

  • Qwen3.5 often enters long, verbose “planning” or “thinking” loops (e.g., struggling with trivial “potato 100 times” requests) unless given strong system prompts and tuned sampling parameters.
  • Users note impressive persistence and tool-use capabilities for coding, but also brittle behavior and weird loops, especially under default settings or buggy runtimes.
  • Opinions on specific variants diverge: several praise the 27B dense model as the “best local-sized model”; some call the 35B A3B variant “fast but bad,” while others find it very effective.

Hardware, quantization & runtimes

  • Practical configurations reported include:
    • Single 24GB Nvidia cards (A5000/3090/4090/5090) running 27B/35B at Q4 with decent context and speed.
    • 96GB RTX 6000-class cards enabling larger models or longer context windows.
    • High-RAM Macs (M-series 32–128GB) using MLX/llama.cpp, though thermals and long tasks can cause severe slowdowns.
    • AMD GPUs via llama.cpp (HIP/Vulkan) and workstation Radeon AI PRO cards.
  • 4-bit quantization (especially Unsloth and other advanced schemes) is widely seen as the sweet spot for local use; Qwen3.5 is reported to be unusually tolerant of quantization.
  • Some note misleading marketing around “80GB VRAM is enough,” since full-precision GGUFs are enormous and require aggressive quantization.
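The quantization trade-off above comes down to simple arithmetic. The sketch below estimates weight sizes at common precisions; the bits-per-weight figures are rough assumptions (Q4 GGUF quants average roughly 4.5 bits/weight once scales are included), not exact numbers for any specific Qwen3.5 release:

```python
# Back-of-envelope estimate of LLM weight size at a given quantization.
# Bits-per-weight values are approximate assumptions, not official specs.

def weight_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in GB.

    Excludes KV cache, activations, and runtime overhead, which can
    add several more GB depending on context length.
    """
    bytes_total = params_billions * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9

for label, bits in [("FP16", 16.0), ("Q8", 8.5), ("Q4", 4.5)]:
    print(f"35B @ {label}: ~{weight_gb(35, bits):.0f} GB")
# FP16 ≈ 70 GB, Q8 ≈ 37 GB, Q4 ≈ 20 GB
```

This is why a 35B model at Q4 fits (barely) on a single 24GB card, while full-precision weights alone already blow past an 80GB budget before any context is allocated.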

Use cases: where local models work well vs not

  • Strongest use cases: narrow, well-specified coding tasks, tooling/agent backends, prompt expansion, translation, formatting, sentiment analysis, image captioning, and home/office automations.
  • Several report surprisingly good coding (e.g., full SPA calculators, custom PCA in Polars) on Qwen3.5 and related coder variants.
  • For deep research, ambiguous problem-solving, and complex agentic workflows, frontier cloud models (Claude Opus/Sonnet, Gemini, etc.) are still widely considered clearly superior.
  • Some teams must avoid cloud entirely; for them, rapid progress in open/self-hosted models is already practically valuable despite the gap to frontier models.

Tooling, runtimes & ecosystem issues

  • Popular stacks: llama.cpp, MLX, LM Studio, OpenCode, OpenWebUI, Swival, and various GGUF quants on Hugging Face and Unsloth.
  • Ollama’s Qwen3.5 integration is reported buggy (looping, mis-set parameters), so users are warned not to judge the model solely via Ollama.
  • Commenters emphasize inference is “knob-heavy”: temperature, top-p/k, min-p, penalties, templates, and runtime bugs can drastically affect apparent quality.
  • Several predict continued fast improvement; others insist that, today, no local/open model consistently matches the breadth and reliability of Sonnet 4.5 across varied tasks.
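The “knob-heavy” point above can be made concrete. Below is a sketch of an OpenAI-style chat request with the sampling parameters set explicitly, assuming a local llama.cpp `llama-server` or LM Studio endpoint; the model name and parameter values are illustrative starting points, not official Qwen3.5 recommendations:

```python
# Sketch: an OpenAI-compatible chat payload with explicit sampling knobs.
# Defaults vary by runtime, and leaving these unset is a common source
# of looping or degraded output. Values here are illustrative only.
import json

payload = {
    "model": "qwen3.5-35b-a3b",  # hypothetical local model name
    "messages": [
        {"role": "system", "content": "Answer directly; keep reasoning brief."},
        {"role": "user", "content": "Summarize this diff."},
    ],
    "temperature": 0.7,
    "top_p": 0.9,
    "top_k": 40,               # llama.cpp extension, ignored by some servers
    "min_p": 0.05,             # llama.cpp extension, ignored by some servers
    "repeat_penalty": 1.1,     # llama.cpp naming; OpenAI uses frequency_penalty
    "max_tokens": 512,
}

print(json.dumps(payload, indent=2))
```

Sent to a local endpoint (e.g., with `requests.post("http://localhost:8080/v1/chat/completions", json=payload)`), the same model can behave very differently depending on these values, which is why judging a model through one runtime's defaults is unreliable.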