2026-06-13

RTX 5080 and RTX 3090 Setup: 80 Tok/s on Qwen 3.6 27B Q8

Performance & Optimization

Original post: an RTX 5080 + RTX 3090 pair reaches ~80 tok/s on Qwen 3.6 27B Q8 with MTP (multi‑token prediction). Other commenters report:
- 2×4090: ~90 tok/s (27B Q8) and ~260 tok/s (35B A3B Q8).
- Single 3090: ~120 tok/s on Qwen 3.6 35B A3B with MTP via LM Studio.
- MacBook M5 Max (MLX): ~20 tok/s on 27B Q8.
- Ryzen AI Max 395+: ~28 tok/s (27B), ~60 tok/s (35B).
Some commenters want more theory: optimal tensor/weight splits, driver issues, and how much memory bandwidth is actually used.
One commenter notes that a back‑of‑envelope memory bandwidth calculation for the 3090 suggests significant headroom even at ~30 tok/s.
Commenters share recommended Qwen 3.6 sampling settings and MTP/n‑gram speculative decoding settings, and criticize the article’s settings as suboptimal.
There is also discussion of combining MTP with n‑gram speculative decoding: whether the order of options on the command line matters, or whether llama.cpp applies its own hardcoded priority internally.
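
The bandwidth remark above can be made concrete with a rough estimate. The sketch below is not the commenter's calculation (the thread does not give it); it assumes decoding is purely memory‑bandwidth bound, that every active weight is read from VRAM once per token, and that Q8 costs about 8.5 bits per weight (the llama.cpp Q8_0 layout). The 936 GB/s figure is the published RTX 3090 spec; the function names are made up for illustration.

```python
# Rough bandwidth-bound ceiling for single-stream decoding.
# Assumptions (not from the thread): each generated token streams every
# *active* weight from VRAM exactly once; KV-cache reads, activations,
# and speculative-decoding gains (MTP, n-gram) are ignored.

RTX_3090_BANDWIDTH_GB_S = 936.0   # published memory bandwidth, GB/s
Q8_BYTES_PER_WEIGHT = 8.5 / 8     # Q8_0: 32 int8 weights + one fp16 scale per block


def decode_ceiling_tok_s(bandwidth_gb_s: float,
                         active_params_billion: float,
                         bytes_per_weight: float) -> float:
    """Upper bound on tokens/s if decoding is limited only by memory bandwidth."""
    gb_read_per_token = active_params_billion * bytes_per_weight
    return bandwidth_gb_s / gb_read_per_token


def bandwidth_utilization(observed_tok_s: float,
                          bandwidth_gb_s: float,
                          active_params_billion: float,
                          bytes_per_weight: float) -> float:
    """Fraction of the bandwidth-bound ceiling that an observed rate represents."""
    ceiling = decode_ceiling_tok_s(bandwidth_gb_s, active_params_billion,
                                   bytes_per_weight)
    return observed_tok_s / ceiling


if __name__ == "__main__":
    # MoE case: ~3B active parameters per token (the "A3B" in 35B A3B).
    moe_ceiling = decode_ceiling_tok_s(RTX_3090_BANDWIDTH_GB_S, 3.0,
                                       Q8_BYTES_PER_WEIGHT)
    moe_util = bandwidth_utilization(30.0, RTX_3090_BANDWIDTH_GB_S, 3.0,
                                     Q8_BYTES_PER_WEIGHT)
    print(f"MoE ceiling: {moe_ceiling:.0f} tok/s, utilization at 30 tok/s: {moe_util:.0%}")
```

Under these assumptions the ceiling for a model with ~3B active parameters is roughly 290 tok/s, so ~30 tok/s would use only about 10% of the available bandwidth. A dense 27B model at Q8 is a different story: the same formula gives a ceiling near 33 tok/s, and the weights would not fit in a single 3090's 24 GB in the first place.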

Hardware Setups, Power, and Cooling

Multiple builds are described: 5080+3090, 2×3080 20GB, X99 boards with dual x16 slots, Oculink breakouts, and wished‑for eGPU setups.
Peer‑to‑peer (P2P) on mixed GPUs and consumer chipsets is confusing; some see a “CNS” (chipset not supported) status, others recommend patched NVIDIA modules.
Power draw is a concern: dual‑GPU nodes hit 600–700W+ and heat up rooms; power limiting via nvidia-smi is recommended.
Lower‑TDP configs (Ryzen AI, Apple Silicon) noted as “slower but wild for laptops.”
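
For the power‑limiting suggestion above, a minimal sketch of how it might be scripted. It uses the `-i` (GPU index) and `-pl` (power limit in watts) options of nvidia-smi; the helper names and the wattage values are hypothetical, nothing in the thread specifies them. Setting a limit normally requires root, and the allowed range depends on the card.

```python
import subprocess


def power_limit_cmd(gpu_index: int, watts: int) -> list[str]:
    """Build the nvidia-smi command that caps one GPU's power draw."""
    if gpu_index < 0:
        raise ValueError("gpu_index must be non-negative")
    if watts <= 0:
        raise ValueError("watts must be positive")
    return ["nvidia-smi", "-i", str(gpu_index), "-pl", str(watts)]


def apply_power_limits(limits: dict[int, int]) -> None:
    """Apply a {gpu_index: watts} mapping; raises if nvidia-smi reports an error."""
    for gpu_index, watts in limits.items():
        subprocess.run(power_limit_cmd(gpu_index, watts), check=True)


if __name__ == "__main__":
    # Hypothetical caps for a two-GPU box; print instead of applying.
    for idx, w in {0: 300, 1: 250}.items():
        print(" ".join(power_limit_cmd(idx, w)))
```

Building the command separately from running it makes it easy to review the exact invocation before changing hardware settings.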

Local vs Cloud: Cost, Privacy, and Regulation

Economic skeptics: cloud inference at ~$3/M tokens seems far cheaper than multi‑thousand‑dollar GPU rigs once electricity and noise are factored in.
Counterpoints:
- It’s a hobby; hardware is multi‑use (gaming, rendering).
- Owning hardware hedges against future price hikes, service shutdowns, or regulatory bans on LLM usage.
- Local setups provide privacy, custom control (logprobs, samplers, PEFT), and avoid opaque middlemen.
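
The cost side of this debate can be sketched as a break‑even calculation. The ~$3/M cloud price, the ~80 tok/s rate, and the 600–700W draw come from the thread; the hardware price and electricity rate below are placeholder assumptions, and the model counts single‑stream generation only and ignores the hardware's other uses.

```python
def local_usd_per_million_tokens(tok_s: float, watts: float,
                                 usd_per_kwh: float) -> float:
    """Electricity-only cost of generating one million tokens locally."""
    seconds = 1_000_000 / tok_s
    kwh = watts * seconds / 3_600_000  # W*s -> kWh
    return kwh * usd_per_kwh


def breakeven_million_tokens(hardware_usd: float, cloud_usd_per_m: float,
                             local_usd_per_m: float) -> float:
    """Millions of tokens needed before the rig pays for itself.

    Returns infinity if local electricity alone costs as much as the cloud.
    """
    saving_per_m = cloud_usd_per_m - local_usd_per_m
    if saving_per_m <= 0:
        return float("inf")
    return hardware_usd / saving_per_m


if __name__ == "__main__":
    # 80 tok/s and 650 W are taken from the thread's ranges;
    # $0.30/kWh and $3000 of hardware are assumptions for illustration.
    local = local_usd_per_million_tokens(tok_s=80, watts=650, usd_per_kwh=0.30)
    needed = breakeven_million_tokens(hardware_usd=3000, cloud_usd_per_m=3.0,
                                      local_usd_per_m=local)
    print(f"local: ${local:.2f}/M tokens, break-even after {needed:.0f}M tokens")
```

With these placeholder inputs, electricity comes to well under $1 per million tokens, but recovering the hardware cost takes on the order of a billion tokens, which is the skeptics' point. The counterpoints in the list above are about value the formula does not capture.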

Model Quality & Use Cases

Qwen 3.6 27B/35B are praised as excellent for their size, especially with MoE and large contexts; good for coding, KB‑backed tasks, and custom agents (e.g., grocery ordering via browser automation).
Compared to frontier models (Claude, others):
- Frontier models are generally still stronger, especially on complex, long chain‑of‑thought (CoT) tasks and debugging.
- Some prefer smaller local models’ “simpler” failure modes and less overconfident, polished bullshit.
Debate over whether benchmarks overhype small models and how much model size vs RLHF vs architecture really matter.
