RTX 5080 and RTX 3090 Setup: 80 Tok/s on Qwen 3.6 27B Q8
Performance & Optimization
- Original post: an RTX 5080 + RTX 3090 reaching ~80 tok/s on Qwen 3.6 27B Q8 with MTP. Other commenters report:
  - 2×4090: ~90 tok/s (27B Q8) and ~260 tok/s (35B A3B Q8).
  - Single 3090: ~120 tok/s on Qwen 3.6 35B A3B with MTP via LM Studio.
  - MacBook M5 Max (MLX): ~20 tok/s on 27B Q8.
  - Ryzen AI Max+ 395: ~28 tok/s (27B), ~60 tok/s (35B).
- Some commenters want more theory: optimal tensor/weight splits, driver issues, and bandwidth utilization.
- One commenter notes that a 3090 memory bandwidth calculation suggests significant headroom even at ~30 tok/s.
- Commenters share recommended Qwen 3.6 sampling and MTP/n‑gram speculative settings; the article's settings are criticized as suboptimal.
- Discussion of combining MTP with n‑gram speculative decoding: whether the order of the options on the command line matters, or whether llama.cpp applies its own hardcoded priority internally.
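The bandwidth-headroom argument mentioned above can be sketched as a back-of-envelope calculation. This is a minimal sketch, not the commenter's actual calculation: it assumes decoding is memory-bound and that each forward pass reads every weight resident on the GPU exactly once. The 936 GB/s figure is the RTX 3090's published peak memory bandwidth; the 16 GB of resident weights and 30 passes per second are hypothetical values chosen only for illustration.

```python
def bandwidth_utilization(resident_weight_gb: float,
                          passes_per_s: float,
                          peak_gb_s: float) -> float:
    """Fraction of peak memory bandwidth used, assuming every forward pass
    reads all weights resident on this GPU exactly once (memory-bound decode)."""
    achieved_gb_s = resident_weight_gb * passes_per_s
    return achieved_gb_s / peak_gb_s


RTX_3090_PEAK_GB_S = 936.0  # published peak memory bandwidth of the RTX 3090

# Hypothetical split: 16 GB of the model's weights live on the 3090,
# and the GPU completes 30 forward passes per second.
util = bandwidth_utilization(16.0, 30.0, RTX_3090_PEAK_GB_S)
print(f"Estimated bandwidth utilization: {util:.0%}")
```

Note that with MTP or other speculative decoding, one forward pass can yield more than one accepted token, so passes per second is lower than the reported tok/s. KV-cache reads and activations are ignored here, so the estimate is a lower bound on actual memory traffic.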
Hardware Setups, Power, and Cooling
- Multiple builds described: 5080+3090, 2×3080 20GB, X99 boards with dual x16 slots, Oculink breakouts, and wishes for eGPU options.
- Peer‑to‑peer (P2P) on mixed GPUs and consumer chipsets is confusing; some see “CNS” (chipset not supported) reported for their GPU pairs, others recommend patched NVIDIA kernel modules.
- Power draw is a concern: dual‑GPU nodes hitting 600–700W+ and heating rooms; power limiting via nvidia-smi is recommended.
- Lower‑TDP configs (Ryzen AI, Apple Silicon) noted as “slower but wild for laptops.”
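The power-limiting advice above could look roughly like the following. This is a sketch under stated assumptions: it relies on the `-i` (GPU index) and `-pl` (power limit in watts) options of nvidia-smi, the helper names are hypothetical, and the 250 W value is an illustrative number, not a setting recommended in the thread. Changing the limit normally requires root privileges, so the helper defaults to a dry run that only returns the command.

```python
import shutil
import subprocess


def power_limit_cmd(gpu_index: int, watts: int) -> list[str]:
    """Build the nvidia-smi argv that caps one GPU's power draw."""
    return ["nvidia-smi", "-i", str(gpu_index), "-pl", str(watts)]


def apply_power_limit(gpu_index: int, watts: int, dry_run: bool = True) -> list[str]:
    """Return the command; run it only when dry_run is False and nvidia-smi exists."""
    cmd = power_limit_cmd(gpu_index, watts)
    if not dry_run and shutil.which("nvidia-smi") is not None:
        subprocess.run(cmd, check=True)  # typically needs root
    return cmd


# Hypothetical example: cap GPU 0 at 250 W (dry run, nothing is executed).
print(" ".join(apply_power_limit(0, 250)))
```

The valid range for a given card can be checked with `nvidia-smi -q -d POWER` before choosing a value.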
Local vs Cloud: Cost, Privacy, and Regulation
- Economic skeptics: cloud inference at ~$3 per million tokens seems far cheaper than a multi‑thousand‑dollar GPU rig plus electricity and noise.
- Counterpoints:
  - It’s a hobby; hardware is multi‑use (gaming, rendering).
  - Owning hardware hedges against future price hikes, service shutdowns, or regulatory bans on LLM usage.
  - Local setups provide privacy, custom control (logprobs, samplers, PEFT), and avoid opaque middlemen.
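The cost argument above can be made concrete with a rough break-even estimate. This is a minimal sketch: the ~$3 per million tokens, ~80 tok/s, and 600–700 W figures come from the discussion, while the $0.15/kWh electricity price and the $3,000 hardware cost are hypothetical placeholders to be replaced with real numbers. It ignores idle power, resale value, and any difference in model quality.

```python
from typing import Optional


def electricity_cost_per_mtok(tok_s: float, watts: float, usd_per_kwh: float) -> float:
    """Electricity cost (USD) of generating one million tokens locally."""
    hours = 1_000_000 / tok_s / 3600
    return hours * (watts / 1000) * usd_per_kwh


def break_even_mtok(hardware_usd: float,
                    cloud_usd_per_mtok: float,
                    local_usd_per_mtok: float) -> Optional[float]:
    """Millions of tokens needed before the rig pays for itself.
    Returns None when local generation is not cheaper per token."""
    saving = cloud_usd_per_mtok - local_usd_per_mtok
    if saving <= 0:
        return None
    return hardware_usd / saving


# 80 tok/s and 650 W are taken from the thread; the price per kWh is assumed.
local = electricity_cost_per_mtok(tok_s=80, watts=650, usd_per_kwh=0.15)
# Hypothetical $3,000 rig versus cloud inference at $3 per million tokens.
needed = break_even_mtok(hardware_usd=3000, cloud_usd_per_mtok=3.0,
                         local_usd_per_mtok=local)
print(f"Local electricity: ${local:.2f} per million tokens")
print(f"Break-even after about {needed:.0f} million tokens")
```

Under these assumptions the break-even point is on the order of a billion generated tokens, which is consistent with the skeptics' point that the purchase is hard to justify on token cost alone.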
Model Quality & Use Cases
- Qwen 3.6 27B/35B praised as excellent for size, especially with MoE and large contexts; good for coding, KB‑backed tasks, and custom agents (e.g., grocery ordering via browser automation).
- Compared to frontier models (Claude, others):
  - Frontier models generally still stronger, especially on complex, long CoT tasks and debugging.
  - Some prefer smaller local models’ “simpler” failure modes and less overconfident, polished bullshit.
- Debate over whether benchmarks overhype small models and how much model size vs RLHF vs architecture really matter.