RTX 5080 and RTX 3090 Setup: 80 Tok/s on Qwen 3.6 27B Q8

Performance & Optimization

  • Original post: RTX 5080 + 3090 achieving ~80 tok/s on Qwen 3.6 27B Q8 with multi‑token prediction (MTP); others report:
    • 2×4090: ~90 tok/s (27B Q8) and ~260 tok/s (35B A3B Q8).
    • Single 3090: ~120 tok/s on Qwen 3.6 35B A3B with MTP via LM Studio.
    • MacBook M5 Max (MLX): ~20 tok/s on 27B Q8.
    • Ryzen AI Max+ 395: ~28 tok/s (27B), ~60 tok/s (35B).
  • Some commenters want more theory: optimal tensor/weight splits, driver issues, and bandwidth utilization.
  • One commenter notes that a 3090 memory bandwidth calculation suggests significant headroom even at ~30 tok/s.
  • Commenters share recommended Qwen 3.6 sampling and MTP/n‑gram speculative settings; the article’s settings are criticized as suboptimal.
  • Discussion of combining MTP with n‑gram speculative decoding: whether the order of options on the command line matters, or whether llama.cpp applies an internally hardcoded priority.
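
The bandwidth argument above can be sketched as a back‑of‑envelope calculation. The thread does not give the commenter's exact arithmetic, so this is one common form of it, under stated assumptions: decoding is memory‑bound, each generated token reads every active weight once, and KV‑cache traffic is ignored. The 936 GB/s figure is the RTX 3090's spec‑sheet bandwidth; the 8.5 bits per weight (llama.cpp's Q8_0 block layout) and the ~3B active parameters for the 35B A3B model are assumptions, and the function names are illustrative.

```python
def bytes_per_token(active_params: float, bits_per_weight: float) -> float:
    """Weight bytes read from VRAM to decode one token (KV cache ignored)."""
    return active_params * bits_per_weight / 8.0


def bandwidth_ceiling_tok_s(bandwidth_gb_s: float, active_params: float,
                            bits_per_weight: float) -> float:
    """Upper bound on plain (non-speculative) decode speed for a memory-bound model."""
    return bandwidth_gb_s * 1e9 / bytes_per_token(active_params, bits_per_weight)


def bandwidth_utilization(tok_s: float, bandwidth_gb_s: float,
                          active_params: float, bits_per_weight: float) -> float:
    """Fraction of peak bandwidth that a measured decode speed implies."""
    used = tok_s * bytes_per_token(active_params, bits_per_weight)
    return used / (bandwidth_gb_s * 1e9)


if __name__ == "__main__":
    RTX_3090_GB_S = 936.0   # spec-sheet memory bandwidth
    Q8_BITS = 8.5           # assumption: Q8_0 stores 32 weights in 34 bytes
    ACTIVE_PARAMS = 3e9     # assumption: ~3B active parameters per token (A3B)

    ceiling = bandwidth_ceiling_tok_s(RTX_3090_GB_S, ACTIVE_PARAMS, Q8_BITS)
    share = bandwidth_utilization(30.0, RTX_3090_GB_S, ACTIVE_PARAMS, Q8_BITS)
    print(f"ceiling: {ceiling:.0f} tok/s, utilization at 30 tok/s: {share:.0%}")
```

Speculative decoding such as MTP verifies several tokens per pass over the weights, so measured throughput can exceed this per‑token ceiling; treat the number as a sanity check, not a limit on what the reported setups can reach.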

Hardware Setups, Power, and Cooling

  • Multiple builds: 5080+3090, 2×3080 20GB, X99 dual x16 boards, Oculink breakouts, and wishes for eGPU options.
  • Peer‑to‑peer (P2P) on mixed GPUs and consumer chipsets is confusing; some see “CNS” (chipset not supported) reported, others recommend patched NVIDIA kernel modules.
  • Power draw is a concern: dual‑GPU nodes hit 600–700W+ and heat up rooms; power limiting via nvidia-smi is recommended.
  • Lower‑TDP configs (Ryzen AI, Apple Silicon) noted as “slower but wild for laptops.”
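
The nvidia-smi power limiting mentioned above could look roughly like the sketch below. It only builds and optionally runs `nvidia-smi -i <index> -pl <watts>`, which are existing nvidia-smi options; the GPU indices and wattages are placeholders, not values from the thread. The valid range depends on the card, and changing the limit normally requires root privileges.

```python
import subprocess


def power_limit_command(gpu_index: int, watts: int) -> list[str]:
    """Build the nvidia-smi call that caps one GPU's power draw."""
    if gpu_index < 0:
        raise ValueError("gpu_index must be non-negative")
    if watts <= 0:
        raise ValueError("watts must be positive")
    return ["nvidia-smi", "-i", str(gpu_index), "-pl", str(watts)]


def apply_power_limits(limits: dict[int, int], dry_run: bool = True) -> list[list[str]]:
    """Build (and, unless dry_run, execute) one command per GPU, in index order."""
    commands = [power_limit_command(idx, w) for idx, w in sorted(limits.items())]
    if not dry_run:
        for cmd in commands:
            # Raises CalledProcessError if nvidia-smi rejects the value.
            subprocess.run(cmd, check=True)
    return commands


if __name__ == "__main__":
    # Placeholder wattages for a two-GPU box; pick values valid for your cards.
    for cmd in apply_power_limits({0: 250, 1: 280}, dry_run=True):
        print(" ".join(cmd))
```

The dry‑run default prints the commands without touching the driver, which makes it easy to review them before running with elevated privileges.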

Local vs Cloud: Cost, Privacy, and Regulation

  • Economic skeptics: cloud inference at ~$3/M tokens seems far cheaper than multi‑thousand‑dollar GPU rigs plus electricity and noise.
  • Counterpoints:
    • It’s a hobby; hardware is multi‑use (gaming, rendering).
    • Owning hardware hedges against future price hikes, service shutdowns, or regulatory bans on LLM usage.
    • Local setups provide privacy, custom control (logprobs, samplers, PEFT), and avoid opaque middlemen.
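
The cost side of this debate can be made concrete with a rough break‑even sketch. Only the ~$3/M cloud price, the ~80 tok/s speed, and the 600–700W draw come from the thread; the electricity price and hardware cost in the example are made‑up placeholders. The model also assumes the rig runs at full throughput the whole time and ignores resale value, idle power, and the hardware's other uses.

```python
import math


def electricity_cost_per_mtok(tok_s: float, watts: float, usd_per_kwh: float) -> float:
    """Electricity cost (USD) to generate one million tokens locally."""
    hours = 1_000_000 / tok_s / 3600.0
    return (watts / 1000.0) * hours * usd_per_kwh


def break_even_mtok(hardware_usd: float, cloud_usd_per_mtok: float,
                    local_usd_per_mtok: float) -> float:
    """Millions of tokens after which the rig has paid for itself.

    Returns infinity when local electricity alone costs as much as the cloud.
    """
    saving = cloud_usd_per_mtok - local_usd_per_mtok
    if saving <= 0:
        return math.inf
    return hardware_usd / saving


if __name__ == "__main__":
    # 80 tok/s and 650 W follow the thread; $0.15/kWh and $3000 are placeholders.
    local = electricity_cost_per_mtok(tok_s=80.0, watts=650.0, usd_per_kwh=0.15)
    mtok = break_even_mtok(hardware_usd=3000.0, cloud_usd_per_mtok=3.0,
                           local_usd_per_mtok=local)
    print(f"local: ${local:.2f}/M tokens, break-even after {mtok:.0f}M tokens")
```

Plugging in your own prices shows how sensitive the result is to utilization: a rig that sits idle most of the day takes proportionally longer to reach break‑even, which is the skeptics' core point.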

Model Quality & Use Cases

  • Qwen 3.6 27B/35B are praised as excellent for their size, especially with MoE and large contexts; good for coding, knowledge‑base‑backed tasks, and custom agents (e.g., grocery ordering via browser automation).
  • Compared to frontier models (Claude, others):
    • Frontier models are generally still stronger, especially on complex, long chain‑of‑thought (CoT) tasks and debugging.
    • Some prefer smaller local models’ “simpler” failure modes and less overconfident, polished bullshit.
  • Debate over whether benchmarks overhype small models and how much model size vs RLHF vs architecture really matter.