Flash-MoE: Running a 397B Parameter Model on a Laptop
Performance and Setup
- Flash-MoE runs Qwen3.5-397B MoE on recent MacBook Pros by streaming 200+ GB of 4-bit (or lower) quantized weights from SSD via custom Metal kernels.
- Reported speeds: ~4–6.5 tokens/s for the 397B model with aggressive quantization and a reduced number of active experts; some consider this usable with patience, others effectively too slow.
- IO is bursty: SSD can be saturated briefly when loading experts, but average bandwidth is well below peak SSD capabilities.
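A rough back-of-the-envelope sketch of why SSD bandwidth bounds decode speed when experts are streamed per token. The numbers below (active parameters per token, SSD bandwidth) are hypothetical illustrations, not figures from Flash-MoE:

```python
def bandwidth_bound_tokens_per_s(active_params_b, bits_per_weight, ssd_gb_per_s):
    """Upper bound on decode speed when every active expert weight
    must be streamed from SSD for each token (no caching at all)."""
    bytes_per_token = active_params_b * 1e9 * bits_per_weight / 8
    return ssd_gb_per_s * 1e9 / bytes_per_token

# Hypothetical: ~17B active params/token at 2 bits, 6 GB/s SSD.
print(round(bandwidth_bound_tokens_per_s(17, 2, 6), 2))  # → 1.41
```

Reported speeds above this no-cache bound are plausible only because shared weights and frequently reused experts stay resident in RAM, which also explains why the SSD reads are bursty rather than sustained at peak.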
SSD, IO, and Hardware Constraints
- Reads do not significantly wear SSDs; writes do. Some note read-disturb may require occasional rewrites on modern NAND but is considered minor.
- Concern: using an internal Mac SSD as 24/7 “model RAM” feels risky/expensive; others argue it’s fine for read-heavy workloads.
- PCIe bandwidth is the real limiter in multi-SSD or discrete-GPU setups; once offload transfers dominate, swapping the SSD for DRAM doesn't move data to the GPU any faster, because the link itself is the ceiling.
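The link-is-the-ceiling argument can be sketched as a min over hops. The bandwidth figures are hypothetical round numbers, not measurements:

```python
def offload_ceiling_tokens_per_s(bytes_per_token, source_gbps, link_gbps):
    # Throughput is bounded by the slowest hop on the path to the GPU.
    effective_gbps = min(source_gbps, link_gbps)
    return effective_gbps * 1e9 / bytes_per_token

# Hypothetical: 4 GB of expert weights per token over a ~32 GB/s PCIe link.
dram = offload_ceiling_tokens_per_s(4e9, 64, 32)  # DRAM source (64 GB/s)
ssds = offload_ceiling_tokens_per_s(4e9, 40, 32)  # striped NVMe source (40 GB/s)
print(dram == ssds)  # True: once both sources exceed the link, the ceiling is identical
```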
Quantization, Experts, and Quality
- Method uses 2-bit (and 4-bit) quantization and reduces experts per token from 10 to 4. Several commenters argue 2-bit “lobotomizes” the model, especially on longer sessions and tool calls (e.g., broken JSON quoting).
- Others report success with carefully tuned low-bit quants (~2.5 bpw) on large MoE models, but stress not all 2-bit schemes are equal.
- Consensus trend: 4-bit is often the lowest “generally acceptable” level; many prefer 6-bit for reliability over long contexts.
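The quality cliff between bit widths comes from quantization error growing as levels shrink. A minimal sketch with naive symmetric uniform quantization (deliberately simpler than the tuned ~2.5 bpw schemes mentioned above, which is exactly why "not all 2-bit schemes are equal"):

```python
def quantize_dequantize(values, bits):
    # Naive symmetric uniform quantization to signed `bits`-bit integers.
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(v) for v in values) / qmax
    return [round(v / scale) * scale for v in values]

def mean_abs_error(values, bits):
    deq = quantize_dequantize(values, bits)
    return sum(abs(a - b) for a, b in zip(values, deq)) / len(values)

# Deterministic pseudo-weights in roughly [-0.65, 0.65].
weights = [0.013 * ((i * 37) % 101 - 50) for i in range(1000)]
for bits in (2, 4, 6):
    print(bits, round(mean_abs_error(weights, bits), 4))
```

Each extra bit roughly halves the rounding error; at 2 bits this naive scheme leaves only three representable levels, which is why practical low-bit quants rely on per-group scales and careful calibration.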
Alternatives and Follow‑ups
- There are more practical variants targeting 4-bit and smaller “intelligence-dense” 30B models with hybrid RAM+disk streaming, including MLX-based forks.
- Many frameworks (llama.cpp, vLLM, sglang) already support offloading weights to RAM/disk and mixing CPU/GPU, but user experience and performance often degrade sharply once models spill beyond VRAM.
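For comparison, llama.cpp's standard offload path looks roughly like the following; flag names reflect recent builds and should be checked against `llama-cli --help` (the layer count is an illustrative placeholder):

```shell
# Keep 24 transformer layers on the GPU, spill the rest to system RAM;
# weights beyond RAM are paged from disk via mmap (the default).
llama-cli -m model.gguf -ngl 24 -p "Explain MoE routing in one paragraph."
```

Once the spilled portion dominates, each token pays for transfers over the CPU–GPU link, which is the sharp degradation commenters describe.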
Use Cases and Latency Tolerance
- Some argue 4 tok/s is fine for research or low-volume personal use; others note that for iterative coding/agent workflows, 10–100× slower responses erase productivity gains.
- A few compare this to offline film rendering: slow but acceptable for “big batch” questions, though LLMs’ interactive nature makes long turnaround risky if prompts or directions are wrong.
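The productivity argument is just turnaround arithmetic. The cloud decode rate below is a hypothetical figure for illustration:

```python
def turnaround_s(tokens, tok_per_s):
    # Wall-clock time to stream a full reply at a steady decode rate.
    return tokens / tok_per_s

# Hypothetical 600-token reply:
local = turnaround_s(600, 4)   # 150.0 s (~2.5 min) on the laptop
cloud = turnaround_s(600, 80)  # 7.5 s at an assumed 80 tok/s hosted endpoint
```

A 2.5-minute turn is tolerable for one "big batch" question but compounds badly across the dozens of turns an agentic coding loop makes per hour.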
Consumer Hardware and Accessibility
- Debate over calling high-RAM MacBook Pros “consumer hardware.” They are retail-available but expensive and not typical.
- Some note that similar results can be had on other high-end laptops or desktops, but the broader public will likely stick to cloud APIs for cost reasons.
Licensing and AI‑Generated Code
- The repo initially lacked a license; discussion notes you cannot redistribute unlicensed code but can likely run it.
- Since much code is AI-generated, some suggest it may not be copyrightable at all, though the degree of human contribution is unclear.