iPhone 17 Pro Demonstrated Running a 400B LLM
Technical approach & model details
- Demonstration uses a ~400B-parameter Mixture-of-Experts (MoE) model, but only a subset of experts is active per token.
- Commenters cite ~17B “active parameters” and an “effective dense size” of roughly 80B; there is disagreement over how to interpret these figures.
- Weights are heavily quantized (down to very low bit-widths) and streamed from flash storage using an approach similar to “LLM in a Flash,” relying on OS page cache and SSD bandwidth.
- Only the small expert layers needed for the current token are loaded on demand; most experts remain on disk, reducing RAM pressure at the cost of extra I/O.
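The on-demand expert loading described above can be sketched with a memory-mapped weights file. Everything here (file name, sizes, router outputs) is illustrative, not the demo's actual implementation; the point is that the OS page cache keeps recently used experts resident while untouched experts never leave flash:

```python
import numpy as np

# Illustrative sizes only; a real MoE checkpoint is orders of magnitude larger.
NUM_EXPERTS, D_IN, D_OUT = 8, 64, 64

# Simulate a checkpoint on flash: all experts stored contiguously in one file.
rng = np.random.default_rng(0)
rng.standard_normal((NUM_EXPERTS, D_IN, D_OUT), dtype=np.float32).tofile("experts.bin")

# mmap the file: nothing is read into RAM until a page is actually touched.
experts = np.memmap("experts.bin", dtype=np.float32, mode="r",
                    shape=(NUM_EXPERTS, D_IN, D_OUT))

def moe_layer(x, expert_ids, gates):
    """Apply only the router-selected experts; unselected experts stay on disk."""
    out = np.zeros(D_OUT, dtype=np.float32)
    for eid, gate in zip(expert_ids, gates):
        w = np.asarray(experts[eid])  # page-faults in just this expert's weights
        out += gate * (x @ w)
    return out

x = rng.standard_normal(D_IN, dtype=np.float32)
y = moe_layer(x, expert_ids=[2, 5], gates=[0.6, 0.4])
```

Repeated tokens that route to the same experts hit the page cache and skip the flash read entirely, which is why expert reuse patterns matter so much for throughput.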
Performance, practicality, and constraints
- Throughput is ~0.4–0.6 tokens/second, with long time-to-first-token; many call this a demo or “toy,” not practical for interactive use.
- Storage bandwidth, not raw compute, is the main bottleneck; the faster SSDs in newer Macs raise throughput but don’t remove the ceiling.
- Battery drain and heat are highlighted as fundamental constraints for phones and tablets.
- Longer contexts and KV-cache growth would slow generation further; some argue this makes large models on phones inherently impractical.
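The cited throughput is consistent with a simple bandwidth roofline. As a back-of-envelope check (every constant here is an assumption for illustration, not a measured figure from the demo):

```python
ACTIVE_PARAMS = 17e9      # ~17B active parameters per token (figure cited in thread)
BITS_PER_WEIGHT = 2       # assumed very low-bit quantization
FLASH_BW = 3e9            # assumed ~3 GB/s effective flash read bandwidth

# Each decoded token must read all active weights that aren't already cached,
# so flash bandwidth caps the token rate regardless of compute.
bytes_per_token = ACTIVE_PARAMS * BITS_PER_WEIGHT / 8      # ~4.25 GB per token
ceiling_tps = FLASH_BW / bytes_per_token                   # ~0.7 tokens/s

# KV-cache reads grow linearly with context, lowering the ceiling further.
# Assumed transformer shape; lumps all reads into one effective bandwidth.
N_LAYERS, KV_HEADS, HEAD_DIM, KV_BYTES = 60, 8, 128, 2
kv_bytes_per_pos = 2 * N_LAYERS * KV_HEADS * HEAD_DIM * KV_BYTES  # K and V

def tps_at_context(ctx_len):
    return FLASH_BW / (bytes_per_token + ctx_len * kv_bytes_per_pos)
```

Under these assumptions the ceiling is roughly 0.7 tokens/s, in line with the observed ~0.4–0.6, and at an 8K context the KV term alone pushes it below 0.5.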
Hardware discussion (phones vs datacenter)
- Debate over whether this is primarily a hardware or software achievement; consensus that it’s clever software exploiting strong mobile SoCs.
- Apple’s unified memory, high memory bandwidth, and PoP RAM packaging are praised, but others note similar designs exist across mobile.
- Some claim A-series/M-series are nearing desktop-class performance; others stress that data-center GPUs remain orders of magnitude faster and more energy-efficient.
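The “orders of magnitude” claim follows from memory bandwidth alone, before batching and energy are even considered. A rough comparison, using assumed round-number bandwidths rather than benchmarked figures:

```python
# Decode is memory-bandwidth-bound: tokens/s ≈ bandwidth / bytes of weights read.
MODEL_BYTES = 8e9          # e.g. an 8B-parameter model at 8-bit weights (assumed)
PHONE_BW = 60e9            # assumed LPDDR5-class mobile bandwidth, ~60 GB/s
HBM_BW = 3.35e12           # HBM3-class datacenter bandwidth (H100 ballpark)

phone_tps = PHONE_BW / MODEL_BYTES   # ~7.5 tokens/s for a single stream
gpu_tps = HBM_BW / MODEL_BYTES       # ~420 tokens/s, before batching multiplies it
speedup = gpu_tps / phone_tps        # ~56x from bandwidth alone
```

Batched serving and far lower joules-per-token widen the gap well past this single-stream ratio, which is the substance of the “orders of magnitude” position.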
Software & algorithmic angles
- People see this as an example of “real” engineering (mmap, tiling, OS caching) entering what was previously research/prototyping territory.
- Discussion of future improvements: better MoE routing, cache-friendly expert utilization, KV prediction to reduce prefill latency.
- Acknowledgment that quantization quality varies by method and layer sensitivity.
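A minimal sketch of why quantization quality varies by layer: with symmetric per-channel 4-bit quantization, a layer containing outlier channels forces coarse scales and larger reconstruction error, which is why sensitive layers are often kept at higher precision. This is illustrative, not the method used in the demo:

```python
import numpy as np

def quantize_int4(w):
    """Symmetric per-output-channel quantization to the 4-bit range [-8, 7]."""
    scale = np.abs(w).max(axis=1, keepdims=True) / 7.0   # one scale per row
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q * scale

rng = np.random.default_rng(0)
w_smooth = rng.normal(0, 1, size=(16, 64)).astype(np.float32)
w_outlier = w_smooth.copy()
w_outlier[:, 0] *= 20.0   # one outlier channel inflates every row's scale

err_smooth = np.abs(w_smooth - dequantize(*quantize_int4(w_smooth))).mean()
err_outlier = np.abs(w_outlier - dequantize(*quantize_int4(w_outlier))).mean()
# err_outlier >> err_smooth: outlier-heavy layers suffer most at low bit-widths
```

Mixed-precision schemes exploit exactly this: spend bits where the error is large, quantize aggressively where it is small.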
Usefulness, value, and future directions
- Enthusiasts see it as a remarkable proof-of-concept and a signal that powerful edge models will become commonplace as hardware and smaller architectures improve.
- Skeptics argue that:
  - Running such huge models on phones is a gimmick; smaller, task-tuned models are more useful.
  - Energy, latency, and context constraints mean large on-device LLMs will “never” be competitive for most workloads.
- Others counter that:
  - Edge models matter for privacy, offline use, and avoiding subscriptions/ads.
  - Even slow local inference can be useful for batch or non-interactive tasks.
Apple, economics, and ecosystem
- Some suggest Apple could “win” in AI via distribution and tight hardware–software integration rather than giant AI capex.
- Concern that RAM costs and global AI demand will limit how much memory phones can ship with.
- Broader debate on whether the future is:
  - Massive proprietary cloud models with thin clients, or
  - “Good enough” open(-weight) models running locally, plus smaller, more efficient architectures.