iPhone 17 Pro Demonstrated Running a 400B LLM

Technical approach & model details

  • Demonstration uses a ~400B-parameter Mixture-of-Experts (MoE) model, so only a subset of experts is active per token.
  • Comments cite ~17B “active parameters” and an “effective dense size” around ~80B; there’s disagreement on how to interpret these numbers.
  • Weights are heavily quantized (down to very low bit-widths) and streamed from flash storage using an approach similar to “LLM in a Flash,” relying on OS page cache and SSD bandwidth.
  • Only the small expert layers are loaded on demand; most experts remain on disk, reducing RAM pressure at the cost of I/O.
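The expert-streaming idea can be illustrated with a minimal sketch. All names, shapes, and dtypes below are illustrative assumptions, not details from the actual demo; the point is that a memory-mapped weight file lets the OS page cache decide which experts are resident, so experts that are never routed to are never read off disk.

```python
# Minimal sketch of on-demand expert streaming via memory-mapped weights.
# All names, shapes, and dtypes are illustrative, not from the demo.
import os
import tempfile

import numpy as np

N_EXPERTS, D_IN, D_OUT = 8, 64, 64
path = os.path.join(tempfile.gettempdir(), "experts_demo.bin")

# Write a fake checkpoint so the example is self-contained; in a real
# system this file would be the quantized weights on flash storage.
rng = np.random.default_rng(0)
rng.standard_normal((N_EXPERTS, D_IN, D_OUT)).astype(np.float16).tofile(path)

# mmap the file: nothing is read into RAM until a page is touched, so
# the OS page cache decides which experts stay resident.
experts = np.memmap(path, dtype=np.float16, mode="r",
                    shape=(N_EXPERTS, D_IN, D_OUT))

def moe_forward(x, router_logits, top_k=2):
    """Run only the top-k routed experts; the rest never leave disk."""
    chosen = np.argsort(router_logits)[-top_k:]
    out = np.zeros(D_OUT, dtype=np.float32)
    for e in chosen:
        # Indexing expert e pages in only that expert's weights.
        out += x @ experts[e].astype(np.float32)
    return out / top_k

x = np.zeros(D_IN, dtype=np.float32)
y = moe_forward(x, np.arange(N_EXPERTS, dtype=np.float32))
```

Repeated hits to the same experts stay warm in the page cache, which is why routing locality (discussed further below) matters so much for throughput.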

Performance, practicality, and constraints

  • Throughput is ~0.4–0.6 tokens/second, with long time-to-first-token; many call this a demo or “toy,” not practical for interactive use.
  • Storage bandwidth, not raw compute, is the main bottleneck; faster SSDs on newer Macs help but don’t eliminate it.
  • Battery drain and heat are highlighted as fundamental constraints for phones and tablets.
  • Context length and KV-cache growth would further slow things; some argue this makes large models on phones inherently impractical.
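The bandwidth-bound framing is easy to sanity-check with back-of-envelope arithmetic: if every token must stream its active weights from flash, token rate is roughly storage bandwidth divided by bytes moved per token. The bit-width and flash bandwidth below are assumptions for illustration, not figures from the demo; only the ~17B active-parameter count and the 0.4–0.6 tokens/s range come from the discussion.

```python
# Back-of-envelope: tokens/s ≈ storage bandwidth / bytes streamed per token.
# Assumed numbers (illustrative): 2-bit weights, ~2 GB/s sustained flash reads.
active_params = 17e9           # ~17B active parameters per token (cited)
bits_per_weight = 2            # assumed aggressive quantization
bandwidth_bytes_s = 2.0e9      # assumed sustained flash read bandwidth

bytes_per_token = active_params * bits_per_weight / 8   # ≈ 4.25 GB
tokens_per_s = bandwidth_bytes_s / bytes_per_token
print(f"{tokens_per_s:.2f} tokens/s")
```

Under these assumptions the estimate lands around ~0.5 tokens/s, consistent with the reported 0.4–0.6 range; page-cache hits on frequently routed experts would push the real number above this floor.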

Hardware discussion (phones vs datacenter)

  • Debate over whether this is primarily a hardware or software achievement; the rough consensus is that it’s clever software exploiting strong mobile SoCs.
  • Apple’s unified memory, high memory bandwidth, and package-on-package (PoP) RAM are praised, but others note similar designs exist across mobile.
  • Some claim A-series/M-series are nearing desktop-class performance; others stress that data-center GPUs remain orders of magnitude faster and more energy-efficient.

Software & algorithmic angles

  • People see this as an example of “real” engineering (mmap, tiling, OS caching) entering what was previously research/prototyping territory.
  • Discussion of future improvements: better MoE routing, cache-friendly expert utilization, KV prediction to reduce prefill latency.
  • Acknowledgment that quantization quality varies by method and layer sensitivity.

Usefulness, value, and future directions

  • Enthusiasts see it as a remarkable proof-of-concept and a signal that powerful edge models will become commonplace as hardware and smaller architectures improve.
  • Skeptics argue that:
    • Running such huge models on phones is a gimmick; smaller tuned models are more useful.
    • Energy, latency, and context constraints make on-device large LLMs “never” competitive for most workloads.
  • Others counter that:
    • Edge models matter for privacy, offline use, and avoiding subscriptions/ads.
    • Even slow local inference can be useful for batch or non-interactive tasks.

Apple, economics, and ecosystem

  • Some suggest Apple could “win” in AI via distribution and tight hardware–software integration rather than giant AI capex.
  • Concern that RAM costs and global AI demand will limit how much memory phones can ship with.
  • Broader debate on whether the future is:
    • Massive proprietary cloud models with thin clients, or
    • “Good enough” open(-weight) models running locally plus smaller, more efficient architectures.