Google Gemma 4 Runs Natively on iPhone with Full Offline AI Inference

Performance: iPhone vs Android / Desktop

  • Multiple reports say Gemma 4 runs on both iPhone and Android at similar speeds on current flagships; one commenter finds a Pixel slightly faster than an iPhone 15.
  • iPhones reportedly thermally throttle on long responses, slowing token generation after a while, whereas a newer Pixel sustains its pace.
  • Benchmarks from Edge Gallery on iPhone 16 Pro: ~231 tokens/s prefill, ~16 tokens/s decode, ~1.16s to first token (GPU backend, 4B model).
  • Desktop users run much larger models (e.g., 26B and 122B) on 64–128 GB RAM systems, achieving ~35–40 tokens/s and using them as daily drivers, though these configurations are far beyond what phones can handle.
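The benchmark figures above lend themselves to a back-of-envelope latency estimate. The sketch below uses only the reported iPhone 16 Pro numbers (~231 tokens/s prefill, ~16 tokens/s decode); the 268-token prompt length is an assumption chosen to match the reported ~1.16 s time to first token.

```python
# Back-of-envelope latency from the reported iPhone 16 Pro numbers:
# ~231 tok/s prefill, ~16 tok/s decode. Illustrative only.

def estimated_latency(prompt_tokens: int, output_tokens: int,
                      prefill_tps: float = 231.0,
                      decode_tps: float = 16.0) -> float:
    """Seconds until the full response has been generated."""
    return prompt_tokens / prefill_tps + output_tokens / decode_tps

# A ~268-token prompt matches the reported ~1.16 s time-to-first-token:
ttft = 268 / 231.0                    # ≈ 1.16 s
total = estimated_latency(268, 200)   # prefill + 200 tokens at 16 tok/s
print(f"TTFT ≈ {ttft:.2f} s, full 200-token reply ≈ {total:.1f} s")
```

This is why commenters focus on short interactions: at ~16 tokens/s decode, a 200-token reply alone takes ~12.5 s on top of the prefill time.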

Coherence and Practical Usefulness of Local Models

  • Several users find Gemma 4 edge models (E2B/E4B), Qwen 3.5 9B/27B, and others coherent and useful for: simple commands, tone-polishing emails, moderate coding, security/OS work, and even some tax/legal-style reasoning.
  • Others remain skeptical, saying small on-device models are still weaker than top cloud models and advising caution for factual questions (e.g., pet safety).
  • On phones, heavy tasks or long contexts quickly hit thermal and battery limits; commenters expect more realistic use in short, focused interactions or tiny specialized models.

Apple Ecosystem: App Store Rules and ANE Limitations

  • Some developers report Apple blocking or slowing updates to apps that embed local LLMs, citing guideline 2.5.2 about downloading/executing new code.
  • Others note existing apps that still run Gemma locally but say Apple has been “slowly cutting them off” and may get stricter as LLMs threaten some app categories.
  • There is debate over whether Apple’s Neural Engine (ANE) is a practical target for LLMs; current Gemma demos often use the GPU instead, causing higher power draw and heat.
  • Some expect WWDC changes, with rumors of a new AI framework replacing Core ML to better support LLMs.

“Edge” vs “On-Device” Definitions

  • Commenters disagree on terminology: some insist “edge” means near-user but not on-device, while others argue the user’s device is the “ultimate edge.”
  • Consensus: marketing uses the term loosely and inconsistently.

Critique of the Article and AI-Generated Content

  • Several call the article shallow “marketing slop” with no benchmarks or real detail, flagged as clickbait.
  • Multiple users suspect it is LLM-written, citing repeated rhetorical patterns; AI detectors are invoked, then challenged as unreliable and fundamentally limited.

Example Applications and Experiments

  • Projects mentioned include:
    • A visual description app for blind users using Gemma 4 E2B, reportedly faster than some cloud tools.
    • An offline “pocket vibe coder” on iPhone using Gemma 4 to generate and locally compile a TypeScript file for small interactive apps.
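The “pocket vibe coder” workflow above can be sketched at a high level: prompt a local model for TypeScript source, write it to disk, and compile it on-device. Everything here is hypothetical — `generate_local()` is a placeholder for whatever on-device inference API the app actually uses, and invoking `tsc` assumes a TypeScript compiler is available locally.

```python
# Hypothetical sketch of a generate-then-compile loop. `generate_local`
# stands in for an on-device Gemma inference call (not a real API).
import subprocess
from pathlib import Path

def generate_local(prompt: str) -> str:
    """Placeholder for on-device model inference (assumption, not a real API)."""
    raise NotImplementedError

def vibe_code(prompt: str, out: Path = Path("app.ts")) -> None:
    # Ask the local model for a complete TypeScript program.
    source = generate_local(f"Write a self-contained TypeScript app: {prompt}")
    out.write_text(source)
    # Compile locally; raises CalledProcessError if tsc reports errors.
    subprocess.run(["tsc", str(out)], check=True)
```

The interesting part is that both steps stay offline: the model and the compiler run on the phone, so no code ever leaves the device.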