Google Gemma 4 Runs Natively on iPhone with Full Offline AI Inference
Performance: iPhone vs Android / Desktop
- Multiple reports say Gemma 4 runs on both iPhone and Android with similar speed on current flagships; one commenter finds a Pixel slightly faster than an iPhone 15.
- iPhones are reported to thermally throttle on long responses, slowing token generation over time, whereas a newer Pixel keeps going.
- Benchmarks from Edge Gallery on iPhone 16 Pro: ~231 tokens/s prefill, ~16 tokens/s decode, ~1.16s to first token (GPU backend, 4B model).
- Desktop users run much larger models (e.g., 26B and 122B) on 64–128 GB RAM systems, achieving ~35–40 tokens/s and using them as daily drivers, but these are far beyond phone capabilities.
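The prefill/decode figures above translate into rough wall-clock estimates for a full response. A minimal sketch (function and parameter names are hypothetical; defaults are the iPhone 16 Pro Edge Gallery numbers reported above, and real throughput varies with thermals and context length):

```python
def estimate_response_seconds(prompt_tokens, output_tokens,
                              prefill_tps=231.0, decode_tps=16.0):
    """Rough wall-clock time for one local-LLM response.

    prefill_tps / decode_tps default to the reported iPhone 16 Pro
    benchmarks (GPU backend, 4B model). Thermal throttling would
    lower decode_tps as the response gets longer.
    """
    prefill_time = prompt_tokens / prefill_tps   # reading the prompt
    decode_time = output_tokens / decode_tps     # generating the reply
    return prefill_time + decode_time

# A 500-token prompt with a 300-token reply:
# ~2.2s prefill + ~18.8s decode, i.e. roughly 21s total
print(round(estimate_response_seconds(500, 300), 1))
```

The asymmetry matters in practice: prompt ingestion is fast, but decode throughput dominates, which is why commenters expect short, focused interactions rather than long generations on phones.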
Coherence and Practical Usefulness of Local Models
- Several users find Gemma 4 edge models (E2B/E4B), Qwen 3.5 9B/27B, and others coherent and useful for: simple commands, tone-polishing emails, moderate coding, security/OS work, and even some tax/legal-style reasoning.
- Others remain skeptical, saying small on-device models are still weaker than top cloud models and advising caution for factual questions (e.g., pet safety).
- On phones, heavy tasks or long contexts quickly hit thermal and battery limits; commenters expect more realistic use in short, focused interactions or tiny specialized models.
Apple Ecosystem: App Store Rules and ANE Limitations
- Some developers report Apple blocking or slowing updates to apps that embed local LLMs, citing guideline 2.5.2 about downloading/executing new code.
- Others note existing apps that still run Gemma locally but say Apple has been “slowly cutting them off” and may get stricter as LLMs threaten some app categories.
- There is debate over whether Apple’s Neural Engine (ANE) is a practical target for LLMs; current Gemma demos often use the GPU instead, causing higher power draw and heat.
- Some expect WWDC changes, with rumors of a new AI framework replacing Core ML to better support LLMs.
“Edge” vs “On-Device” Definitions
- Commenters disagree on terminology: some insist “edge” means near-user but not on-device, others argue the user’s device is the “ultimate edge.”
- Consensus: marketing uses the term loosely and inconsistently.
Critique of the Article and AI-Generated Content
- Several call the article shallow “marketing slop” with no benchmarks or real detail, flagged as clickbait.
- Multiple users suspect it is LLM-written, citing repeated rhetorical patterns; AI detectors are invoked, then challenged as unreliable and fundamentally limited.
Example Applications and Experiments
- Projects mentioned include:
  - A visual description app for blind users using Gemma 4 E2B, reportedly faster than some cloud tools.
  - An offline “pocket vibe coder” on iPhone using Gemma 4 to generate and locally compile a TypeScript file for small interactive apps.