Google Gemma 4 Runs Natively on iPhone with Full Offline AI Inference
Performance: iPhone vs Android / Desktop
- Multiple reports say Gemma 4 runs on both iPhone and Android with similar speed on current flagships; one commenter finds a Pixel slightly faster than an iPhone 15.
- iPhones are reported to thermally throttle on long responses, slowing token generation over time, whereas a newer Pixel keeps going.
- Benchmarks from Edge Gallery on iPhone 16 Pro: ~231 tokens/s prefill, ~16 tokens/s decode, ~1.16s to first token (GPU backend, 4B model).
- Desktop users run much larger models (e.g., 26B and 122B) on 64–128 GB RAM systems, achieving ~35–40 tokens/s and using them as daily drivers, but these are far beyond phone capabilities.
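The prefill/decode figures above translate into rough wall-clock estimates for a full response. A minimal sketch (function and parameter names are hypothetical; defaults are the iPhone 16 Pro Edge Gallery numbers reported above, and real throughput varies with thermals and context length):

```python
def estimate_response_seconds(prompt_tokens, output_tokens,
                              prefill_tps=231.0, decode_tps=16.0):
    """Rough wall-clock time for one local-LLM response.

    prefill_tps / decode_tps default to the reported iPhone 16 Pro
    benchmarks (GPU backend, 4B model). Thermal throttling would
    lower decode_tps as the response gets longer.
    """
    prefill_time = prompt_tokens / prefill_tps   # reading the prompt
    decode_time = output_tokens / decode_tps     # generating the reply
    return prefill_time + decode_time

# A 500-token prompt with a 300-token reply:
# ~2.2s prefill + ~18.8s decode, i.e. roughly 21s total
print(round(estimate_response_seconds(500, 300), 1))
```

The asymmetry matters in practice: prompt ingestion is fast, but decode throughput dominates, which is why commenters expect short, focused interactions rather than long generations on phones.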
Coherence and Practical Usefulness of Local Models
- Several users find Gemma 4 edge models (E2B/E4B), Qwen 3.5 9B/27B, and others coherent and useful for: simple commands, tone-polishing emails, moderate coding, security/OS work, and even some tax/legal-style reasoning.
- Others remain skeptical, saying small on-device models are still weaker than top cloud models and advising caution for factual questions (e.g., pet safety).
- On phones, heavy tasks or long contexts quickly hit thermal and battery limits; commenters expect more realistic use in short, focused interactions or tiny specialized models.
Apple Ecosystem: App Store Rules and ANE Limitations
- Some developers report Apple blocking or slowing updates to apps that embed local LLMs, citing guideline 2.5.2 about downloading/executing new code.
- Others note existing apps that still run Gemma locally but say Apple has been “slowly cutting them off” and may get stricter as LLMs threaten some app categories.
- There is debate over whether Apple’s Neural Engine (ANE) is a practical target for LLMs; current Gemma demos often use the GPU instead, causing higher power draw and heat.
- Some expect WWDC changes, with rumors of a new AI framework replacing Core ML to better support LLMs.
“Edge” vs “On-Device” Definitions
- Commenters disagree on terminology: some insist “edge” means near-user but not on-device, others argue the user’s device is the “ultimate edge.”
- Consensus: marketing uses the term loosely and inconsistently.
Critique of the Article and AI-Generated Content
- Several call the article shallow “marketing slop” with no benchmarks or real detail, flagged as clickbait.
- Multiple users suspect it is LLM-written, citing repeated rhetorical patterns; AI detectors are invoked, then challenged as unreliable and fundamentally limited.
Example Applications and Experiments
- Projects mentioned include:
  - A visual description app for blind users using Gemma 4 E2B, reportedly faster than some cloud tools.
  - An offline “pocket vibe coder” on iPhone using Gemma 4 to generate and locally compile a TypeScript file for small interactive apps.