Run LLMs on Apple Neural Engine (ANE)

Role and value of the Apple Neural Engine (ANE)

  • Mixed views on whether ANE is “wasted silicon” vs. a smart low‑power accelerator.
  • Some argue Apple should have just added tensor cores to the GPU, on the grounds that a GPU with tensor cores beats a GPU plus a separate NPU in performance per die area.
  • Counterpoint: ANE gives much better performance per watt and is tuned for mobile/“background” ML tasks (OCR, Photos features, speech, on‑device “Apple Intelligence”), leaving the GPU free for graphics.

Hardware characteristics & bottlenecks

  • ANE excels at FP16 / INT8 matrix multiply–accumulate (systolic arrays), but:
    • Limited to static shapes; variable-length attention and KV cache require workarounds (chunking, sliding fixed-size caches).
    • The ANE supports only FP16 and integer types; the GPU gained bfloat16 with the M2 generation, but the ANE did not.
    • The main bottleneck is memory bandwidth: historically ~64 GB/s on early chips, improved on M3/M4 but still well below GPU bandwidth.
  • For LLM inference, even with quantized weights, memory bandwidth dominates decode speed; the GPU is usually faster for models of roughly 3–8B parameters and above (rough estimate sketched below).
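
  A rough back‑of‑the‑envelope sketch (Python) of the bandwidth argument: each decoded token has to read essentially all of the model's weights once, so decode speed is capped at bandwidth divided by weight bytes. The bandwidth figures below are illustrative assumptions, not measurements.

    # Upper bound on decode throughput for a bandwidth-bound LLM:
    # every generated token streams (roughly) the full weight set once.
    def max_tokens_per_sec(params_billion: float, bytes_per_param: float,
                           bandwidth_gb_s: float) -> float:
        weight_gb = params_billion * bytes_per_param   # GB read per token
        return bandwidth_gb_s / weight_gb

    # 8B model with 8-bit weights (~8 GB); the bandwidths are assumed figures:
    print(max_tokens_per_sec(8, 1.0, 64))    # ~8 tok/s at ~64 GB/s (ANE-class path)
    print(max_tokens_per_sec(8, 1.0, 400))   # ~50 tok/s at ~400 GB/s (GPU-class path)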

Performance, power, and Anemll results

  • Benchmarks in the thread (M4 Max, ~8B model):
    • ANE/Anemll: ~9 tok/s, ~0.5 GB RAM.
    • MLX (GPU, 8‑bit): ~50 tok/s, ~8.5 GB RAM.
    • llama.cpp (GPU, 8‑bit): ~41 tok/s.
  • Another user: the ANE delivers about half the GPU's throughput at roughly a tenth of the power (≈2 W vs. ≈20 W), e.g. 47–62 tok/s on 1B models at a few watts (rough energy‑per‑token arithmetic below).
  • The ANE path can run with a much smaller memory footprint, but it may be streaming layers or chunking the model to fit, with unclear overhead.
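
  Quick arithmetic (Python) on those power figures: half the throughput at a tenth of the power is still roughly 5× fewer joules per generated token. The tok/s and watt values are the thread's rough numbers, not measurements of any specific chip.

    def joules_per_token(tokens_per_sec: float, watts: float) -> float:
        return watts / tokens_per_sec

    print(joules_per_token(25, 2))    # ANE-ish: ~0.08 J/token
    print(joules_per_token(50, 20))   # GPU-ish: ~0.4 J/token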

Tooling, APIs, and closed ecosystem issues

  • Core ML / coremltools are the only practical way to reach ANE; low‑level access is effectively closed.
  • Constraints: static I/O shapes, limited dtypes, and fragile model conversion (the Core ML and ONNX routes are reported as brittle, under‑maintained, and hard to debug); a conversion sketch follows this list.
  • Even Apple’s own MLX framework doesn’t support ANE due to its closed API; hobby projects (tinygrad, bare‑metal ANE) are outdated or blocked.
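
  A minimal sketch of the Core ML route the thread describes, using a hypothetical toy PyTorch module: trace it, convert it with coremltools using fixed (static) input shapes and FP16 compute, and request the Neural Engine. Real LLM conversions need far more surgery (model splitting/chunking, fixed‑size KV caches, etc.); the module, names, and shapes here are assumptions for illustration.

    import numpy as np
    import torch
    import coremltools as ct

    class TinyBlock(torch.nn.Module):
        """Stand-in for a real transformer block (illustrative only)."""
        def __init__(self):
            super().__init__()
            self.embed = torch.nn.Embedding(32000, 256)
            self.proj = torch.nn.Linear(256, 32000)

        def forward(self, input_ids):
            return self.proj(self.embed(input_ids))

    model = TinyBlock().eval()
    example = torch.zeros(1, 512, dtype=torch.int32)   # fixed sequence length
    traced = torch.jit.trace(model, example)

    mlmodel = ct.convert(
        traced,
        inputs=[ct.TensorType(name="input_ids", shape=(1, 512), dtype=np.int32)],
        compute_precision=ct.precision.FLOAT16,     # ANE is FP16/INT only
        compute_units=ct.ComputeUnit.CPU_AND_NE,    # prefer CPU + Neural Engine
        minimum_deployment_target=ct.target.iOS17,
    )
    mlmodel.save("tiny_block.mlpackage")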

Use cases & limits

  • ANE is widely used for “light” inference (vision, OCR, speech, small transformers) where low power and low thermal impact matter.
  • Training on ANE is generally seen as impractical; Apple’s own TensorFlow‑Metal uses GPU only.
  • Context length is currently limited in some ANE LLM deployments (often 512–2k tokens); workarounds such as fixed‑size sliding caches exist (a toy sketch follows this list), but there is no seamless large‑context support.
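
  A toy sketch (numpy) of the fixed‑size sliding‑cache workaround: keep the KV cache in a preallocated ring buffer so every call sees identical tensor shapes, and mask out unused slots in attention. Names and sizes are illustrative assumptions, not Anemll's actual implementation.

    import numpy as np

    class SlidingKVCache:
        def __init__(self, max_len: int = 512, head_dim: int = 64):
            self.max_len = max_len
            self.k = np.zeros((max_len, head_dim), dtype=np.float16)
            self.v = np.zeros((max_len, head_dim), dtype=np.float16)
            self.pos = 0   # tokens written so far

        def append(self, k_new: np.ndarray, v_new: np.ndarray) -> None:
            """Write the newest token's K/V into the fixed-size buffer,
            overwriting the oldest entry once the window is full."""
            idx = self.pos % self.max_len
            self.k[idx] = k_new
            self.v[idx] = v_new
            self.pos += 1

        def view(self):
            """Always (max_len, head_dim): unused slots stay zero and are
            masked out, so the graph's shapes never change."""
            return self.k, self.v

    cache = SlidingKVCache()
    cache.append(np.ones(64, dtype=np.float16), np.ones(64, dtype=np.float16))
    k, v = cache.view()   # shapes stay (512, 64) regardless of position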

Broader NPU comparison & skepticism

  • Similar complaints are raised about Qualcomm, Intel, and AMD NPUs: good for small, low‑power models but not competitive with GPUs for larger LLMs.
  • Some see all current NPUs (including ANE) as tightly constrained, use‑case‑specific hardware whose software stacks lag behind ML research and are not yet “serious” general ML platforms.