Run LLMs on Apple Neural Engine (ANE)
Role and value of the Apple Neural Engine (ANE)
- Mixed views on whether ANE is “wasted silicon” vs. a smart low‑power accelerator.
- Some argue Apple should have just added tensor cores to the GPU, since a GPU with tensor cores would beat a GPU plus a separate NPU on performance per die area.
- Counterpoint: ANE gives much better performance per watt and is tuned for mobile/“background” ML tasks (OCR, Photos features, speech, on‑device “Apple Intelligence”), leaving the GPU free for graphics.
Hardware characteristics & bottlenecks
- ANE excels at FP16 / INT8 matrix multiply–accumulate (systolic arrays), but:
  - It is limited to static shapes; variable‑length attention and the KV cache need workarounds such as chunking or sliding fixed‑size caches (see the sketch after this list).
  - It supports only FP16 and integer formats; the GPU gained bfloat16 with M2, but the ANE did not.
- The main bottleneck is memory bandwidth: historically ~64 GB/s on earlier chips, improved on M3/M4 but still well below GPU bandwidth.
- For LLMs, especially quantized ones, memory bandwidth dominates decode speed; the GPU is usually faster for ~3–8B‑parameter models and up.
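A minimal sketch (plain NumPy, outside any Core ML graph) of the sliding fixed‑size cache idea referenced above; all sizes and names here are illustrative, not taken from Anemll or any real ANE deployment:

```python
import numpy as np

class SlidingKVCache:
    """Fixed-capacity K/V cache so every tensor keeps a static shape."""

    def __init__(self, capacity=512, n_heads=8, head_dim=64):
        # Pre-allocated buffers whose shapes never change, which is what a
        # static-shape accelerator needs; the sizes here are made up.
        self.capacity = capacity
        self.k = np.zeros((n_heads, capacity, head_dim), dtype=np.float16)
        self.v = np.zeros((n_heads, capacity, head_dim), dtype=np.float16)
        self.length = 0  # number of valid positions currently held

    def append(self, k_new, v_new):
        """Add one token's keys/values; slide out the oldest token when full."""
        if self.length == self.capacity:
            self.k[:, :-1] = self.k[:, 1:]   # drop the oldest position
            self.v[:, :-1] = self.v[:, 1:]
            self.length -= 1
        self.k[:, self.length] = k_new       # k_new: (n_heads, head_dim)
        self.v[:, self.length] = v_new
        self.length += 1

    def attention_mask(self):
        """Static-length additive mask: 0 for valid slots, -inf for empty ones."""
        mask = np.full(self.capacity, -np.inf, dtype=np.float32)
        mask[: self.length] = 0.0
        return mask
```

The price of this approach is that context older than `capacity` tokens is discarded, which is consistent with the limited‑context behaviour noted later in the section.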
Performance, power, and Anemll results
- Benchmarks in the thread (M4 Max, ~8B model):
  - ANE/Anemll: ~9 tok/s, ~0.5 GB RAM.
  - MLX (GPU, 8‑bit): ~50 tok/s, ~8.5 GB RAM.
  - llama.cpp (GPU, 8‑bit): ~41 tok/s.
- Another user reports the ANE at about half the GPU’s throughput but roughly 10x lower power draw (≈2 W vs. ≈20 W), e.g. 47–62 tok/s on 1B models at a few watts.
- ANE deployments can run with much smaller resident memory footprints, but may need to stream layers or split the model into chunks to fit, with overhead the thread does not quantify.
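Those throughput numbers line up with a simple bandwidth roofline. A back‑of‑envelope check in Python (the bandwidth values are assumed round numbers for illustration; the thread does not give exact figures):

```python
# Back-of-envelope roofline: each generated token reads every weight once,
# so decode speed is capped at (memory bandwidth) / (bytes of weights).
# The bandwidth figures below are assumed round numbers, not measurements.
model_params = 8e9        # ~8B-parameter model
bytes_per_param = 1       # 8-bit quantization
weight_bytes = model_params * bytes_per_param  # ~8 GB read per token

for name, bw_gb_s in [("ANE-like path (~60 GB/s)", 60),
                      ("M4 Max GPU (~546 GB/s)", 546)]:
    print(f"{name}: ~{bw_gb_s * 1e9 / weight_bytes:.0f} tok/s ceiling")
# Prints roughly 8 and 68 tok/s, the same order of magnitude as the
# ~9 vs. ~50 tok/s reported above.
```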
Tooling, APIs, and closed ecosystem issues
- Core ML / coremltools are the only practical way to reach ANE; low‑level access is effectively closed.
- Constraints: static I/O shapes, limited dtypes, and fragile model conversion (Core ML and ONNX routes reported as brittle, under‑maintained, and hard to debug).
- Even Apple’s own MLX framework doesn’t support ANE due to its closed API; hobby projects (tinygrad, bare‑metal ANE) are outdated or blocked.
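For concreteness, a minimal conversion sketch showing where those constraints surface (the `TinyBlock` module and shapes are hypothetical; `ComputeUnit.CPU_AND_NE` only *requests* the ANE and does not guarantee ops are scheduled there):

```python
import torch
import coremltools as ct

# Hypothetical toy module standing in for one transformer block.
class TinyBlock(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.proj = torch.nn.Linear(512, 512)

    def forward(self, x):
        return torch.nn.functional.gelu(self.proj(x))

example = torch.zeros(1, 128, 512)              # fixed (batch, seq, dim)
traced = torch.jit.trace(TinyBlock().eval(), example)

mlmodel = ct.convert(
    traced,
    inputs=[ct.TensorType(name="x", shape=example.shape)],  # static shape only
    compute_precision=ct.precision.FLOAT16,     # the ANE path is FP16
    compute_units=ct.ComputeUnit.CPU_AND_NE,    # request CPU + Neural Engine
    minimum_deployment_target=ct.target.macOS14,
)
mlmodel.save("tiny_block.mlpackage")

# Ops the ANE cannot handle silently fall back to the CPU; seeing what
# actually ran on the Neural Engine requires Xcode's Core ML performance
# report or powermetrics, not anything in this API.
```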
Use cases & limits
- ANE is widely used for “light” inference (vision, OCR, speech, small transformers) where low power and low thermal impact matter.
- Training on ANE is generally seen as impractical; Apple’s own TensorFlow‑Metal uses GPU only.
- Context length is currently limited in some ANE LLM deployments (often 512–2k tokens), with workarounds but no seamless large‑context support.
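One partial workaround on the Core ML side is enumerated shapes, which trade a single static length for a small fixed menu of lengths; a sketch only (the particular lengths are hypothetical, and this does not by itself give seamless long‑context support):

```python
import numpy as np
import coremltools as ct

# A fixed menu of allowed sequence lengths; anything outside it is rejected.
# The specific lengths are illustrative, not taken from the thread.
seq_choices = ct.EnumeratedShapes(shapes=[[1, 512], [1, 1024], [1, 2048]],
                                  default=[1, 512])

inputs = [ct.TensorType(name="input_ids", shape=seq_choices, dtype=np.int32)]
# Pass `inputs` to ct.convert() as in the conversion sketch above; whether the
# resulting model still runs on the ANE (rather than falling back to CPU/GPU)
# has to be checked per model.
```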
Broader NPU comparison & skepticism
- Similar complaints are raised about Qualcomm, Intel, and AMD NPUs: good for small, low‑power models but not competitive with GPUs for larger LLMs.
- Some see all current NPUs (including ANE) as tightly constrained, use‑case‑specific hardware whose software stacks lag ML research and are not yet “serious” general ML platforms.