Run LLMs on Apple Neural Engine (ANE)

Role and value of the Apple Neural Engine (ANE)

  • Mixed views on whether ANE is “wasted silicon” vs. a smart low‑power accelerator.
  • Some argue Apple should have just added tensor cores to the GPU, on the grounds that a GPU with tensor cores beats a GPU plus a separate NPU in performance per die area.
  • Counterpoint: ANE gives much better performance per watt and is tuned for mobile/“background” ML tasks (OCR, Photos features, speech, on‑device “Apple Intelligence”), leaving the GPU free for graphics.

Hardware characteristics & bottlenecks

  • ANE excels at FP16 / INT8 matrix multiply–accumulate (systolic arrays), but:
    • Limited to static shapes; variable-length attention and KV cache require workarounds (chunking, sliding fixed-size caches).
    • The ANE supports only FP16 and integer types; the GPU gained bfloat16 with the M2 generation, but the ANE did not.
    • The main bottleneck is memory bandwidth: historically ~64 GB/s on early chips, improved on M3/M4 but still well below GPU bandwidth.
  • For LLM inference, even with quantized weights, memory bandwidth dominates decode speed; the GPU is usually faster for models of roughly 3–8B parameters and above (rough estimate sketched below).
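
  A rough back‑of‑the‑envelope sketch (Python) of the bandwidth argument: each decoded token has to read essentially all of the model's weights once, so decode speed is capped at bandwidth divided by weight bytes. The bandwidth figures below are illustrative assumptions, not measurements.

    # Upper bound on decode throughput for a bandwidth-bound LLM:
    # every generated token streams (roughly) the full weight set once.
    def max_tokens_per_sec(params_billion: float, bytes_per_param: float,
                           bandwidth_gb_s: float) -> float:
        weight_gb = params_billion * bytes_per_param   # GB read per token
        return bandwidth_gb_s / weight_gb

    # 8B model with 8-bit weights (~8 GB); the bandwidths are assumed figures:
    print(max_tokens_per_sec(8, 1.0, 64))    # ~8 tok/s at ~64 GB/s (ANE-class path)
    print(max_tokens_per_sec(8, 1.0, 400))   # ~50 tok/s at ~400 GB/s (GPU-class path)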

Performance, power, and Anemll results

  • Benchmarks in the thread (M4 Max, ~8B model):
    • ANE/Anemll: ~9 tok/s, ~0.5 GB RAM.
    • MLX (GPU, 8‑bit): ~50 tok/s, ~8.5 GB RAM.
    • llama.cpp (GPU, 8‑bit): ~41 tok/s.
  • Another user: the ANE delivers about half the GPU's throughput at roughly a tenth of the power (≈2 W vs. ≈20 W), e.g. 47–62 tok/s on 1B models at a few watts (rough energy‑per‑token arithmetic below).
  • The ANE path can run with a much smaller memory footprint, but it may be streaming layers or chunking the model to fit, with unclear overhead.
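
  Quick arithmetic (Python) on those power figures: half the throughput at a tenth of the power is still roughly 5× fewer joules per generated token. The tok/s and watt values are the thread's rough numbers, not measurements of any specific chip.

    def joules_per_token(tokens_per_sec: float, watts: float) -> float:
        return watts / tokens_per_sec

    print(joules_per_token(25, 2))    # ANE-ish: ~0.08 J/token
    print(joules_per_token(50, 20))   # GPU-ish: ~0.4 J/token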

Tooling, APIs, and closed ecosystem issues

  • Core ML / coremltools are the only practical way to reach ANE; low‑level access is effectively closed.
  • Constraints: static I/O shapes, limited dtypes, and fragile model conversion (the Core ML and ONNX routes are reported as brittle, under‑maintained, and hard to debug); a conversion sketch follows this list.
  • Even Apple’s own MLX framework doesn’t support ANE due to its closed API; hobby projects (tinygrad, bare‑metal ANE) are outdated or blocked.
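
  A minimal sketch of the Core ML route the thread describes, using a hypothetical toy PyTorch module: trace it, convert it with coremltools using fixed (static) input shapes and FP16 compute, and request the Neural Engine. Real LLM conversions need far more surgery (model splitting/chunking, fixed‑size KV caches, etc.); the module, names, and shapes here are assumptions for illustration.

    import numpy as np
    import torch
    import coremltools as ct

    class TinyBlock(torch.nn.Module):
        """Stand-in for a real transformer block (illustrative only)."""
        def __init__(self):
            super().__init__()
            self.embed = torch.nn.Embedding(32000, 256)
            self.proj = torch.nn.Linear(256, 32000)

        def forward(self, input_ids):
            return self.proj(self.embed(input_ids))

    model = TinyBlock().eval()
    example = torch.zeros(1, 512, dtype=torch.int32)   # fixed sequence length
    traced = torch.jit.trace(model, example)

    mlmodel = ct.convert(
        traced,
        inputs=[ct.TensorType(name="input_ids", shape=(1, 512), dtype=np.int32)],
        compute_precision=ct.precision.FLOAT16,     # ANE is FP16/INT only
        compute_units=ct.ComputeUnit.CPU_AND_NE,    # prefer CPU + Neural Engine
        minimum_deployment_target=ct.target.iOS17,
    )
    mlmodel.save("tiny_block.mlpackage")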

Use cases & limits

  • ANE is widely used for “light” inference (vision, OCR, speech, small transformers) where low power and low thermal impact matter.
  • Training on ANE is generally seen as impractical; Apple’s own TensorFlow‑Metal uses GPU only.
  • Context length is currently limited in some ANE LLM deployments (often 512–2k tokens); workarounds such as fixed‑size sliding caches exist (a toy sketch follows this list), but there is no seamless large‑context support.
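
  A toy sketch (numpy) of the fixed‑size sliding‑cache workaround: keep the KV cache in a preallocated ring buffer so every call sees identical tensor shapes, and mask out unused slots in attention. Names and sizes are illustrative assumptions, not Anemll's actual implementation.

    import numpy as np

    class SlidingKVCache:
        def __init__(self, max_len: int = 512, head_dim: int = 64):
            self.max_len = max_len
            self.k = np.zeros((max_len, head_dim), dtype=np.float16)
            self.v = np.zeros((max_len, head_dim), dtype=np.float16)
            self.pos = 0   # tokens written so far

        def append(self, k_new: np.ndarray, v_new: np.ndarray) -> None:
            """Write the newest token's K/V into the fixed-size buffer,
            overwriting the oldest entry once the window is full."""
            idx = self.pos % self.max_len
            self.k[idx] = k_new
            self.v[idx] = v_new
            self.pos += 1

        def view(self):
            """Always (max_len, head_dim): unused slots stay zero and are
            masked out, so the graph's shapes never change."""
            return self.k, self.v

    cache = SlidingKVCache()
    cache.append(np.ones(64, dtype=np.float16), np.ones(64, dtype=np.float16))
    k, v = cache.view()   # shapes stay (512, 64) regardless of position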

Broader NPU comparison & skepticism

  • Similar complaints are raised about Qualcomm, Intel, and AMD NPUs: good for small, low‑power models but not competitive with GPUs for larger LLMs.
  • Some see all current NPUs (including ANE) as tightly constrained, use‑case‑specific hardware whose software stacks lag behind ML research and are not yet “serious” general ML platforms.