Lemonade by AMD: a fast and open source local LLM server using GPU and NPU

What Lemonade Is Trying to Be

  • Positioned as a unified local AI server and management layer focused on AMD hardware.
  • Bundles multiple runtimes/backends: llama.cpp for text/vision, a diffusion backend for image generation, Whisper-style speech-to-text, text-to-speech, and NPU runtimes (FastFlowLM).
  • Exposes OpenAI-, Ollama-, and Anthropic-compatible endpoints so existing tools and UIs can talk to it.
  • Includes its own web UI for model management, configuration, and interaction.
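Because the wire format matches OpenAI's, any OpenAI-style client can target the local server. A minimal sketch, where the base URL, port, and model name are assumptions for illustration (check Lemonade's docs or web UI for the real defaults):

```python
import json
import urllib.request

# Assumed values -- Lemonade's actual port, path, and model names may differ.
BASE_URL = "http://localhost:8000/api/v1"

def build_chat_request(model: str, prompt: str) -> dict:
    """Build an OpenAI-style /chat/completions request body."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
    }

def chat(model: str, prompt: str) -> str:
    """POST the request to the local server and return the reply text."""
    req = urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(build_chat_request(model, prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```

Since the format is OpenAI-compatible, the official `openai` Python client should also work by pointing its `base_url` at the local server.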

Comparison to Ollama, LM Studio, vLLM

  • Multiple commenters see it as “between Ollama and LM Studio”: more orchestration and multi‑modal support than simple model serving.
  • Under the hood, both Lemonade and Ollama rely on llama.cpp; Lemonade adds AMD-tuned builds and multi-backend routing.
  • A small benchmark on an M1 Max showed Lemonade modestly faster than Ollama on a single Qwen3.5 9B prompt, though a one-prompt result is anecdotal at best.
  • Some prefer using Lemonade’s ROCm‑optimized llama.cpp builds directly instead of the full server.
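Single-prompt numbers like the one above are easy to reproduce against any of these servers. A rough sketch of measuring decode throughput, assuming an OpenAI-style response with a `usage` block (the URL and model are placeholders, not documented defaults):

```python
import json
import time
import urllib.request

def decode_tps(response: dict, elapsed_s: float) -> float:
    """Tokens/sec from the completion-token count that OpenAI-style
    servers report in the response's `usage` block."""
    return response["usage"]["completion_tokens"] / elapsed_s

def benchmark(url: str, model: str, prompt: str) -> float:
    """Time one non-streamed chat completion against the server under
    test. The url/model arguments are placeholders to fill in."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }).encode()
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    start = time.perf_counter()
    with urllib.request.urlopen(req) as resp:
        data = json.load(resp)
    return decode_tps(data, time.perf_counter() - start)
```

Note that wall-clock timing of a non-streamed request includes prefill time, so it understates pure decode speed; streaming with per-token timestamps would be more precise.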

Performance, ROCm vs Vulkan

  • Reports that Vulkan can outperform ROCm on some AMD GPUs, especially integrated GPUs/APUs; others see ROCm ahead on high-end cards like the 7900 XTX.
  • A linked ROCm issue notes current performance regressions; the expectation is that ROCm should be faster once those are fixed.
  • Users report strong performance on Strix Halo and various Radeon cards, especially with Vulkan and newer kernels.

NPU Role and Limitations

  • NPU support uses FastFlowLM; its NPU kernels are proprietary (free for non‑commercial use, commercial license otherwise).
  • Consensus: NPUs are best for small, always‑on, low‑power models (e.g., STT/TTS, small LLMs, prefill offload), not large chatbot workloads.
  • On Strix Halo, NPU performance is described as underwhelming compared to the GPU/APU but effectively “free” power-wise.

Packaging and Platform Support

  • Provides deb/rpm, Ubuntu PPA, Snap, macOS beta, and container options (though some think Docker instructions should be more prominent).
  • macOS uses Metal now; MLX support is on the roadmap.

Enthusiasm vs Skepticism

  • Enthusiastic AMD users describe Lemonade as the easiest turnkey way to run local AI on AMD (especially Strix Halo).
  • Others criticize ROCm as unstable, complain about crashes when exceeding VRAM, or dismiss Lemonade as unnecessary “slop” over plain llama.cpp with Vulkan.
  • Some worry about vendor-specific stacks and proprietary NPU pieces limiting openness and portability.