Lemonade by AMD: a fast and open source local LLM server using GPU and NPU
What Lemonade Is Trying to Be
- Positioned as a unified local AI server and management layer focused on AMD hardware.
- Bundles multiple runtimes/backends: llama.cpp for text/vision, diffusion for images, Whisper-style STT, TTS, and NPU runtimes (FastFlowLM).
- Exposes OpenAI-, Ollama-, and Anthropic-compatible endpoints so existing tools and UIs can talk to it.
- Includes its own web UI for model management, configuration, and interaction.
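Because the server exposes the OpenAI wire format, any OpenAI-style client can point at it. A minimal standard-library sketch is below; the base URL, port, and model name are assumptions for illustration, not confirmed Lemonade defaults, so check your local install's actual values:

```python
import json
import urllib.request

# Assumed local endpoint; substitute the host/port your Lemonade server
# actually listens on.
BASE_URL = "http://localhost:8000/api/v1"


def build_chat_request(model, prompt):
    """Build an OpenAI-style chat-completion payload."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }


def chat(model, prompt):
    """POST a chat completion and return the assistant's reply text."""
    payload = build_chat_request(model, prompt)
    req = urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

The same payload shape works against Ollama's and other OpenAI-compatible servers, which is what lets existing tools and UIs talk to Lemonade unchanged.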
Comparison to Ollama, LM Studio, vLLM
- Multiple commenters see it as “between Ollama and LM Studio”: more orchestration and multi‑modal support than simple model serving.
- Under the hood, both Lemonade and Ollama rely on llama.cpp; Lemonade adds AMD-tuned builds and multi-backend routing.
- A small benchmark on an M1 Max showed Lemonade modestly faster than Ollama for one Qwen3.5 9B prompt, but this is anecdotal.
- Some prefer using Lemonade’s ROCm‑optimized llama.cpp builds directly instead of the full server.
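Single-prompt comparisons like the one above are easy to reproduce against any local server. A minimal timing harness might look like the following sketch, where `generate` is a hypothetical stand-in for whatever client call returns a generated token count:

```python
import time


def tokens_per_second(n_tokens, elapsed_s):
    """Throughput metric commonly reported in llama.cpp-style benchmarks."""
    return n_tokens / elapsed_s


def time_generation(generate, prompt):
    """Time one generation call and return its tokens/sec.

    `generate` is any callable taking a prompt and returning the number
    of tokens produced (a hypothetical interface; wrap your actual
    client call to match it).
    """
    start = time.perf_counter()
    n_tokens = generate(prompt)
    return tokens_per_second(n_tokens, time.perf_counter() - start)
```

A single prompt, as the commenters note, is anecdotal; averaging over many prompts and separating prefill from decode throughput gives a fairer comparison.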
Performance, ROCm vs Vulkan
- Reports that Vulkan can outperform ROCm on some AMD GPUs, especially integrated/APUs; others see ROCm faster on high-end cards like 7900 XTX.
- A linked ROCm issue notes current regressions; expectation is ROCm should be faster if fixed.
- Users report strong performance on Strix Halo and various Radeon cards, especially with Vulkan and newer kernels.
NPU Role and Limitations
- NPU support uses FastFlowLM; its NPU kernels are proprietary (free for non‑commercial use, commercial license otherwise).
- Consensus: NPUs are best for small, always‑on, low‑power models (e.g., STT/TTS, small LLMs, prefill offload), not large chatbot workloads.
- On Strix Halo, NPU performance is described as underwhelming compared to the GPU/APU but effectively “free” power-wise.
Packaging and Platform Support
- Provides deb/rpm, Ubuntu PPA, Snap, macOS beta, and container options (though some think Docker instructions should be more prominent).
- macOS uses Metal now; MLX support is on the roadmap.
Enthusiasm vs Skepticism
- Enthusiastic AMD users describe Lemonade as the easiest turnkey way to run local AI on AMD (especially Strix Halo).
- Others criticize ROCm as unstable, complain about crashes when exceeding VRAM, or dismiss Lemonade as unnecessary “slop” over plain llama.cpp with Vulkan.
- Some worry about vendor-specific stacks and proprietary NPU pieces limiting openness and portability.