Lemonade by AMD: a fast and open source local LLM server using GPU and NPU
What Lemonade Is Trying to Be
- Positioned as a unified local AI server and management layer focused on AMD hardware.
- Bundles multiple runtimes/backends: llama.cpp for text/vision, diffusion for images, Whisper-style STT, TTS, and NPU runtimes (FastFlowLM).
- Exposes OpenAI-, Ollama-, and Anthropic-compatible endpoints so existing tools and UIs can talk to it.
- Includes its own web UI for model management, configuration, and interaction.
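Because the server exposes the OpenAI wire format, any OpenAI-style client can point at it. A minimal standard-library sketch is below; the base URL, port, and model name are assumptions for illustration, not confirmed Lemonade defaults, so check your local install's actual values:

```python
import json
import urllib.request

# Assumed local endpoint; substitute the host/port your Lemonade server
# actually listens on.
BASE_URL = "http://localhost:8000/api/v1"


def build_chat_request(model, prompt):
    """Build an OpenAI-style chat-completion payload."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }


def chat(model, prompt):
    """POST a chat completion and return the assistant's reply text."""
    payload = build_chat_request(model, prompt)
    req = urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

The same payload shape works against Ollama's and other OpenAI-compatible servers, which is what lets existing tools and UIs talk to Lemonade unchanged.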
Comparison to Ollama, LM Studio, vLLM
- Multiple commenters see it as “between Ollama and LM Studio”: more orchestration and multi‑modal support than simple model serving.
- Under the hood, both Lemonade and Ollama rely on llama.cpp; Lemonade adds AMD-tuned builds and multi-backend routing.
- A small benchmark on an M1 Max showed Lemonade modestly faster than Ollama for one Qwen3.5 9B prompt, but this is anecdotal.
- Some prefer using Lemonade’s ROCm‑optimized llama.cpp builds directly instead of the full server.
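Single-prompt comparisons like the one above are easy to reproduce against any local server. A minimal timing harness might look like the following sketch, where `generate` is a hypothetical stand-in for whatever client call returns a generated token count:

```python
import time


def tokens_per_second(n_tokens, elapsed_s):
    """Throughput metric commonly reported in llama.cpp-style benchmarks."""
    return n_tokens / elapsed_s


def time_generation(generate, prompt):
    """Time one generation call and return its tokens/sec.

    `generate` is any callable taking a prompt and returning the number
    of tokens produced (a hypothetical interface; wrap your actual
    client call to match it).
    """
    start = time.perf_counter()
    n_tokens = generate(prompt)
    return tokens_per_second(n_tokens, time.perf_counter() - start)
```

A single prompt, as the commenters note, is anecdotal; averaging over many prompts and separating prefill from decode throughput gives a fairer comparison.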
Performance, ROCm vs Vulkan
- Reports that Vulkan can outperform ROCm on some AMD GPUs, especially integrated/APUs; others see ROCm faster on high-end cards like 7900 XTX.
- A linked ROCm issue notes current regressions; expectation is ROCm should be faster if fixed.
- Users report strong performance on Strix Halo and various Radeon cards, especially with Vulkan and newer kernels.
NPU Role and Limitations
- NPU support uses FastFlowLM; its NPU kernels are proprietary (free for non‑commercial use, commercial license otherwise).
- Consensus: NPUs are best for small, always‑on, low‑power models (e.g., STT/TTS, small LLMs, prefill offload), not large chatbot workloads.
- On Strix Halo, NPU performance is described as underwhelming compared to the GPU/APU but effectively “free” power-wise.
Packaging and Platform Support
- Provides deb/rpm, Ubuntu PPA, Snap, macOS beta, and container options (though some think Docker instructions should be more prominent).
- macOS uses Metal now; MLX support is on the roadmap.
Enthusiasm vs Skepticism
- Enthusiastic AMD users describe Lemonade as the easiest turnkey way to run local AI on AMD (especially Strix Halo).
- Others criticize ROCm as unstable, complain about crashes when exceeding VRAM, or dismiss Lemonade as unnecessary “slop” over plain llama.cpp with Vulkan.
- Some worry about vendor-specific stacks and proprietary NPU pieces limiting openness and portability.