Vision Now Available in Llama.cpp
What’s New in llama.cpp Vision Support
- Vision support has been reintroduced and generalized:
  - Unified under a new `llama-mtmd-cli` tool instead of per-model CLIs.
  - Integrated into `llama-server`, so the OpenAI-compatible HTTP API and web UI can now handle images (see the sketch after this list).
  - Image-to-embedding preprocessing is moved into a separate library, similar in spirit to separating tokenizers for text models.
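For illustration, here is a minimal sketch of sending an image to a locally running `llama-server` through its OpenAI-compatible chat endpoint. The port, placeholder model name, and the exact payload shape (the standard OpenAI `image_url` message format) are assumptions to verify against the server's documentation, not details taken from the discussion.

```python
# Minimal sketch: query a local llama-server (OpenAI-compatible API) with an image.
# Assumptions: llama-server is already running with a vision model on port 8080,
# and it accepts the standard OpenAI "image_url" message format.
import base64
import requests

def describe_image(path: str, prompt: str = "Describe this image.") -> str:
    with open(path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("ascii")

    payload = {
        "model": "local",  # placeholder; the server serves whatever model it was started with
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {"type": "image_url",
                     "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
                ],
            }
        ],
    }
    resp = requests.post("http://localhost:8080/v1/chat/completions",
                         json=payload, timeout=120)
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]

if __name__ == "__main__":
    print(describe_image("photo.jpg"))
```

Encoding the image as a base64 data URL is simply the portable way to get it over HTTP; the bundled web UI handles the same upload interactively.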
Model & Runtime Support
- Supports a wide range of multimodal models, including Gemma 3 (4B–27B), Pixtral/Mistral Small, and SmolVLM/SmolVLM2 (including video variants); a local-run sketch follows this list.
- Compared to Ollama:
- Tighter integration with the ggml stack allows more aggressive optimizations (2D-RoPE tricks, upcoming flash attention) and generally more models.
- Ollama has some features llama.cpp lacks (e.g., Gemma 3 iSWA / interleaved sliding window attention), and now uses its own Go-based runner for new models.
- Vision had existed before (e.g., Llava-style models) but was deprecated; this is a cleaner, generalized reintroduction.
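To make the model list above concrete, a hedged sketch of driving the unified CLI from Python for a one-off description. The binary name comes from the thread, but the flag names and GGUF file names are assumptions to check against `llama-mtmd-cli --help`.

```python
# Sketch: one-shot image description via the unified llama-mtmd-cli binary.
# Assumptions: the flag names (-m, --mmproj, --image, -p) and the GGUF file
# names are placeholders; verify against `llama-mtmd-cli --help`.
import subprocess

cmd = [
    "./llama-mtmd-cli",
    "-m", "gemma-3-4b-it-Q4_K_M.gguf",        # placeholder text-model weights
    "--mmproj", "mmproj-gemma-3-4b-it.gguf",  # placeholder vision projector
    "--image", "photo.jpg",
    "-p", "List the main objects in this photo.",
]
result = subprocess.run(cmd, capture_output=True, text=True, check=True)
print(result.stdout)
```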
Performance, Installation, and Tooling
- Users report good speeds on Apple Silicon (M1/M2), older PCs, and Vulkan GPUs; 4B vision models can describe images in ~15 seconds on an M1.
- GPU offload is tuned via
-ngl; Metal now auto-maxes this by default, CUDA still requires explicit values. - Installation paths discussed:
- Build from source (cmake) or use Homebrew (
--HEADor upgrade once formula updates). - Precompiled multi-platform binaries exist; macOS users may need to clear quarantine attributes.
- Build from source (cmake) or use Homebrew (
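As a concrete (but assumed) example of the `-ngl` tuning mentioned above, the sketch below launches `llama-server` with full GPU offload. `-m`, `-ngl`, and `--port` are standard llama.cpp flags; the `--mmproj` flag and the file names are assumptions for a vision-enabled setup.

```python
# Sketch: start llama-server with explicit GPU offload via -ngl.
# On Metal builds -ngl is reportedly maxed automatically; on CUDA it must be set.
# File names and the --mmproj flag are assumptions for a vision-enabled setup.
import subprocess

server = subprocess.Popen([
    "./llama-server",
    "-m", "gemma-3-4b-it-Q4_K_M.gguf",        # placeholder model file
    "--mmproj", "mmproj-gemma-3-4b-it.gguf",  # placeholder vision projector
    "-ngl", "99",                             # offload (up to) 99 layers to the GPU
    "--port", "8080",
])
# ... send requests to http://localhost:8080/v1/chat/completions, then:
server.terminate()
```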
Use Cases and Experiments
- Photo management: auto-generating keywords, descriptions, basic OCR, and location/context inference for large image sets; results stored in SQLite for search and summarization (a minimal sketch follows this list).
- SmolVLM series suggested for real-time, low-resource tasks like home video surveillance.
- Ideas floated for UI development tooling and automated screenshot-to-feedback workflows.
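A hedged sketch of the photo-tagging workflow described above: walk a directory, ask a local vision endpoint for a caption and keywords, and store the results in SQLite for later search. The endpoint URL, prompt wording, and schema are all assumptions, not details from the thread.

```python
# Sketch of the photo-management idea: caption + keyword every image in a folder
# via a local llama-server vision endpoint, then store results in SQLite.
# The endpoint URL, prompts, and schema are assumptions.
import base64
import pathlib
import sqlite3
import requests

API = "http://localhost:8080/v1/chat/completions"

def ask(image_path: pathlib.Path, prompt: str) -> str:
    b64 = base64.b64encode(image_path.read_bytes()).decode("ascii")
    payload = {
        "model": "local",
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            ],
        }],
    }
    r = requests.post(API, json=payload, timeout=300)
    r.raise_for_status()
    return r.json()["choices"][0]["message"]["content"]

db = sqlite3.connect("photos.db")
db.execute("CREATE TABLE IF NOT EXISTS photos (path TEXT PRIMARY KEY, caption TEXT, keywords TEXT)")

for img in pathlib.Path("./photos").glob("*.jpg"):
    caption = ask(img, "Describe this photo in one sentence.")
    keywords = ask(img, "List 5-10 comma-separated keywords for this photo.")
    db.execute("INSERT OR REPLACE INTO photos VALUES (?, ?, ?)", (str(img), caption, keywords))
    db.commit()

# Later: db.execute("SELECT path FROM photos WHERE keywords LIKE ?", ("%beach%",))
```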
Limitations, Bugs, and Quality Issues
- Some users initially got clearly wrong but highly specific “stock” descriptions, traced to images not actually loading (a sanity-check sketch follows this list).
- Quality of tiny models (sub-2.2B) is questioned; 4B works “good enough” for tagging but misses finer details versus larger multimodal models.
- No image generation support; llama.cpp focuses on transformer LLMs, not diffusion models.
- Multimodal benchmarking for open-source models is seen as underdeveloped, and open models are viewed as lagging behind closed-source offerings.
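Given the “stock description” failure mode above, a small sanity check before sending an image can catch files that never actually loaded. The magic-byte checks below cover only JPEG, PNG, and WebP and are a defensive sketch on the caller's side, not anything from llama.cpp itself.

```python
# Defensive sketch: verify an image file actually contains image data before
# sending it, to avoid the "confidently wrong description of a missing image"
# failure mode. Only JPEG/PNG/WebP signatures are checked here.
import pathlib

def looks_like_image(path: str) -> bool:
    data = pathlib.Path(path).read_bytes()
    if len(data) < 12:
        return False
    if data.startswith(b"\xff\xd8\xff"):               # JPEG
        return True
    if data.startswith(b"\x89PNG\r\n\x1a\n"):          # PNG
        return True
    if data[:4] == b"RIFF" and data[8:12] == b"WEBP":  # WebP
        return True
    return False

assert looks_like_image("photo.jpg"), "image did not load; the model would hallucinate"
```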
Broader AI Reflections
- Some commenters are excited about edge inference and rapid app development; others are skeptical about claims of near-term macroeconomic impact.
- Debate over whether current LLMs are just “stochastic parrots” versus being capable of emergent reasoning when placed in feedback loops.