Vision Now Available in Llama.cpp

What’s New in llama.cpp Vision Support

  • Vision support has been reintroduced and generalized:
    • Unified under a new llama-mtmd-cli tool instead of per-model CLIs.
    • Integrated into llama-server, so the OpenAI-compatible HTTP API and web UI can now handle images (see the sketch after this list).
    • Image-to-embedding preprocessing has been moved into a separate library, similar in spirit to how tokenizers are separated out for text models.
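
As a quick illustration of the llama-server integration, here is a minimal sketch that posts a base64-encoded image to the OpenAI-compatible chat endpoint. The port, model/projector filenames, and prompt are placeholders; check the exact launch flags against your llama-server build.

```python
import base64

import requests

# Assumes llama-server is already running locally with a vision-capable GGUF
# and its multimodal projector, e.g. something like:
#   llama-server -m gemma-3-4b-it.gguf --mmproj mmproj-gemma-3-4b-it.gguf --port 8080
# (filenames and port are placeholders)

with open("photo.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

resp = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "model": "gemma-3-4b-it",  # llama-server serves whatever it was started with; this field is largely informational
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": "Describe this image in one sentence."},
                    {
                        "type": "image_url",
                        "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"},
                    },
                ],
            }
        ],
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```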

Model & Runtime Support

  • Supports a wide range of multimodal models, including Gemma 3 (4B–27B), Pixtral/Mistral Small, and SmolVLM/SmolVLM2 (including video variants); a minimal invocation sketch follows this list.
  • Compared to Ollama:
    • Tighter integration with the ggml stack allows more aggressive optimizations (2D-RoPE tricks, upcoming flash attention) and broader model coverage.
    • Ollama has some features llama.cpp lacks (e.g., Gemma 3 iSWA / interleaved sliding window attention), and now uses its own Go-based runner for new models.
  • Vision support existed before (e.g., LLaVA-style models) but had been deprecated; this is a cleaner, generalized reintroduction.
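
For one-off local runs, these models can also be driven from llama-mtmd-cli. A minimal sketch, wrapped in Python for scripting; the flags shown (-hf, --image, -p) and the Hugging Face repo name reflect current llama.cpp usage but should be treated as assumptions and confirmed with llama-mtmd-cli --help on your build.

```python
import subprocess

# Sketch: describe one image with a Gemma 3 vision model via llama-mtmd-cli.
result = subprocess.run(
    [
        "llama-mtmd-cli",
        "-hf", "ggml-org/gemma-3-4b-it-GGUF",  # fetch a GGUF from Hugging Face; repo name is a placeholder choice
        "--image", "photo.jpg",
        "-p", "List five keywords for this photo.",
    ],
    capture_output=True,
    text=True,
    check=True,
)
print(result.stdout)
```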

Performance, Installation, and Tooling

  • Users report good speeds on Apple Silicon (M1/M2), older PCs, and Vulkan GPUs; 4B vision models can describe images in ~15 seconds on an M1.
  • GPU offload is tuned via -ngl; the Metal backend now defaults to maximum offload, while CUDA still requires an explicit value.
  • Installation paths discussed:
    • Build from source (cmake) or use Homebrew (install with --HEAD, or upgrade once the formula updates).
    • Precompiled multi-platform binaries exist; macOS users may need to clear quarantine attributes (see the sketch below).
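
For the precompiled-binary route on macOS, here is a small sketch of the two steps above: clearing the quarantine attribute and launching llama-server with an explicit GPU offload value. The paths, filenames, port, and -ngl value are placeholders.

```python
import subprocess

BIN_DIR = "./llama.cpp-bin"           # wherever the downloaded release was extracted (placeholder)
MODEL = "gemma-3-4b-it.gguf"          # placeholder model and projector filenames
MMPROJ = "mmproj-gemma-3-4b-it.gguf"

# macOS marks downloaded binaries as quarantined; clear the attribute recursively.
subprocess.run(["xattr", "-dr", "com.apple.quarantine", BIN_DIR], check=True)

# Start llama-server with an explicit GPU offload value. On Metal builds this is
# effectively the default; on CUDA you still need to pass -ngl yourself.
# (This call blocks while the server runs.)
subprocess.run(
    [
        f"{BIN_DIR}/llama-server",
        "-m", MODEL,
        "--mmproj", MMPROJ,
        "-ngl", "99",                 # offload (up to) 99 layers to the GPU
        "--port", "8080",
    ],
    check=True,
)
```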

Use Cases and Experiments

  • Photo management: auto-generating keywords, descriptions, basic OCR, and location/context inference for large image sets, with results stored in SQLite for search and summarization (sketched after this list).
  • The SmolVLM series is suggested for real-time, low-resource tasks such as home video surveillance.
  • Ideas were floated for UI development tooling and automated screenshot-to-feedback workflows.
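
The photo-management workflow above lends itself to a small pipeline: ask a locally running vision model for a description of each image and store the results in SQLite for later search. A rough sketch under the same assumptions as the earlier server example; the table schema and prompt are invented for illustration.

```python
import base64
import sqlite3
from pathlib import Path

import requests

DB = sqlite3.connect("photos.db")
DB.execute("CREATE TABLE IF NOT EXISTS photos (path TEXT PRIMARY KEY, description TEXT)")


def describe(image_path: Path) -> str:
    """Ask a locally running llama-server vision model to describe one image."""
    image_b64 = base64.b64encode(image_path.read_bytes()).decode()
    resp = requests.post(
        "http://localhost:8080/v1/chat/completions",  # placeholder local endpoint
        json={
            "messages": [{
                "role": "user",
                "content": [
                    {"type": "text",
                     "text": "Give a one-sentence description and five keywords."},
                    {"type": "image_url",
                     "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
                ],
            }],
        },
        timeout=300,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]


# Index every JPEG in a folder, then run a trivial keyword search.
for path in Path("photos").glob("*.jpg"):
    DB.execute("INSERT OR REPLACE INTO photos VALUES (?, ?)", (str(path), describe(path)))
    DB.commit()

for (match,) in DB.execute("SELECT path FROM photos WHERE description LIKE '%beach%'"):
    print(match)
```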

Limitations, Bugs, and Quality Issues

  • Some users initially got highly specific but clearly wrong “stock” descriptions, which were traced to the image not actually being loaded.
  • The quality of tiny models (sub-2.2B) is questioned; 4B is “good enough” for tagging but misses finer details compared with larger multimodal models.
  • No image generation support; llama.cpp focuses on transformer LLMs, not diffusion models.
  • Multimodal benchmarking for open-source models is seen as underdeveloped, and open models are viewed as lagging behind closed-source offerings.

Broader AI Reflections

  • Some commenters are excited about edge inference and rapid app development; others are skeptical about claims of near-term macroeconomic impact.
  • Debate over whether current LLMs are just “stochastic parrots” or are capable of emergent reasoning when placed in feedback loops.