Vision Now Available in Llama.cpp

What’s New in llama.cpp Vision Support

  • Vision support has been reintroduced and generalized:
    • Unified under a new llama-mtmd-cli tool instead of per-model CLIs.
    • Integrated into llama-server, so the OpenAI-compatible HTTP API and web UI can now handle images (see the sketch after this list).
    • Image-to-embedding preprocessing has been moved into a separate library, similar in spirit to how tokenizers are separated out for text models.
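
As a quick illustration of the llama-server integration, here is a minimal sketch that posts a base64-encoded image to the OpenAI-compatible chat endpoint. The port, model/projector filenames, and prompt are placeholders; check the exact launch flags against your llama-server build.

```python
import base64

import requests

# Assumes llama-server is already running locally with a vision-capable GGUF
# and its multimodal projector, e.g. something like:
#   llama-server -m gemma-3-4b-it.gguf --mmproj mmproj-gemma-3-4b-it.gguf --port 8080
# (filenames and port are placeholders)

with open("photo.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

resp = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "model": "gemma-3-4b-it",  # llama-server serves whatever it was started with; this field is largely informational
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": "Describe this image in one sentence."},
                    {
                        "type": "image_url",
                        "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"},
                    },
                ],
            }
        ],
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```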

Model & Runtime Support

  • Supports a wide range of multimodal models, including Gemma 3 (4B–27B), Pixtral/Mistral Small, and SmolVLM/SmolVLM2 (including video variants); a minimal invocation sketch follows this list.
  • Compared to Ollama:
    • Tighter integration with the ggml stack allows more aggressive optimizations (2D-RoPE tricks, upcoming flash attention) and broader model coverage.
    • Ollama has some features llama.cpp lacks (e.g., Gemma 3 iSWA / interleaved sliding window attention), and now uses its own Go-based runner for new models.
  • Vision support existed before (e.g., LLaVA-style models) but had been deprecated; this is a cleaner, generalized reintroduction.
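
For one-off local runs, these models can also be driven from llama-mtmd-cli. A minimal sketch, wrapped in Python for scripting; the flags shown (-hf, --image, -p) and the Hugging Face repo name reflect current llama.cpp usage but should be treated as assumptions and confirmed with llama-mtmd-cli --help on your build.

```python
import subprocess

# Sketch: describe one image with a Gemma 3 vision model via llama-mtmd-cli.
result = subprocess.run(
    [
        "llama-mtmd-cli",
        "-hf", "ggml-org/gemma-3-4b-it-GGUF",  # fetch a GGUF from Hugging Face; repo name is a placeholder choice
        "--image", "photo.jpg",
        "-p", "List five keywords for this photo.",
    ],
    capture_output=True,
    text=True,
    check=True,
)
print(result.stdout)
```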

Performance, Installation, and Tooling

  • Users report good speeds on Apple Silicon (M1/M2), older PCs, and Vulkan GPUs; 4B vision models can describe images in ~15 seconds on an M1.
  • GPU offload is tuned via -ngl; the Metal backend now defaults to maximum offload, while CUDA still requires an explicit value.
  • Installation paths discussed:
    • Build from source (cmake) or use Homebrew (install with --HEAD, or upgrade once the formula updates).
    • Precompiled multi-platform binaries exist; macOS users may need to clear quarantine attributes (see the sketch below).
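
For the precompiled-binary route on macOS, here is a small sketch of the two steps above: clearing the quarantine attribute and launching llama-server with an explicit GPU offload value. The paths, filenames, port, and -ngl value are placeholders.

```python
import subprocess

BIN_DIR = "./llama.cpp-bin"           # wherever the downloaded release was extracted (placeholder)
MODEL = "gemma-3-4b-it.gguf"          # placeholder model and projector filenames
MMPROJ = "mmproj-gemma-3-4b-it.gguf"

# macOS marks downloaded binaries as quarantined; clear the attribute recursively.
subprocess.run(["xattr", "-dr", "com.apple.quarantine", BIN_DIR], check=True)

# Start llama-server with an explicit GPU offload value. On Metal builds this is
# effectively the default; on CUDA you still need to pass -ngl yourself.
# (This call blocks while the server runs.)
subprocess.run(
    [
        f"{BIN_DIR}/llama-server",
        "-m", MODEL,
        "--mmproj", MMPROJ,
        "-ngl", "99",                 # offload (up to) 99 layers to the GPU
        "--port", "8080",
    ],
    check=True,
)
```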

Use Cases and Experiments

  • Photo management: auto-generating keywords, descriptions, basic OCR, and location/context inference for large image sets, with results stored in SQLite for search and summarization (sketched after this list).
  • The SmolVLM series is suggested for real-time, low-resource tasks such as home video surveillance.
  • Ideas were floated for UI development tooling and automated screenshot-to-feedback workflows.
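
The photo-management workflow above lends itself to a small pipeline: ask a locally running vision model for a description of each image and store the results in SQLite for later search. A rough sketch under the same assumptions as the earlier server example; the table schema and prompt are invented for illustration.

```python
import base64
import sqlite3
from pathlib import Path

import requests

DB = sqlite3.connect("photos.db")
DB.execute("CREATE TABLE IF NOT EXISTS photos (path TEXT PRIMARY KEY, description TEXT)")


def describe(image_path: Path) -> str:
    """Ask a locally running llama-server vision model to describe one image."""
    image_b64 = base64.b64encode(image_path.read_bytes()).decode()
    resp = requests.post(
        "http://localhost:8080/v1/chat/completions",  # placeholder local endpoint
        json={
            "messages": [{
                "role": "user",
                "content": [
                    {"type": "text",
                     "text": "Give a one-sentence description and five keywords."},
                    {"type": "image_url",
                     "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
                ],
            }],
        },
        timeout=300,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]


# Index every JPEG in a folder, then run a trivial keyword search.
for path in Path("photos").glob("*.jpg"):
    DB.execute("INSERT OR REPLACE INTO photos VALUES (?, ?)", (str(path), describe(path)))
    DB.commit()

for (match,) in DB.execute("SELECT path FROM photos WHERE description LIKE '%beach%'"):
    print(match)
```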

Limitations, Bugs, and Quality Issues

  • Some users initially got highly specific but clearly wrong “stock” descriptions, which were traced to the image not actually being loaded.
  • The quality of tiny models (sub-2.2B) is questioned; 4B is “good enough” for tagging but misses finer details compared with larger multimodal models.
  • No image generation support; llama.cpp focuses on transformer LLMs, not diffusion models.
  • Multimodal benchmarking for open-source models is seen as underdeveloped, and open models are viewed as lagging behind closed-source offerings.

Broader AI Reflections

  • Some commenters are excited about edge inference and rapid app development; others are skeptical about claims of near-term macroeconomic impact.
  • Debate over whether current LLMs are just “stochastic parrots” or are capable of emergent reasoning when placed in feedback loops.