Llama.vim – Local LLM-assisted text completion
Perceived usefulness of LLM code assistants
- Experiences vary widely. Some find local models produce plausible but wrong “garbage,” especially on complex or niche tasks.
- Others report strong gains for:
  - Boilerplate and repetitive code.
  - One-off scripts and small utilities.
  - Unit test generation (output needs review and fixes, but authoring time drops drastically).
  - Design brainstorming and alternative approaches, even when code isn’t copy-paste ready.
- Several note that assistants are more like “power tools” than a replacement developer: useful if you already know what you’re doing and can validate output.
Hosted vs local models
- Hosted, state-of-the-art models are generally considered higher quality.
- Local models that are economical to self-host (small, heavily quantized) often underperform; some only find larger local models (e.g., ~70B) worthwhile.
- Latency and quality tradeoffs lead some to prefer traditional LSP completion over LLMs for day-to-day coding.
Domain, language, and documentation issues
- Quality is highly uneven across domains and languages; web and popular languages fare better than specialized areas (compilers, hardware SDKs, niche languages).
- Models often use outdated APIs (e.g., older game engine versions) even when instructed otherwise.
- Retrieval-Augmented Generation (RAG) is highlighted as a way to ground answers in up-to-date docs, but is not yet seamless in many tools.
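To make the RAG idea concrete, here is a toy sketch of the retrieve-then-prompt loop. It uses bag-of-words cosine similarity as a stand-in for a real embedding model, and the doc strings and function names are hypothetical illustrations, not any tool's actual API:

```python
import math
from collections import Counter

def vectorize(text: str) -> Counter:
    # Toy bag-of-words "embedding"; real RAG uses a neural embedding model.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, docs: list[str], k: int = 2) -> list[str]:
    # Rank candidate doc snippets by similarity to the query, keep top-k.
    qv = vectorize(query)
    ranked = sorted(docs, key=lambda d: cosine(qv, vectorize(d)), reverse=True)
    return ranked[:k]

def build_prompt(query: str, docs: list[str]) -> str:
    # Ground the model by prepending the retrieved, up-to-date docs.
    context = "\n---\n".join(retrieve(query, docs))
    return f"Use only the docs below.\n{context}\n\nQuestion: {query}"
```

The point of the grounding step is to counter the outdated-API problem above: the model answers from the current docs placed in its context rather than from stale training data.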
Editor integrations and workflows
- Vim/Neovim, Emacs, and VS Code all have multiple LLM integrations; some use Vim specifically when they don’t want AI help.
- Some prefer chat-style tools; others rely solely on inline completion (especially fill-in-the-middle / FIM).
- LSP-based completion plus snippets are, for some, “good enough,” with lower latency than LLMs.
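For readers unfamiliar with FIM: the model is shown the code both before and after the cursor and asked to generate the span in between. A minimal sketch of assembling such a prompt, assuming Qwen2.5-Coder-style special tokens (the token names vary by model family, and the example snippet is hypothetical):

```python
def build_fim_prompt(prefix: str, suffix: str) -> str:
    # Fill-in-the-middle: the model sees the code before AND after the
    # cursor, then generates what belongs between them. These special
    # tokens are one common convention (Qwen2.5-Coder style); other
    # models use different token names.
    return f"<|fim_prefix|>{prefix}<|fim_suffix|>{suffix}<|fim_middle|>"

# Cursor sits inside the function body; the call site follows it.
prefix = "def add(a, b):\n    "
suffix = "\n\nprint(add(2, 3))"
prompt = build_fim_prompt(prefix, suffix)
```

This is why FIM-based inline completion can feel stronger than plain left-to-right completion: the suffix constrains the generation to fit what already exists below the cursor.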
Technical design of llama.vim / llama.cpp server
- Plugin uses a “ring context” and KV cache shifting to:
  - Reuse previously computed context across requests.
  - Maintain a large effective context (thousands of tokens) without recomputing everything.
- Context is split into:
  - Local context around the cursor.
  - Global context stored in a ring buffer, reused via cache shifting.
- Context size and batch size are tunable to trade off speed vs quality.
- Completion stopping criteria include a time limit, a token count, indentation heuristics, and low token probability; the low-probability cutoff may currently truncate larger completions.
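The ring-buffer and stopping ideas above can be sketched as follows. This is a deliberate simplification with hypothetical parameter values; the real plugin pairs the ring with llama.cpp's server-side KV-cache shifting, which this sketch does not model:

```python
import time
from collections import deque

class RingContext:
    """Keep the most recent N global-context chunks; oldest are evicted.

    Hypothetical simplification: in llama.vim, chunks that stay in the
    ring are reused via KV-cache shifting instead of being re-evaluated
    by the model on every request.
    """

    def __init__(self, max_chunks: int = 16):
        self.chunks: deque[str] = deque(maxlen=max_chunks)

    def add(self, chunk: str) -> None:
        if chunk not in self.chunks:  # skip duplicates already in the ring
            self.chunks.append(chunk)

    def render(self) -> str:
        # Global context only; the local context around the cursor is
        # sent separately with each completion request.
        return "\n".join(self.chunks)

def should_stop(tokens: list[str], probs: list[float], t_start: float,
                max_tokens: int = 64, max_seconds: float = 1.0,
                min_prob: float = 0.01) -> bool:
    """Composite stopping rule mirroring the criteria listed above
    (the thresholds here are illustrative, not the plugin's defaults)."""
    if time.monotonic() - t_start > max_seconds:  # wall-clock time limit
        return True
    if len(tokens) >= max_tokens:                 # token budget exhausted
        return True
    if probs and probs[-1] < min_prob:            # model is unsure: cut off
        return True
    return False
```

The low-probability cutoff in `should_stop` illustrates the truncation issue mentioned above: a single uncertain token anywhere in a long completion ends it early.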
Hardware and performance considerations
- VRAM is the main bottleneck. Very low VRAM (e.g., 2 GB) is widely seen as insufficient to run models worth using.
- Users report:
  - 7B models running acceptably on CPUs with enough RAM (e.g., 32–64 GB), though slower.
  - Better experience with consumer GPUs in the ~12–24 GB VRAM range.
  - Apple M-series machines performing surprisingly well due to unified memory.
- Upgrading system RAM can enable large models on CPU at low token rates; GPUs remain preferred for interactive use.
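A rough back-of-the-envelope ties these numbers together: weight memory is roughly parameter count × bits per weight ÷ 8, plus runtime overhead. The heuristic below (including its 1.2× overhead factor) is an assumption for illustration and ignores KV-cache growth with context length:

```python
def approx_weight_gb(params_billions: float, bits_per_weight: float,
                     overhead: float = 1.2) -> float:
    """Very rough weight-memory estimate in GB (hypothetical heuristic).

    Ignores the KV cache, which grows with context length, so treat the
    result as a lower bound on what you actually need.
    """
    weight_bytes = params_billions * 1e9 * bits_per_weight / 8
    return weight_bytes * overhead / 1e9

# A 7B model at 4-bit quantization lands around 4 GB, which is why ~2 GB
# of VRAM is too little and ~12-24 GB is comfortable; a 70B model at
# 4-bit lands around 42 GB, pushing it to large GPUs, Apple unified
# memory, or CPU inference with plenty of system RAM.
```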
Comparisons and alternatives
- Some switch from Copilot-like hosted tools to llama.vim-based or other local solutions due to speed, privacy, and cost.
- Others disable AI tools entirely after finding modern LSPs sufficient.
- Tabby and other local copilot-like servers are discussed; differences include how they gather context (editor-following vs RAG-based).