Llama.vim – Local LLM-assisted text completion

Perceived usefulness of LLM code assistants

  • Experiences vary widely. Some find local models produce plausible but wrong “garbage,” especially on complex or niche tasks.
  • Others report strong gains for:
    • Boilerplate and repetitive code.
    • One-off scripts and small utilities.
    • Unit test generation (needing review/fixes but cutting authoring time drastically).
    • Design brainstorming and alternative approaches, even when code isn’t copy-paste ready.
  • Several note that assistants are more like “power tools” than a replacement developer: useful if you already know what you’re doing and can validate output.

Hosted vs local models

  • Hosted, state-of-the-art models are generally considered higher quality.
  • Local models that are economical to self-host (small, heavily quantized) often underperform; some only find larger local models (e.g., ~70B) worthwhile.
  • Latency and quality tradeoffs lead some to prefer traditional LSP completion over LLMs for day-to-day coding.

Domain, language, and documentation issues

  • Quality is highly uneven across domains and languages; web and popular languages fare better than specialized areas (compilers, hardware SDKs, niche languages).
  • Models often use outdated APIs (e.g., older game engine versions) even when instructed otherwise.
  • Retrieval-Augmented Generation (RAG) is highlighted as a way to ground answers in up-to-date docs, but is not yet seamless in many tools.
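The grounding idea behind RAG can be sketched minimally: retrieve the most relevant documentation snippets for a query and prepend them to the prompt. The sketch below is a hypothetical illustration using naive keyword overlap; real tools typically use embedding-based retrieval.

```python
# Hypothetical RAG sketch: rank docs by word overlap with the query
# (real systems use embeddings), then prepend the top hits to the prompt
# so the model answers from current docs rather than stale training data.
def retrieve(query: str, docs: list[str], k: int = 2) -> list[str]:
    """Return the k docs sharing the most words with the query."""
    q = set(query.lower().split())
    scored = sorted(docs,
                    key=lambda d: len(q & set(d.lower().split())),
                    reverse=True)
    return scored[:k]

def build_prompt(query: str, docs: list[str], k: int = 2) -> str:
    """Ground the question in retrieved snippets."""
    context = "\n".join(retrieve(query, docs, k))
    return f"Use only the docs below.\n---\n{context}\n---\nQuestion: {query}"
```

The point of the sketch is the workflow, not the scoring function: the retrieval step is what keeps answers anchored to up-to-date APIs.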

Editor integrations and workflows

  • Vim/Neovim, Emacs, and VS Code all have multiple LLM integrations; some use Vim specifically when they don’t want AI help.
  • Some prefer chat-style tools; others rely solely on inline completion (especially fill-in-the-middle / FIM).
  • LSP-based completion plus snippets are, for some, “good enough,” with lower latency than LLMs.
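Fill-in-the-middle completion works by giving the model the code before and after the cursor and asking it to generate the span in between. A minimal sketch of such a prompt is below; the sentinel token names shown are the Qwen2.5-Coder convention and are an assumption here — other FIM-capable models use different sentinels.

```python
# Sketch of a fill-in-the-middle (FIM) prompt. The sentinel strings
# (<|fim_prefix|> etc.) follow the Qwen2.5-Coder convention; other
# models use different token names, so treat them as an assumption.
def fim_prompt(prefix: str, suffix: str) -> str:
    """prefix = code before the cursor, suffix = code after it;
    the model is expected to emit the missing middle."""
    return f"<|fim_prefix|>{prefix}<|fim_suffix|>{suffix}<|fim_middle|>"

# Usage: complete the body of a half-written function.
prompt = fim_prompt("def add(a, b):\n    return ", "\n\nprint(add(1, 2))\n")
```

Because the suffix is part of the prompt, FIM completions can match what already follows the cursor — something chat-style prompting handles poorly.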

Technical design of llama.vim / llama.cpp server

  • The plugin uses a “ring context” and KV cache shifting to:
    • Reuse previously computed context across requests.
    • Maintain a large effective context (thousands of tokens) without recomputing everything.
  • Context is split into:
    • Local context around the cursor.
    • Global context stored in a ring buffer, reused via cache shifting.
  • Context size and batch size are tunable to trade off speed vs quality.
  • Completion stopping criteria include a time limit, a token count, indentation heuristics, and a low-token-probability cutoff; the probability cutoff may currently truncate longer completions.
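
The two ideas above can be sketched in a few lines. This is an illustrative toy, not llama.vim’s actual code: a fixed-size ring buffer of “global” context chunks (oldest evicted first, which pairs naturally with server-side KV cache reuse of the unchanged prefix) and a stop-generation check mirroring the listed criteria.

```python
from collections import deque

# Toy sketch of llama.vim-style context handling; names and limits are
# assumptions, not the plugin's real implementation.
class RingContext:
    def __init__(self, max_chunks: int = 16):
        # deque with maxlen silently evicts the oldest chunk when full
        self.chunks: deque = deque(maxlen=max_chunks)

    def add(self, chunk: str) -> None:
        if chunk not in self.chunks:  # avoid re-sending identical context
            self.chunks.append(chunk)

    def prompt(self, local_context: str) -> str:
        # Global chunks stay in a stable order so the server can reuse
        # previously computed KV cache for the unchanged prefix, then
        # the local context around the cursor is appended.
        return "\n".join(self.chunks) + "\n" + local_context

def should_stop(tokens_emitted: int, elapsed_ms: float, last_prob: float,
                max_tokens: int = 64, budget_ms: float = 500.0,
                min_prob: float = 0.1) -> bool:
    """Stop on token count, time limit, or low token probability."""
    return (tokens_emitted >= max_tokens
            or elapsed_ms >= budget_ms
            or last_prob < min_prob)
```

The eviction policy is the interesting design choice: dropping the oldest chunk keeps the shared prefix of consecutive prompts long, which is what makes KV cache shifting pay off.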

Hardware and performance considerations

  • VRAM is the main bottleneck. Very low VRAM (e.g., 2 GB) is widely seen as insufficient for attractive models.
  • Users report:
    • 7B models running acceptably on CPUs with enough RAM (e.g., 32–64 GB), though slower.
    • Better experience with consumer GPUs in the ~12–24 GB VRAM range.
    • Apple M-series machines performing surprisingly well due to unified memory.
  • Upgrading system RAM can enable large models on CPU at low token rates; GPUs remain preferred for interactive use.
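
The VRAM figures above can be sanity-checked with back-of-the-envelope arithmetic: weight memory is roughly parameter count times bits per weight. The bits-per-weight value below (~4.5 for a 4-bit quantization such as llama.cpp’s Q4_K_M) is an approximation, and real usage adds KV cache and activation memory on top.

```python
# Rough weight-memory estimate for a quantized model. bits_per_weight
# is an approximation (e.g., ~4.5 for a Q4_K_M-style quantization);
# KV cache and activations need additional memory at runtime.
def weight_gb(n_params_billion: float, bits_per_weight: float) -> float:
    bytes_total = n_params_billion * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9  # decimal GB

# A ~7B model at ~4.5 bits/weight needs about 4 GB for weights, which
# fits comfortably on a 12-24 GB GPU; a 70B model at the same
# quantization needs ~40 GB, pushing it to CPU RAM or multiple GPUs.
```

This also explains why 2 GB of VRAM is considered insufficient: even a small 3B model at 4-bit quantization leaves little headroom for the KV cache.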

Comparisons and alternatives

  • Some switch from Copilot-like hosted tools to llama.vim-based or other local solutions due to speed, privacy, and cost.
  • Others disable AI tools entirely after finding modern LSPs sufficient.
  • Tabby and other local copilot-like servers are discussed; differences include how they gather context (editor-following vs RAG-based).