Llama.vim – Local LLM-assisted text completion

Perceived usefulness of LLM code assistants

  • Experiences vary widely. Some find local models produce plausible but wrong “garbage,” especially on complex or niche tasks.
  • Others report strong gains for:
    • Boilerplate and repetitive code.
    • One-off scripts and small utilities.
    • Unit test generation (needing review/fixes but cutting authoring time drastically).
    • Design brainstorming and alternative approaches, even when code isn’t copy-paste ready.
  • Several note that assistants are more like “power tools” than a replacement developer: useful if you already know what you’re doing and can validate output.

Hosted vs local models

  • Hosted, state-of-the-art models are generally considered higher quality.
  • Local models that are economical to self-host (small, heavily quantized) often underperform; some only find larger local models (e.g., ~70B) worthwhile.
  • Latency and quality tradeoffs lead some to prefer traditional LSP completion over LLMs for day-to-day coding.

Domain, language, and documentation issues

  • Quality is highly uneven across domains and languages; web and popular languages fare better than specialized areas (compilers, hardware SDKs, niche languages).
  • Models often use outdated APIs (e.g., older game engine versions) even when instructed otherwise.
  • Retrieval-Augmented Generation (RAG) is highlighted as a way to ground answers in up-to-date docs, but is not yet seamless in many tools.
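The grounding idea behind RAG can be sketched minimally: retrieve the most relevant documentation snippets for a query and prepend them to the prompt. The sketch below is a hypothetical illustration using naive keyword overlap; real tools typically use embedding-based retrieval.

```python
# Hypothetical RAG sketch: rank docs by word overlap with the query
# (real systems use embeddings), then prepend the top hits to the prompt
# so the model answers from current docs rather than stale training data.
def retrieve(query: str, docs: list[str], k: int = 2) -> list[str]:
    """Return the k docs sharing the most words with the query."""
    q = set(query.lower().split())
    scored = sorted(docs,
                    key=lambda d: len(q & set(d.lower().split())),
                    reverse=True)
    return scored[:k]

def build_prompt(query: str, docs: list[str], k: int = 2) -> str:
    """Ground the question in retrieved snippets."""
    context = "\n".join(retrieve(query, docs, k))
    return f"Use only the docs below.\n---\n{context}\n---\nQuestion: {query}"
```

The point of the sketch is the workflow, not the scoring function: the retrieval step is what keeps answers anchored to up-to-date APIs.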

Editor integrations and workflows

  • Vim/Neovim, Emacs, and VS Code all have multiple LLM integrations; some use Vim specifically when they don’t want AI help.
  • Some prefer chat-style tools; others rely solely on inline completion (especially fill-in-the-middle / FIM).
  • LSP-based completion plus snippets are, for some, “good enough,” with lower latency than LLMs.
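Fill-in-the-middle completion works by giving the model the code before and after the cursor and asking it to generate the span in between. A minimal sketch of such a prompt is below; the sentinel token names shown are the Qwen2.5-Coder convention and are an assumption here — other FIM-capable models use different sentinels.

```python
# Sketch of a fill-in-the-middle (FIM) prompt. The sentinel strings
# (<|fim_prefix|> etc.) follow the Qwen2.5-Coder convention; other
# models use different token names, so treat them as an assumption.
def fim_prompt(prefix: str, suffix: str) -> str:
    """prefix = code before the cursor, suffix = code after it;
    the model is expected to emit the missing middle."""
    return f"<|fim_prefix|>{prefix}<|fim_suffix|>{suffix}<|fim_middle|>"

# Usage: complete the body of a half-written function.
prompt = fim_prompt("def add(a, b):\n    return ", "\n\nprint(add(1, 2))\n")
```

Because the suffix is part of the prompt, FIM completions can match what already follows the cursor — something chat-style prompting handles poorly.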

Technical design of llama.vim / llama.cpp server

  • The plugin uses a “ring context” and KV cache shifting to:
    • Reuse previously computed context across requests.
    • Maintain a large effective context (thousands of tokens) without recomputing everything.
  • Context is split into:
    • Local context around the cursor.
    • Global context stored in a ring buffer, reused via cache shifting.
  • Context size and batch size are tunable to trade off speed vs quality.
  • Completion stopping criteria include a time limit, a token count, indentation heuristics, and a low-token-probability cutoff; the probability cutoff may currently truncate longer completions.
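
The two ideas above can be sketched in a few lines. This is an illustrative toy, not llama.vim’s actual code: a fixed-size ring buffer of “global” context chunks (oldest evicted first, which pairs naturally with server-side KV cache reuse of the unchanged prefix) and a stop-generation check mirroring the listed criteria.

```python
from collections import deque

# Toy sketch of llama.vim-style context handling; names and limits are
# assumptions, not the plugin's real implementation.
class RingContext:
    def __init__(self, max_chunks: int = 16):
        # deque with maxlen silently evicts the oldest chunk when full
        self.chunks: deque = deque(maxlen=max_chunks)

    def add(self, chunk: str) -> None:
        if chunk not in self.chunks:  # avoid re-sending identical context
            self.chunks.append(chunk)

    def prompt(self, local_context: str) -> str:
        # Global chunks stay in a stable order so the server can reuse
        # previously computed KV cache for the unchanged prefix, then
        # the local context around the cursor is appended.
        return "\n".join(self.chunks) + "\n" + local_context

def should_stop(tokens_emitted: int, elapsed_ms: float, last_prob: float,
                max_tokens: int = 64, budget_ms: float = 500.0,
                min_prob: float = 0.1) -> bool:
    """Stop on token count, time limit, or low token probability."""
    return (tokens_emitted >= max_tokens
            or elapsed_ms >= budget_ms
            or last_prob < min_prob)
```

The eviction policy is the interesting design choice: dropping the oldest chunk keeps the shared prefix of consecutive prompts long, which is what makes KV cache shifting pay off.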

Hardware and performance considerations

  • VRAM is the main bottleneck. Very low VRAM (e.g., 2 GB) is widely seen as insufficient for attractive models.
  • Users report:
    • 7B models running acceptably on CPUs with enough RAM (e.g., 32–64 GB), though slower.
    • Better experience with consumer GPUs in the ~12–24 GB VRAM range.
    • Apple M-series machines performing surprisingly well due to unified memory.
  • Upgrading system RAM can enable large models on CPU at low token rates; GPUs remain preferred for interactive use.
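
The VRAM figures above can be sanity-checked with back-of-the-envelope arithmetic: weight memory is roughly parameter count times bits per weight. The bits-per-weight value below (~4.5 for a 4-bit quantization such as llama.cpp’s Q4_K_M) is an approximation, and real usage adds KV cache and activation memory on top.

```python
# Rough weight-memory estimate for a quantized model. bits_per_weight
# is an approximation (e.g., ~4.5 for a Q4_K_M-style quantization);
# KV cache and activations need additional memory at runtime.
def weight_gb(n_params_billion: float, bits_per_weight: float) -> float:
    bytes_total = n_params_billion * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9  # decimal GB

# A ~7B model at ~4.5 bits/weight needs about 4 GB for weights, which
# fits comfortably on a 12-24 GB GPU; a 70B model at the same
# quantization needs ~40 GB, pushing it to CPU RAM or multiple GPUs.
```

This also explains why 2 GB of VRAM is considered insufficient: even a small 3B model at 4-bit quantization leaves little headroom for the KV cache.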

Comparisons and alternatives

  • Some switch from Copilot-like hosted tools to llama.vim-based or other local solutions due to speed, privacy, and cost.
  • Others disable AI tools entirely after finding modern LSPs sufficient.
  • Tabby and other local copilot-like servers are discussed; differences include how they gather context (editor-following vs RAG-based).