Forget ChatGPT: why researchers now run small AIs on their laptops
Why run small / local models?
- Privacy and control: Avoid sending sensitive or proprietary data to remote services; some users work on air‑gapped or highly regulated systems.
- Stability and reproducibility: Hosted models change silently; local models are version‑pinned and debuggable.
- Customization: Easier to fine‑tune, remove safety filters, or build uncensored/“abliterated” variants for domains that hosted models refuse.
- Offline use: Useful on flights, in poor connectivity, or as a “local Google” for coding and systems work.
- Cost & lock‑in: One‑time hardware spend vs ongoing API bills and vendor lock‑in, especially for fine‑tuned models.
Hardware, performance, and scaling
- Recent and upcoming laptop chips (e.g., AMD Strix Halo, Apple M‑series) offer large unified memory pools and NPUs, but memory bandwidth well below that of high‑end discrete GPUs.
- Bandwidth and VRAM both matter: commenters report that 70B+ models on laptop‑class hardware are slow, with a few tokens/s and long time‑to‑first‑token (see the back‑of‑envelope estimate after this list).
- Tricks: quantization, mixture‑of‑experts (MoE) models, multi‑Mac clusters, offloading layers to a discrete GPU, and mmap’ing weights from disk; a sketch of several of these follows the list.
- High‑end local setups (multiple RTX 4090s, H100s, big‑RAM Xeon boxes) can run 128B–405B models, but the cost and power draw are substantial.
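
A rough way to see why 70B‑class models crawl on laptop hardware: at decode time, a dense model streams every active weight through memory once per generated token, so memory bandwidth sets a hard ceiling on tokens/s. The bandwidth figures below are illustrative assumptions, not benchmarks:

```python
# Back-of-envelope decode speed: each generated token must read every
# weight once, so memory bandwidth is the ceiling for a dense model.
# Bandwidth numbers below are rough, illustrative assumptions.

def decode_tokens_per_sec(params_b: float, bits_per_weight: float,
                          bandwidth_gb_s: float) -> float:
    """Upper bound on tokens/s for a dense model at a given quantization."""
    weight_bytes_gb = params_b * bits_per_weight / 8  # GB read per token
    return bandwidth_gb_s / weight_bytes_gb

# A 70B model at ~4.5 bits/weight (Q4_K_M-style quantization) is ~39 GB:
for name, bw in [("laptop unified memory (~100 GB/s)", 100),
                 ("Apple M-series Max (~400 GB/s)", 400),
                 ("RTX 4090 (~1000 GB/s)", 1000)]:
    print(f"{name}: ~{decode_tokens_per_sec(70, 4.5, bw):.1f} tok/s ceiling")
```

This reproduces the complaint above: laptop‑class bandwidth caps a quantized 70B model at roughly 2–3 tok/s before any compute overhead.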
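
For the quantization/offload/mmap tricks, here is a minimal sketch using llama-cpp-python; the GGUF path and layer count are placeholders for whatever model and GPU you actually have:

```python
# Minimal sketch via llama-cpp-python (pip install llama-cpp-python).
# The model path is a placeholder; any quantized GGUF file works.
from llama_cpp import Llama

llm = Llama(
    model_path="models/llama-3.1-8b-instruct.Q4_K_M.gguf",  # 4-bit quantized weights
    n_gpu_layers=20,   # offload the first 20 transformer layers to a discrete GPU
    n_ctx=4096,        # context window
    use_mmap=True,     # mmap weights from disk instead of copying them into RAM
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Summarize why quantization helps."}],
    max_tokens=128,
)
print(out["choices"][0]["message"]["content"])
```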
Model quality: small vs frontier
- Consensus: 8–14B local models (Llama 3.1 8B, Qwen, Mistral‑class) are now “good enough” for many tasks (summarization, basic coding, note cleanup).
- Several commenters still find them clearly worse than GPT‑4/Claude, especially for complex reasoning, robust codegen, and general knowledge.
- Some argue big labs keep scaling because small models cannot truly compete; others think efficient small models plus systems work may erode that lead.
Workflows & tools
- Popular stacks: Ollama, llama.cpp / llamafiles, LM Studio, OpenWebUI, Jan, Twinny, GPT4All, and various IDE integrations (Continue, gen.nvim, local Copilot‑style autocomplete); several of these expose a local HTTP API (see the sketch after this list).
- Common use cases: personal knowledge bases with embeddings + RAG (a minimal sketch follows the list), Obsidian integration, email spam filtering, local Perplexity‑style web search, code assistance, multimodal OCR/screenshot QA, and voice‑note → Whisper → LLM → structured‑notes pipelines.
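
Several of these stacks serve a local HTTP API; the sketch below hits Ollama's default endpoint on localhost:11434 (the model tag `llama3.1:8b` is just an example of something fetched with `ollama pull`):

```python
# Hedged sketch: query a locally running Ollama server over its HTTP API.
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3.1:8b",   # any model previously pulled with `ollama pull`
        "prompt": "Explain RAG in two sentences.",
        "stream": False,          # return one JSON object instead of a token stream
    },
    timeout=120,
)
print(resp.json()["response"])
```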
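
And a minimal local‑RAG sketch in the same spirit: embed a few notes with Ollama's embeddings endpoint, retrieve the closest one by cosine similarity, and assemble a prompt. The embedding model name (`nomic-embed-text`) and the note contents are illustrative assumptions:

```python
# Minimal local RAG: embed notes once, retrieve top-1 by cosine similarity.
import math
import requests

OLLAMA = "http://localhost:11434"

def embed(text: str) -> list[float]:
    r = requests.post(f"{OLLAMA}/api/embeddings",
                      json={"model": "nomic-embed-text", "prompt": text},
                      timeout=60)
    return r.json()["embedding"]

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

notes = [
    "Quantization trades numeric precision for a smaller memory footprint.",
    "MoE models activate only a few experts per token, cutting compute.",
    "llama.cpp can mmap weights so the OS pages them in from disk lazily.",
]
index = [(note, embed(note)) for note in notes]  # embed once, reuse per query

query = "Why are MoE models cheaper to run?"
qvec = embed(query)
best = max(index, key=lambda item: cosine(qvec, item[1]))[0]  # top-1 retrieval

# The assembled prompt would then go to /api/generate as in the sketch above.
print(f"Context: {best}\n\nQuestion: {query}\nAnswer using only the context.")
```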
Licensing, data, and “openness”
- Distinction between open‑weights and open‑source licenses: Llama 3.1 ships under a community license with a 700M monthly‑active‑user cap and restrictions on using it to train other models.
- Debate over licenses that forbid using outputs to train competitors, versus the field’s reliance on scraped and synthetic data.
- Synthetic data / distillation (e.g., Phi‑style “textbook” training) is seen as promising, though commenters question its real‑world robustness.