Forget ChatGPT: why researchers now run small AIs on their laptops
Why run small / local models?
- Privacy and control: Avoid sending sensitive or proprietary data to remote services; some users work on air‑gapped or highly regulated systems.
- Stability and reproducibility: Hosted models change silently; local models are version‑pinned and debuggable.
- Customization: Easier to fine‑tune, remove safety filters, or build uncensored/“abliterated” variants for domains that hosted models refuse.
- Offline use: Useful on flights, in poor connectivity, or as a “local Google” for coding and systems work.
- Cost & lock‑in: One‑time hardware spend vs ongoing API bills and vendor lock‑in, especially for fine‑tuned models.
Hardware, performance, and scaling
- Recent and upcoming laptop chips (e.g., AMD Strix Halo, Apple M‑series) offer large unified memory pools and NPUs, but memory bandwidth well below that of high‑end discrete GPUs.
- Bandwidth and VRAM both matter: commenters report that 70B+ models on laptop‑class hardware are slow, with a few tokens/s and long time‑to‑first‑token (see the back‑of‑envelope estimate after this list).
- Tricks: quantization, mixture‑of‑experts (MoE) models, multi‑Mac clusters, offloading layers to a discrete GPU, and mmap’ing weights from disk; a sketch of several of these follows the list.
- High‑end local setups (multiple RTX 4090s, H100s, big‑RAM Xeon boxes) can run 128B–405B models, but the cost and power draw are substantial.
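
A rough way to see why 70B‑class models crawl on laptop hardware: at decode time, a dense model streams every active weight through memory once per generated token, so memory bandwidth sets a hard ceiling on tokens/s. The bandwidth figures below are illustrative assumptions, not benchmarks:

```python
# Back-of-envelope decode speed: each generated token must read every
# weight once, so memory bandwidth is the ceiling for a dense model.
# Bandwidth numbers below are rough, illustrative assumptions.

def decode_tokens_per_sec(params_b: float, bits_per_weight: float,
                          bandwidth_gb_s: float) -> float:
    """Upper bound on tokens/s for a dense model at a given quantization."""
    weight_bytes_gb = params_b * bits_per_weight / 8  # GB read per token
    return bandwidth_gb_s / weight_bytes_gb

# A 70B model at ~4.5 bits/weight (Q4_K_M-style quantization) is ~39 GB:
for name, bw in [("laptop unified memory (~100 GB/s)", 100),
                 ("Apple M-series Max (~400 GB/s)", 400),
                 ("RTX 4090 (~1000 GB/s)", 1000)]:
    print(f"{name}: ~{decode_tokens_per_sec(70, 4.5, bw):.1f} tok/s ceiling")
```

This reproduces the complaint above: laptop‑class bandwidth caps a quantized 70B model at roughly 2–3 tok/s before any compute overhead.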
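
For the quantization/offload/mmap tricks, here is a minimal sketch using llama-cpp-python; the GGUF path and layer count are placeholders for whatever model and GPU you actually have:

```python
# Minimal sketch via llama-cpp-python (pip install llama-cpp-python).
# The model path is a placeholder; any quantized GGUF file works.
from llama_cpp import Llama

llm = Llama(
    model_path="models/llama-3.1-8b-instruct.Q4_K_M.gguf",  # 4-bit quantized weights
    n_gpu_layers=20,   # offload the first 20 transformer layers to a discrete GPU
    n_ctx=4096,        # context window
    use_mmap=True,     # mmap weights from disk instead of copying them into RAM
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Summarize why quantization helps."}],
    max_tokens=128,
)
print(out["choices"][0]["message"]["content"])
```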
Model quality: small vs frontier
- Consensus: 8–14B local models (Llama 3.1 8B, Qwen, Mistral‑class) are now “good enough” for many tasks (summarization, basic coding, note cleanup).
- Several commenters still find them clearly worse than GPT‑4/Claude, especially for complex reasoning, robust codegen, and general knowledge.
- Some argue big labs keep scaling because small models cannot truly compete; others think efficient small models plus systems work may erode that lead.
Workflows & tools
- Popular stacks: Ollama, llama.cpp / llamafiles, LM Studio, OpenWebUI, Jan, Twinny, GPT4All, and various IDE integrations (Continue, gen.nvim, local Copilot‑style autocomplete); several of these expose a local HTTP API (see the sketch after this list).
- Common use cases: personal knowledge bases with embeddings + RAG (a minimal sketch follows the list), Obsidian integration, email spam filtering, local Perplexity‑style web search, code assistance, multimodal OCR/screenshot QA, and voice‑note → Whisper → LLM → structured‑notes pipelines.
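
Several of these stacks serve a local HTTP API; the sketch below hits Ollama's default endpoint on localhost:11434 (the model tag `llama3.1:8b` is just an example of something fetched with `ollama pull`):

```python
# Hedged sketch: query a locally running Ollama server over its HTTP API.
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3.1:8b",   # any model previously pulled with `ollama pull`
        "prompt": "Explain RAG in two sentences.",
        "stream": False,          # return one JSON object instead of a token stream
    },
    timeout=120,
)
print(resp.json()["response"])
```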
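
And a minimal local‑RAG sketch in the same spirit: embed a few notes with Ollama's embeddings endpoint, retrieve the closest one by cosine similarity, and assemble a prompt. The embedding model name (`nomic-embed-text`) and the note contents are illustrative assumptions:

```python
# Minimal local RAG: embed notes once, retrieve top-1 by cosine similarity.
import math
import requests

OLLAMA = "http://localhost:11434"

def embed(text: str) -> list[float]:
    r = requests.post(f"{OLLAMA}/api/embeddings",
                      json={"model": "nomic-embed-text", "prompt": text},
                      timeout=60)
    return r.json()["embedding"]

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

notes = [
    "Quantization trades numeric precision for a smaller memory footprint.",
    "MoE models activate only a few experts per token, cutting compute.",
    "llama.cpp can mmap weights so the OS pages them in from disk lazily.",
]
index = [(note, embed(note)) for note in notes]  # embed once, reuse per query

query = "Why are MoE models cheaper to run?"
qvec = embed(query)
best = max(index, key=lambda item: cosine(qvec, item[1]))[0]  # top-1 retrieval

# The assembled prompt would then go to /api/generate as in the sketch above.
print(f"Context: {best}\n\nQuestion: {query}\nAnswer using only the context.")
```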
Licensing, data, and “openness”
- Distinction between open‑weights and open‑source licenses: Llama 3.1 ships under a community license with a 700M monthly‑active‑user cap and restrictions on using it to train other models.
- Debate over licenses that forbid using outputs to train competitors, versus the field’s reliance on scraped and synthetic data.
- Synthetic data / distillation (e.g., Phi‑style “textbook” training) is seen as promising, though commenters question its real‑world robustness.