How I run LLMs locally
Perceived value of the article
- Some readers find it a helpful, concise link hub; others call it a semi-organized link dump lacking depth and missing key resources.
- Several note the “missing piece” is a direct comparison of local models vs top hosted models on dimensions beyond privacy.
Local vs cloud LLMs: performance and quality
- Multiple commenters report that 7–8B models run very fast on consumer GPUs (e.g., 3060/3090/4090), often matching or beating hosted open-model APIs.
- Others see 20+ second latencies and conclude local isn’t worth it; replies attribute this to misconfiguration (inference falling back to CPU, the model not fitting in VRAM, no streaming, or the model being unloaded between requests).
- Several argue smaller local models are fine for casual/creative use, but ~30B+ is needed for more reliable coding/logic; even 70B locals still trail Claude/GPT for “professional” work.
- A minority claim local models are “toys” and non-competitive; others counter with benchmarks where open models rival or beat some closed ones in specific domains.
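The fast-vs-slow split above largely comes down to memory bandwidth: decoding is memory-bound, so each generated token streams roughly the whole model through memory once. A back-of-envelope sketch (bandwidth and bits-per-weight figures are approximate spec-sheet numbers, not measurements):

```python
# Rough upper bound on decode speed for memory-bandwidth-bound inference:
# tokens/s ≈ memory bandwidth / model size in bytes.

def est_tokens_per_s(params_b: float, bits_per_weight: float, bandwidth_gb_s: float) -> float:
    model_gb = params_b * bits_per_weight / 8  # billions of params -> GB
    return bandwidth_gb_s / model_gb

# 8B model at ~4.5 bits/weight (Q4-class quantization):
gpu = est_tokens_per_s(8, 4.5, 936)  # RTX 3090, ~936 GB/s -> ~208 tok/s ceiling
cpu = est_tokens_per_s(8, 4.5, 50)   # dual-channel DDR4, ~50 GB/s -> ~11 tok/s ceiling
```

The ~20x gap between those two ceilings is why a model silently falling back to CPU inference turns a snappy setup into 20+ second responses.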
Hardware choices and economics
- Strong consensus: for PCs, Nvidia with as much VRAM as possible; used 3090s (24 GB) or dual 3090s (48 GB) are popular for 70B models.
- 12 GB cards (e.g., 3060) are acceptable for 7–8B models but limiting for larger models.
- Some advocate Macs with large unified memory (64–128 GB) as cost-effective, especially for inference-only workloads; others argue multi-GPU PCs offer better raw performance and scalability.
- Datacenter GPUs (A100, L40S) are seen as overkill and economically dubious for home; renting or using APIs is usually recommended instead.
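The VRAM recommendations above follow from simple arithmetic on quantized weight sizes (assuming roughly 4.5 bits/weight for Q4-class quantization; KV cache and runtime overhead add several GB on top, so treat these as lower bounds):

```python
# Weights-only memory footprint of a quantized model.
def weights_gb(params_b: float, bits_per_weight: float) -> float:
    return params_b * bits_per_weight / 8

for params in (8, 30, 70):
    print(f"{params}B @ ~4.5 bits/weight ≈ {weights_gb(params, 4.5):.1f} GB")
# 8B  ≈ 4.5 GB  -> comfortable on a 12 GB 3060
# 70B ≈ 39.4 GB -> needs ~48 GB (dual 3090) once cache and overhead are included
```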
Software stacks and frontends
- Popular stacks: text-generation-webui / “oobabooga,” Ollama, llamafile, Open WebUI, LM Studio, Jan, Msty, AnythingLLM, Lobe Chat.
- Oobabooga is praised as the most “pro” UI with extensive tuning options, but setup can be painful.
- Open WebUI is feature rich but heavy; some seek minimalist frontends.
- Several mention speculative decoding and quantization (Q4–Q8) as key for speed and fitting models in VRAM.
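Speculative decoding, mentioned above, pairs a small “draft” model with the large “target” model: the draft cheaply proposes several tokens, the target verifies them in one pass, and the longest agreeing prefix is kept. A toy greedy sketch (both “models” are stand-in functions, not real LLMs, and a real engine batches the verification instead of looping):

```python
def speculative_step(target, draft, ctx, k=4):
    # 1) Draft model proposes k tokens greedily.
    proposal, d_ctx = [], list(ctx)
    for _ in range(k):
        tok = draft(d_ctx)
        proposal.append(tok)
        d_ctx.append(tok)
    # 2) Target verifies; keep the longest agreeing prefix,
    #    replacing the first mismatch with the target's own token.
    accepted, t_ctx = [], list(ctx)
    for tok in proposal:
        want = target(t_ctx)
        if tok != want:
            accepted.append(want)
            break
        accepted.append(tok)
        t_ctx.append(tok)
    else:
        accepted.append(target(t_ctx))  # bonus token when all k agree
    return ctx + accepted

# Toy models: target counts up; draft agrees except when context length % 5 == 0.
target = lambda ctx: len(ctx)
draft = lambda ctx: len(ctx) + 1 if len(ctx) % 5 == 0 else len(ctx)
```

When the draft agrees, one step emits k+1 tokens for one target pass; when it diverges immediately, you still make one token of progress, which is why a well-matched draft model is close to free speed.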
Use cases and when local makes sense
- Reported uses: coding assistance, personal knowledge work, RAG over private docs, drafting blog content, image generation, and experimentation.
- Many say APIs (GPT/Claude/etc.) are cheaper and simpler for serious, latency-sensitive, or customer-facing work.
- Local is favored when privacy, ownership, learning, or hobbyist tinkering are primary goals.
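RAG over private docs, one of the use cases above, reduces to “retrieve relevant snippets, stuff them into the prompt.” A minimal dependency-free sketch, with naive word-overlap retrieval standing in for embeddings and the local model call left out:

```python
def retrieve(query: str, docs: list[str], top_k: int = 2) -> list[str]:
    # Score each doc by word overlap with the query (real setups use embeddings).
    q = set(query.lower().split())
    scored = sorted(docs, key=lambda d: -len(q & set(d.lower().split())))
    return scored[:top_k]

def build_prompt(query: str, docs: list[str]) -> str:
    context = "\n---\n".join(retrieve(query, docs))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

docs = [
    "The backup job runs nightly at 02:00 and writes to the NAS.",
    "VPN credentials rotate every 90 days.",
    "Office coffee machine descaling instructions.",
]
prompt = build_prompt("When does the backup job run?", docs)  # feed to local model
```

Nothing here leaves the machine, which is the point: the documents, the index, and the prompt all stay local.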
Privacy, trust, and data concerns
- Strong divide: some trust cloud EULAs and see local-only privacy worries as “FUD”; others deeply distrust big tech promises and prefer full local control.
- A recurring theme: small local models are “good enough” for many private tasks and avoid profiling and future data breaches.
Copyright, training data, and “unknown creators”
- Acknowledgment of uncredited creators behind training data sparks debate.
- Some see such credit lines as empty “land acknowledgements”; others see them as necessary reminders.
- Long thread on whether training should be copyright-limited:
  - One side argues copying for training should trigger compensation, fearing exploitation of the commons.
  - Another side notes humans already “train” on copyrighted works; over-restricting training could centralize power in a few IP-rich incumbents.
- Concerns that incentives to contribute to public knowledge platforms may erode if value is captured primarily by AI vendors.
Scaling to “business-class” clusters
- People ask for guides to multi-A100 / 70B+ setups; replies say:
  - True datacenter configurations are complex, power-hungry, and expensive, with few public “optimal” guides.
  - For most non-enterprise use, multi-3090 rigs or cloud rentals are more practical.