How I run LLMs locally

Perceived value of the article

  • Some readers find it a helpful, concise link hub; others call it a semi-organized link dump lacking depth and missing key resources.
  • Several note the “missing piece” is a direct comparison of local models vs top hosted models beyond privacy.

Local vs cloud LLMs: performance and quality

  • Multiple commenters report that 7–8B models run very fast on consumer GPUs (e.g., 3060/3090/4090), often matching or beating hosted open-model APIs.
  • Others see 20+ second latencies and conclude local isn’t worth it; replies suggest misconfiguration (CPU-only inference, a model that doesn’t fit in VRAM, no streaming, or the model being unloaded between requests).
  • Several argue smaller local models are fine for casual/creative use, but ~30B+ is needed for more reliable coding/logic; even 70B locals still trail Claude/GPT for “professional” work.
  • A minority claim local models are “toys” and non-competitive; others counter with benchmarks where open models rival or beat some closed ones in specific domains.

Hardware choices and economics

  • Strong consensus: for PCs, Nvidia with as much VRAM as possible; used 3090s (24 GB) or dual 3090s (48 GB) are popular for 70B models.
  • 12 GB cards (e.g., 3060) are acceptable for 7–8B models but limiting for larger models.
  • Some advocate Macs with large unified memory (64–128 GB) as cost-effective, especially for inference-only workloads; others argue multi-GPU PCs offer better raw performance and scalability.
  • Datacenter GPUs (A100, L40S) are seen as overkill and economically dubious for home; renting or using APIs is usually recommended instead.
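The VRAM-to-model-size pairings above (12 GB for 7–8B, 24–48 GB for 70B) follow from simple arithmetic: weight memory is roughly parameter count times bits per weight, plus some headroom for the KV cache and runtime buffers. A rough sketch, with the 20% overhead fraction being an illustrative assumption rather than a measured figure:

```python
def vram_estimate_gb(params_billion: float, bits_per_weight: float,
                     overhead_frac: float = 0.2) -> float:
    """Rough VRAM needed for the weights, plus a flat fraction for
    KV cache and runtime buffers (a simplifying assumption)."""
    weight_bytes = params_billion * 1e9 * bits_per_weight / 8
    return weight_bytes * (1 + overhead_frac) / 1e9

# 7B at 4-bit quantization: ~4.2 GB -> comfortable on a 12 GB 3060
print(round(vram_estimate_gb(7, 4), 1))
# 70B at 4-bit: ~42 GB -> hence the popularity of dual 24 GB 3090s
print(round(vram_estimate_gb(70, 4), 1))
```

This also shows why 70B at higher-precision quants (8-bit ≈ 84 GB by the same estimate) pushes people toward large-unified-memory Macs or datacenter cards.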

Software stacks and frontends

  • Popular stacks: text-generation-webui / “oobabooga,” Ollama, llamafile, Open WebUI, LM Studio, Jan, Msty, AnythingLLM, Lobe Chat.
  • Oobabooga is praised as the most “pro” UI with extensive tuning options, but setup can be painful.
  • Open WebUI is feature rich but heavy; some seek minimalist frontends.
  • Several mention quantization (Q4–Q8) as key to fitting models in VRAM, and speculative decoding as a key speedup for generation.
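The speculative decoding idea mentioned above: a small, cheap draft model proposes several tokens ahead, and the large target model verifies them in one pass, keeping the longest agreeing prefix, so output quality matches the target model alone while most steps advance multiple tokens. A toy greedy sketch, where both "models" are just next-token functions over a token list (stand-ins, not real LLMs):

```python
def speculative_decode(target, draft, prompt, n_tokens, k=4):
    """Toy greedy speculative decoding. `target` and `draft` map a
    token list to the next token; `draft` proposes k tokens, `target`
    verifies them and keeps the longest matching prefix."""
    out = list(prompt)
    while len(out) - len(prompt) < n_tokens:
        # Draft proposes k tokens autoregressively (the cheap part).
        proposed, ctx = [], list(out)
        for _ in range(k):
            t = draft(ctx)
            proposed.append(t)
            ctx.append(t)
        # Target verifies; accept until the first mismatch, then emit
        # the target's own token at that position instead.
        ctx = list(out)
        for t in proposed:
            want = target(ctx)
            if want != t:
                out.append(want)
                break
            out.append(t)
            ctx.append(t)
        else:
            # All k proposals accepted; target supplies one bonus token.
            out.append(target(out))
    return out[len(prompt):len(prompt) + n_tokens]
```

Because every emitted token is either verified or produced by the target, the result is identical to running the target alone; a good draft model just gets there in fewer expensive passes.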

Use cases and when local makes sense

  • Reported uses: coding assistance, personal knowledge work, RAG over private docs, blog content, images, and experimentation.
  • Many say APIs (GPT/Claude/etc.) are cheaper and simpler for serious, latency-sensitive, or customer-facing work.
  • Local is favored when privacy, ownership, learning, or hobbyist tinkering are primary goals.
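The "RAG over private docs" use case reduces to: retrieve the documents most relevant to a question, then prepend them to the prompt so a small local model can answer from them. A minimal sketch using toy bag-of-words keyword matching (real setups use embedding models and a vector store; the function names here are illustrative):

```python
import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, docs: list[str], k: int = 2) -> list[str]:
    """Return the k documents most similar to the query."""
    qv = Counter(query.lower().split())
    return sorted(docs,
                  key=lambda d: cosine(qv, Counter(d.lower().split())),
                  reverse=True)[:k]

def build_prompt(query: str, docs: list[str]) -> str:
    """Stuff retrieved context ahead of the question -- the core RAG move."""
    context = "\n".join(retrieve(query, docs))
    return f"Context:\n{context}\n\nQuestion: {query}"
```

The privacy appeal is that both the document index and the model answering over it stay on your own machine.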

Privacy, trust, and data concerns

  • Strong divide: some trust cloud EULAs and see local-only privacy worries as “FUD”; others deeply distrust big tech promises and prefer full local control.
  • A recurring theme: small local models are “good enough” for many private tasks and avoid profiling and future data breaches.

Copyright, training data, and “unknown creators”

  • Acknowledgment of uncredited creators behind training data sparks debate.
  • Some see such credit lines as empty “land acknowledgements”; others see them as necessary reminders.
  • Long thread on whether training should be copyright-limited:
    • One side argues copying for training should trigger compensation, fearing exploitation of the commons.
    • Another side notes humans already “train” on copyrighted works; over-restricting training could centralize power in a few IP-rich incumbents.
  • Concerns that incentives to contribute to public knowledge platforms may erode if value is captured primarily by AI vendors.

Scaling to “business-class” clusters

  • Commenters ask for guides to multi-A100 / 70B+ setups; replies say:
    • True datacenter configurations are complex, power-hungry, and expensive, with few public “optimal” guides.
    • For most non-enterprise use, multi-3090 rigs or cloud rentals are more practical.