How I run LLMs locally

Perceived value of the article

  • Some readers find it a helpful, concise link hub; others call it a semi-organized link dump lacking depth and missing key resources.
  • Several note the “missing piece” is a direct comparison of local models vs top hosted models beyond privacy.

Local vs cloud LLMs: performance and quality

  • Multiple commenters report that 7–8B models run very fast on consumer GPUs (e.g., 3060/3090/4090), often matching or beating hosted open-model APIs.
  • Others see 20+ second latencies and conclude local isn’t worth it; replies suggest misconfiguration (CPU-only inference, a model that doesn’t fit in VRAM, no streaming, or the model being unloaded between requests).
  • Several argue smaller local models are fine for casual/creative use, but ~30B+ is needed for more reliable coding/logic; even 70B locals still trail Claude/GPT for “professional” work.
  • A minority claim local models are “toys” and non-competitive; others counter with benchmarks where open models rival or beat some closed ones in specific domains.

Hardware choices and economics

  • Strong consensus: for PCs, Nvidia with as much VRAM as possible; used 3090s (24 GB) or dual 3090s (48 GB) are popular for 70B models.
  • 12 GB cards (e.g., 3060) are acceptable for 7–8B models but limiting for larger models.
  • Some advocate Macs with large unified memory (64–128 GB) as cost-effective, especially for inference-only workloads; others argue multi-GPU PCs offer better raw performance and scalability.
  • Datacenter GPUs (A100, L40S) are seen as overkill and economically dubious for home; renting or using APIs is usually recommended instead.
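The VRAM-to-model-size pairings above (12 GB for 7–8B, 24–48 GB for 70B) follow from simple arithmetic: weight memory is roughly parameter count times bits per weight, plus some headroom for the KV cache and runtime buffers. A rough sketch, with the 20% overhead fraction being an illustrative assumption rather than a measured figure:

```python
def vram_estimate_gb(params_billion: float, bits_per_weight: float,
                     overhead_frac: float = 0.2) -> float:
    """Rough VRAM needed for the weights, plus a flat fraction for
    KV cache and runtime buffers (a simplifying assumption)."""
    weight_bytes = params_billion * 1e9 * bits_per_weight / 8
    return weight_bytes * (1 + overhead_frac) / 1e9

# 7B at 4-bit quantization: ~4.2 GB -> comfortable on a 12 GB 3060
print(round(vram_estimate_gb(7, 4), 1))
# 70B at 4-bit: ~42 GB -> hence the popularity of dual 24 GB 3090s
print(round(vram_estimate_gb(70, 4), 1))
```

This also shows why 70B at higher-precision quants (8-bit ≈ 84 GB by the same estimate) pushes people toward large-unified-memory Macs or datacenter cards.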

Software stacks and frontends

  • Popular stacks: text-generation-webui / “oobabooga,” Ollama, llamafile, Open WebUI, LM Studio, Jan, Msty, AnythingLLM, Lobe Chat.
  • Oobabooga is praised as the most “pro” UI with extensive tuning options, but setup can be painful.
  • Open WebUI is feature rich but heavy; some seek minimalist frontends.
  • Several mention quantization (Q4–Q8) as key to fitting models in VRAM, and speculative decoding as a key speedup for generation.
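The speculative decoding idea mentioned above: a small, cheap draft model proposes several tokens ahead, and the large target model verifies them in one pass, keeping the longest agreeing prefix, so output quality matches the target model alone while most steps advance multiple tokens. A toy greedy sketch, where both "models" are just next-token functions over a token list (stand-ins, not real LLMs):

```python
def speculative_decode(target, draft, prompt, n_tokens, k=4):
    """Toy greedy speculative decoding. `target` and `draft` map a
    token list to the next token; `draft` proposes k tokens, `target`
    verifies them and keeps the longest matching prefix."""
    out = list(prompt)
    while len(out) - len(prompt) < n_tokens:
        # Draft proposes k tokens autoregressively (the cheap part).
        proposed, ctx = [], list(out)
        for _ in range(k):
            t = draft(ctx)
            proposed.append(t)
            ctx.append(t)
        # Target verifies; accept until the first mismatch, then emit
        # the target's own token at that position instead.
        ctx = list(out)
        for t in proposed:
            want = target(ctx)
            if want != t:
                out.append(want)
                break
            out.append(t)
            ctx.append(t)
        else:
            # All k proposals accepted; target supplies one bonus token.
            out.append(target(out))
    return out[len(prompt):len(prompt) + n_tokens]
```

Because every emitted token is either verified or produced by the target, the result is identical to running the target alone; a good draft model just gets there in fewer expensive passes.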

Use cases and when local makes sense

  • Reported uses: coding assistance, personal knowledge work, RAG over private docs, blog content, images, and experimentation.
  • Many say APIs (GPT/Claude/etc.) are cheaper and simpler for serious, latency-sensitive, or customer-facing work.
  • Local is favored when privacy, ownership, learning, or hobbyist tinkering are primary goals.
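The "RAG over private docs" use case reduces to: retrieve the documents most relevant to a question, then prepend them to the prompt so a small local model can answer from them. A minimal sketch using toy bag-of-words keyword matching (real setups use embedding models and a vector store; the function names here are illustrative):

```python
import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, docs: list[str], k: int = 2) -> list[str]:
    """Return the k documents most similar to the query."""
    qv = Counter(query.lower().split())
    return sorted(docs,
                  key=lambda d: cosine(qv, Counter(d.lower().split())),
                  reverse=True)[:k]

def build_prompt(query: str, docs: list[str]) -> str:
    """Stuff retrieved context ahead of the question -- the core RAG move."""
    context = "\n".join(retrieve(query, docs))
    return f"Context:\n{context}\n\nQuestion: {query}"
```

The privacy appeal is that both the document index and the model answering over it stay on your own machine.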

Privacy, trust, and data concerns

  • Strong divide: some trust cloud EULAs and see local-only privacy worries as “FUD”; others deeply distrust big tech promises and prefer full local control.
  • A recurring theme: small local models are “good enough” for many private tasks and avoid profiling and future data breaches.

Copyright, training data, and “unknown creators”

  • Acknowledgment of uncredited creators behind training data sparks debate.
  • Some see such credit lines as empty “land acknowledgements”; others see them as necessary reminders.
  • Long thread on whether training should be copyright-limited:
    • One side argues copying for training should trigger compensation, fearing exploitation of the commons.
    • Another side notes humans already “train” on copyrighted works; over-restricting training could centralize power in a few IP-rich incumbents.
  • Concerns that incentives to contribute to public knowledge platforms may erode if value is captured primarily by AI vendors.

Scaling to “business-class” clusters

  • Commenters ask for guides to multi-A100 / 70B+ setups; replies say:
    • True datacenter configurations are complex, power-hungry, and expensive, with few public “optimal” guides.
    • For most non-enterprise use, multi-3090 rigs or cloud rentals are more practical.