Exo: Run your own AI cluster at home with everyday devices
Motivations for running models locally
- Privacy and censorship resistance are recurring reasons: users want to run models on sensitive data (journals, private audio, “spicy” images) without sending it to large providers.
- Customization is easier locally (changing system prompts, using uncensored models, LoRAs, domain-specific setups).
- Offline and reliable access is valued, especially where connectivity is unreliable or providers could change policy or shut down.
Arguments for cloud-hosted models
- Many note a large quality gap: small local models (e.g., 7–8B) are seen as far behind GPT‑4/Claude-level systems for complex or high-stakes work.
- For productivity, $20–100/month in API usage is argued to be cheaper than buying and operating powerful local hardware, especially once you factor in setup and maintenance.
- Hosted solutions offer integrations (web search, tools like Wolfram Alpha) that local models typically lack.
Cost and hardware trade-offs
- One side: spare hardware plus free open models means $0 experimentation, which suits students and hobbyists. Cloud is “not free” and can get expensive under heavy use.
- Other side: upfront cost of capable GPUs, electricity, and time is high; for “just messing around,” cheap APIs and free tiers (TogetherAI, Groq, OpenRouter) are seen as better.
- Some argue that for sustained >8h/day workloads, owned or colo hardware can beat big-cloud pricing; others counter that cloud still benefits from economies of scale.
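
The owned-versus-cloud argument above comes down to break-even arithmetic. A minimal sketch, assuming a simple model where owned hardware costs a one-time purchase price plus electricity and cloud costs a flat hourly rate. Every number below is a hypothetical placeholder, not a figure from the discussion, and the function name is illustrative.

```python
from typing import Optional

DAYS_PER_MONTH = 30  # simplifying assumption


def break_even_months(
    hardware_cost: float,        # one-time purchase price (hypothetical)
    power_watts: float,          # average draw under load (hypothetical)
    hours_per_day: float,        # hours of actual use per day
    electricity_per_kwh: float,  # local electricity price (hypothetical)
    cloud_per_hour: float,       # hourly price of comparable cloud capacity (hypothetical)
) -> Optional[float]:
    """Months until owned hardware becomes cheaper than renting, or None if it never does."""
    hours_per_month = hours_per_day * DAYS_PER_MONTH
    owned_monthly = power_watts / 1000 * hours_per_month * electricity_per_kwh
    cloud_monthly = cloud_per_hour * hours_per_month
    monthly_saving = cloud_monthly - owned_monthly
    if monthly_saving <= 0:
        # Renting is cheaper every month, so the purchase is never recovered.
        return None
    return hardware_cost / monthly_saving


# Sustained use: 8 hours a day against a placeholder $1.00/hour cloud rate.
heavy = break_even_months(2000, 400, 8, 0.15, 1.00)
# Light use: 1 hour a day against a placeholder $0.05/hour cheap API tier.
light = break_even_months(2000, 400, 1, 0.15, 0.05)
print(heavy, light)
```

Under these placeholder inputs the heavy-use case pays off within a year, while the light-use case never does. That matches the shape of both sides of the argument, but the result depends entirely on the assumed prices and ignores setup time, maintenance, and resale value.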
How Exo works and technical feasibility
- Exo uses pipeline parallelism: each device holds a different slice of the model's layers, and only the activations at the slice boundary (described in the thread as embeddings) are sent between devices.
- Reported activation sizes: ~8–10 KB per token for 8B models, ~32 KB for 70B; expected to stay O(10–100 KB) even for much larger models.
- On a local network, bandwidth is seen as fine; latency is the main bottleneck, especially over the internet, limiting SETI@home-style global clustering.
- Some users report no speedup when using two MacBooks versus one, suggesting the current implementation or scheduling needs work.
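
The activation-size and latency points above can be checked with back-of-the-envelope arithmetic. A minimal sketch, assuming the data passed between pipeline stages is one hidden-state vector per generated token, and using hidden sizes of 4096 and 8192 (the published Llama 3 8B and 70B configurations). The bandwidth and latency values are illustrative placeholders, not measurements of Exo, and the function names are hypothetical.

```python
def activation_bytes(hidden_size: int, bytes_per_value: int = 2) -> int:
    """Size of one token's hidden-state vector (default: 16-bit values)."""
    return hidden_size * bytes_per_value


def per_token_transfer_seconds(
    payload_bytes: int,
    bandwidth_bits_per_s: float,
    one_way_latency_s: float,
    hops: int,
) -> float:
    """Network time added per token: each hop pays latency plus serialization."""
    serialization = payload_bytes * 8 / bandwidth_bits_per_s
    return hops * (one_way_latency_s + serialization)


size_8b = activation_bytes(4096)             # 16-bit values
size_70b_fp16 = activation_bytes(8192)       # 16-bit values
size_70b_fp32 = activation_bytes(8192, 4)    # 32-bit values

# Placeholder links: 1 Gbit/s LAN at 0.5 ms, 100 Mbit/s internet path at 50 ms.
lan = per_token_transfer_seconds(size_8b, 1e9, 0.0005, hops=1)
wan = per_token_transfer_seconds(size_8b, 1e8, 0.05, hops=1)

print(f"8B: {size_8b} B, 70B: {size_70b_fp16} B (fp16) / {size_70b_fp32} B (fp32)")
print(f"LAN: {lan * 1e3:.3f} ms per hop, internet: {wan * 1e3:.3f} ms per hop")
```

With these assumptions the payload is a few kilobytes, so serialization time is small next to the fixed latency of each hop, which is why latency rather than bandwidth dominates. The sketch also hints at one possible reason for the reports of no speedup: when a single token has to pass through the stages one after another, splitting layers across devices adds memory capacity rather than per-token speed.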
Maturity, platform support, and concerns
- The project is explicitly experimental and rapidly changing. Issues raised include:
  - An early hard dependency on Apple-only MLX, which conflicts with the “everyday devices” marketing.
  - A Tinygrad backend exists; llama.cpp support is planned.
  - Windows, iOS, Android, Raspberry Pi, and Coral TPU support are desired, but not all are proven.
  - Benchmarks (tokens/sec, latency) are lacking and the project initially had no license; users requested both.
  - The security model currently assumes a trusted local network; documentation is being updated.
Broader themes
- There is debate over whether “swarm compute” on idle devices is worth the cost to device longevity, power draw, and thermals.
- Some view local/self-hosted AI as philosophically similar to open source and as a check on concentrated corporate control.