Exo: Run your own AI cluster at home with everyday devices

Motivations for running models locally

  • Privacy and censorship resistance are recurring reasons: users want to run models on sensitive data (journals, private audio, “spicy” images) without sending it to large providers.
  • Customization is easier locally (changing system prompts, using uncensored models, LoRAs, domain-specific setups).
  • Offline and reliable access is valued, especially where connectivity is unreliable or providers could change policy or shut down.

Arguments for cloud-hosted models

  • Many note a large quality gap: small local models (e.g., 7–8B) are seen as far behind GPT‑4/Claude-level systems for complex or high-stakes work.
  • For productivity, $20–100/month in API usage is argued to be cheaper than buying and operating powerful local hardware, especially once you factor in setup and maintenance.
  • Hosted solutions offer integrations (web search, tools like Wolfram Alpha) that local models typically lack.

Cost and hardware trade-offs

  • One side: spare hardware + free open models = $0 experimentation; good for students and hobbyists. Cloud is “not free” and can get expensive for heavy use.
  • Other side: upfront cost of capable GPUs, electricity, and time is high; for “just messing around,” cheap APIs and free tiers (TogetherAI, Groq, OpenRouter) are seen as better.
  • Some argue that for sustained >8h/day workloads, owned or colo hardware can beat big-cloud pricing; others counter that cloud still benefits from economies of scale.
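The break-even argument in the bullets above can be made concrete with a little arithmetic. The following is a minimal sketch, not a pricing model from the thread: the function name and every figure in the usage lines (hardware price, wattage, electricity rate) are illustrative assumptions; only the $20–100/month range comes from the discussion.

```python
def breakeven_months(hardware_cost, watts, hours_per_day, price_per_kwh, cloud_monthly):
    """Months until owned hardware costs less than a flat monthly cloud bill.

    Returns None when local electricity alone costs at least as much as
    the cloud bill, so the hardware never pays for itself.
    """
    # Energy used per month, assuming a 30-day month.
    kwh_per_month = watts / 1000 * hours_per_day * 30
    local_monthly = kwh_per_month * price_per_kwh
    saving = cloud_monthly - local_monthly
    if saving <= 0:
        return None
    return hardware_cost / saving


# Hypothetical inputs: a $2000 machine drawing 400 W for 8 h/day at $0.25/kWh.
heavy_user = breakeven_months(2000, 400, 8, 0.25, cloud_monthly=100)
light_user = breakeven_months(2000, 400, 8, 0.25, cloud_monthly=20)
# Spare hardware that is already paid for has no upfront cost to recover.
spare_hw = breakeven_months(0, 400, 8, 0.25, cloud_monthly=100)
```

Under these assumed numbers the heavy user recovers the hardware cost in a bit over two years, the light user never does, and spare hardware breaks even immediately. The sketch ignores setup time, maintenance, and any quality difference between local and hosted models.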

How Exo works and technical feasibility

  • Exo uses pipeline parallelism: different devices hold different layers; only activations (embeddings) are sent between them.
  • Reported activation sizes: ~8–10 KB per token for 8B models, ~32 KB for 70B; expected to stay on the order of 10–100 KB even for much larger models.
  • On a local network, bandwidth is seen as fine; latency is the main bottleneck, especially over the internet, limiting SETI@home-style global clustering.
  • Some users report no speedup when using two MacBooks versus one, suggesting current implementation or scheduling needs work.
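The points above about layer placement, activation size, and latency can be sketched in a few lines. This is a toy model under stated assumptions, not Exo's actual code: the function names are hypothetical, the even layer split is a simplification, and the hidden widths (4096 and 8192) and value sizes (2 and 4 bytes) are assumed values chosen because they reproduce the reported ~8 KB and ~32 KB figures.

```python
def partition_layers(num_layers, num_devices):
    """Split layer indices into contiguous, near-equal shards, one per device."""
    base, extra = divmod(num_layers, num_devices)
    shards, start = [], 0
    for device in range(num_devices):
        # The first `extra` devices take one additional layer each.
        size = base + (1 if device < extra else 0)
        shards.append(range(start, start + size))
        start += size
    return shards


def activation_bytes(hidden_size, bytes_per_value):
    """Size of one token's activation vector passed between two shards."""
    return hidden_size * bytes_per_value


def hop_seconds(payload_bytes, bandwidth_bits_per_s, latency_s):
    """Time for one device-to-device transfer: fixed latency plus transmit time."""
    return latency_s + payload_bytes * 8 / bandwidth_bits_per_s


shards = partition_layers(num_layers=32, num_devices=3)
small = activation_bytes(hidden_size=4096, bytes_per_value=2)  # 8192 bytes
large = activation_bytes(hidden_size=8192, bytes_per_value=4)  # 32768 bytes

# Assumed links: 1 Gbit/s LAN with 1 ms latency, 100 Mbit/s WAN with 50 ms.
lan_hop = hop_seconds(small, 1e9, 0.001)
wan_hop = hop_seconds(small, 1e8, 0.050)

# Each generated token crosses every shard boundary once, in order.
lan_overhead_per_token = (len(shards) - 1) * lan_hop
wan_overhead_per_token = (len(shards) - 1) * wan_hop
```

With these assumed links, transmit time for an 8 KB payload is well under a millisecond in both cases, so the fixed latency term dominates the per-hop cost. In this simple model the devices also run one after another for a single token stream, so a second device adds memory capacity rather than speed; that is consistent with, though not a confirmed explanation of, the two-MacBook reports.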

Maturity, platform support, and concerns

  • The project is explicitly experimental and rapidly changing; points raised include:
    • Early hard dependency on Apple-only MLX, conflicting with “everyday devices” marketing.
    • Tinygrad backend exists; llama.cpp support is planned.
    • Support for Windows, iOS, Android, Raspberry Pi, and Coral TPU is desired, but not all of these are proven.
    • Benchmarks (tokens/sec, latency) are lacking and a license was initially missing; users requested both.
    • Security model currently assumes a trusted local network; documentation is being updated.

Broader themes

  • Debate over whether “swarm compute” on idle devices is desirable, given the cost in device longevity, power draw, and thermals.
  • Some view local/self-hosted AI as philosophically similar to open source and as a check on concentrated corporate control.