Can I run AI locally?
Scope & Purpose of the Site
- Tool estimates which LLMs can run locally and at what tokens/second, based largely on VRAM, RAM, and bandwidth.
- Many commenters find the tool useful, especially for purchase decisions and as a quick “can I run X?” reference.
- Several say it’s reminiscent of old “Can You Run It?” PC game requirement checkers.
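The core estimate such a calculator makes can be sketched as a memory-bandwidth bound: during generation, each token reads every weight once, so throughput is roughly bandwidth divided by model size in bytes. This is a hypothetical sketch of that kind of math, not the site's actual formula; the function name and numbers are illustrative.

```python
# Rough memory-bandwidth estimate of local LLM generation speed.
# A sketch of the kind of math such a calculator might use; real
# throughput also depends on compute, quantization, and the runtime.

def estimate_tok_per_s(params_b: float, bytes_per_param: float,
                       bandwidth_gb_s: float) -> float:
    """Each generated token reads every weight once, so speed is
    roughly memory bandwidth divided by model size in bytes."""
    model_gb = params_b * bytes_per_param
    return bandwidth_gb_s / model_gb

# Example: an 8B model at ~4-bit (~0.5 bytes/param) on ~100 GB/s RAM
print(round(estimate_tok_per_s(8, 0.5, 100), 1))  # upper bound ~25 tok/s
```

Note this is an upper bound: prefill is compute-bound, and conflating prefill with generation speed is exactly one of the accuracy complaints below.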
Accuracy, Data Quality & Gaps
- Multiple reports that estimates are significantly off: models marked “can’t run” or “slow” actually run much faster in practice (e.g., Qwen 3.5 35B, GPT-OSS 120B, big MoE models).
- Site appears to conflate prefill and generation speeds and may overstate Apple Silicon performance; some call this “nonsense” or “LLM‑generated.”
- MoE models: calculator seems to use total parameters instead of active parameters, underestimating speed.
- Quantization, mmap, KV-cache offloading, and unified/shared memory (Apple/AMD/Intel iGPUs) are mostly ignored, so many real‑world configurations aren’t captured.
- Hardware list is incomplete or incorrect for many: missing RTX Pro 6000, A4000, 4050/5060Ti, some Teslas, mobile GPUs, Tensor chips, Strix Halo, various AMD/Intel SKUs; RAM caps for M3 Ultra wrong; includes non‑existent “M4 Ultra.”
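The MoE complaint above is easy to see numerically: a bandwidth-bound estimate should use the parameters actually read per token (the routed experts), not the total. A minimal sketch with illustrative, assumed numbers (a 120B-total / 5B-active model on a 250 GB/s unified-memory machine at ~4-bit):

```python
# Why using total instead of active parameters underestimates MoE speed.
# Per-token decode reads only the weights of the experts actually routed
# to, not the whole model. All numbers here are assumptions.

def moe_tok_per_s(active_params_b: float, bytes_per_param: float,
                  bandwidth_gb_s: float) -> float:
    """Bandwidth-bound estimate from the per-token parameter count."""
    return bandwidth_gb_s / (active_params_b * bytes_per_param)

bandwidth = 250.0                 # GB/s, assumed unified-memory machine
total_b, active_b = 120.0, 5.0    # hypothetical MoE: 120B total, 5B active
bpp = 0.5                         # ~4-bit quantization

wrong = moe_tok_per_s(total_b, bpp, bandwidth)   # treats all params as hot
right = moe_tok_per_s(active_b, bpp, bandwidth)  # only routed experts
print(f"total-params estimate:  {wrong:.1f} tok/s")   # 4.2
print(f"active-params estimate: {right:.1f} tok/s")   # 100.0
```

A calculator dividing by total parameters would report roughly a 24x slower speed here, which matches the reports of big MoE models running far faster than predicted.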
UX & Feature Requests
- Requests for:
  - Ability to choose a model first and see performance across hardware.
  - Filters by task (coding, extraction, vision, embeddings) and by model quality, not just speed.
  - Clearer explanation of ratings (S/A/B/…) and metrics like latency/time‑to‑first‑token.
  - Better handling of quant levels, context sizes, and tool‑use behavior.
  - Higher-contrast, larger UI text; better mobile layout.
- Some want crowdsourced, benchmark‑style data instead of pure estimation.
Privacy & Hardware Detection
- Site uses browser APIs/WebGL/WebGPU as a heuristic for hardware; some are surprised their GPU specs are visible to websites and see fingerprinting risks.
- Others note detection is often wrong (e.g., mis-reporting VRAM or GPU model).
Local vs Cloud Tradeoffs
- Several argue that economics and quality still favor the cloud (Groq, frontier APIs), citing large speed and quality gaps.
- Others prioritize privacy, offline access, experimentation freedom, and narrow local tasks (OCR, STT, embeddings, small coding helpers) despite slower, weaker models.