Can I run AI locally?
Scope & Purpose of the Site
- Tool estimates which LLMs can run locally and at what tokens/second, based largely on VRAM, RAM, and bandwidth.
- Many commenters find the tool useful, especially for purchase decisions and as a quick “can I run X?” reference.
- Several say it’s reminiscent of old “Can You Run It?” PC game requirement checkers.
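The core estimate such a calculator makes can be sketched as a memory-bandwidth bound: during generation, each token reads every weight once, so throughput is roughly bandwidth divided by model size in bytes. This is a hypothetical sketch of that kind of math, not the site's actual formula; the function name and numbers are illustrative.

```python
# Rough memory-bandwidth estimate of local LLM generation speed.
# A sketch of the kind of math such a calculator might use; real
# throughput also depends on compute, quantization, and the runtime.

def estimate_tok_per_s(params_b: float, bytes_per_param: float,
                       bandwidth_gb_s: float) -> float:
    """Each generated token reads every weight once, so speed is
    roughly memory bandwidth divided by model size in bytes."""
    model_gb = params_b * bytes_per_param
    return bandwidth_gb_s / model_gb

# Example: an 8B model at ~4-bit (~0.5 bytes/param) on ~100 GB/s RAM
print(round(estimate_tok_per_s(8, 0.5, 100), 1))  # upper bound ~25 tok/s
```

Note this is an upper bound: prefill is compute-bound, and conflating prefill with generation speed is exactly one of the accuracy complaints below.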
Accuracy, Data Quality & Gaps
- Multiple reports that estimates are significantly off: models marked “can’t run” or “slow” actually run much faster in practice (e.g., Qwen 3.5 35B, GPT-OSS 120B, big MoE models).
- Site appears to conflate prefill and generation speeds and may overstate Apple Silicon performance; some call this “nonsense” or “LLM‑generated.”
- MoE models: calculator seems to use total parameters instead of active parameters, underestimating speed.
- Quantization, mmap, KV-cache offloading, and unified/shared memory (Apple/AMD/Intel iGPUs) are mostly ignored, so many real‑world configurations aren’t captured.
- Hardware list is incomplete or incorrect for many: missing RTX Pro 6000, A4000, 4050/5060Ti, some Teslas, mobile GPUs, Tensor chips, Strix Halo, various AMD/Intel SKUs; RAM caps for M3 Ultra wrong; includes non‑existent “M4 Ultra.”
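The MoE complaint above is easy to see numerically: a bandwidth-bound estimate should use the parameters actually read per token (the routed experts), not the total. A minimal sketch with illustrative, assumed numbers (a 120B-total / 5B-active model on a 250 GB/s unified-memory machine at ~4-bit):

```python
# Why using total instead of active parameters underestimates MoE speed.
# Per-token decode reads only the weights of the experts actually routed
# to, not the whole model. All numbers here are assumptions.

def moe_tok_per_s(active_params_b: float, bytes_per_param: float,
                  bandwidth_gb_s: float) -> float:
    """Bandwidth-bound estimate from the per-token parameter count."""
    return bandwidth_gb_s / (active_params_b * bytes_per_param)

bandwidth = 250.0                 # GB/s, assumed unified-memory machine
total_b, active_b = 120.0, 5.0    # hypothetical MoE: 120B total, 5B active
bpp = 0.5                         # ~4-bit quantization

wrong = moe_tok_per_s(total_b, bpp, bandwidth)   # treats all params as hot
right = moe_tok_per_s(active_b, bpp, bandwidth)  # only routed experts
print(f"total-params estimate:  {wrong:.1f} tok/s")   # 4.2
print(f"active-params estimate: {right:.1f} tok/s")   # 100.0
```

A calculator dividing by total parameters would report roughly a 24x slower speed here, which matches the reports of big MoE models running far faster than predicted.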
UX & Feature Requests
- Requests for:
  - Ability to choose a model first and see performance across hardware.
  - Filters by task (coding, extraction, vision, embeddings) and by model quality, not just speed.
  - Clearer explanation of ratings (S/A/B/…) and metrics like latency/time‑to‑first‑token.
  - Better handling of quant levels, context sizes, and tool‑use behavior.
  - Higher-contrast, larger UI text; better mobile layout.
- Some want crowdsourced, benchmark‑style data instead of pure estimation.
Privacy & Hardware Detection
- Site uses browser APIs/WebGL/WebGPU as a heuristic for hardware; some are surprised their GPU specs are visible to websites and see fingerprinting risks.
- Others note detection is often wrong (e.g., mis-reporting VRAM or GPU model).
Local vs Cloud Tradeoffs
- Several argue that economics and quality still favor the cloud (Groq, frontier APIs), citing large speed and quality gaps.
- Others prioritize privacy, offline access, experimentation freedom, and narrow local tasks (OCR, STT, embeddings, small coding helpers) despite slower, weaker models.