Ask HN: How can ChatGPT serve 700M users when I can't run one GPT-4 locally?
Core technical reasons large providers scale
- Inference is heavily parallelized and batched: many independent user requests are run together through the same layers, so the model weights are read from VRAM once per step and reused across hundreds or thousands of queries (see the roofline sketch after this list).
- Large models are sharded across many GPUs (tensor/pipeline/expert parallelism). Once the weights are resident in pooled VRAM, per‑token compute is relatively cheap.
- Modern systems exploit KV caches, prefix/context caching, and prompt deduplication to avoid recomputing repeated work, plus structured decoding tricks; even small percentage gains add up to huge GPU savings (see the prefix‑cache sketch below).
- Mixture‑of‑Experts models activate only a subset of weights per token, cutting per‑token compute (see the expert‑routing sketch below); speculative decoding with smaller “draft” models can add 2–4× speedups when tuned well.
- Specialized inference stacks (e.g., vLLM‑style engines) do continuous batching, smart routing, and autoscaling; the “secret sauce” is largely in scheduling, caching, and GPU utilization.
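To make the batching point concrete, here is a minimal roofline-style sketch under purely illustrative assumptions (a 70B-parameter dense model in fp16, roughly H100-class figures for memory bandwidth and matmul throughput): each decode step reads the weights once regardless of batch size, so aggregate tokens/s scales almost linearly with batch until compute becomes the bottleneck.

```python
# Toy roofline model of batched decoding: every request in the batch reuses
# the same weight read, so memory traffic is amortized across the batch.
# All numbers are illustrative assumptions, not measured figures.

PARAMS = 70e9            # dense 70B-parameter model (assumption)
BYTES_PER_PARAM = 2      # fp16 weights
HBM_BW = 3.3e12          # bytes/s of memory bandwidth (H100-class, rough)
PEAK_FLOPS = 1.0e15      # FLOP/s usable for matmuls (rough)

def decode_step_time(batch_size: int) -> float:
    """Seconds to produce one token for every request in the batch."""
    weight_read = PARAMS * BYTES_PER_PARAM / HBM_BW    # weights read once per step
    compute = 2 * PARAMS * batch_size / PEAK_FLOPS     # ~2 FLOPs per param per token
    return max(weight_read, compute)                   # roofline: slower of the two

for b in (1, 8, 64, 256):
    t = decode_step_time(b)
    print(f"batch={b:4d}  step={t*1e3:6.2f} ms  aggregate tokens/s={b/t:10.0f}")
```

With these numbers the step time stays pinned at the ~42 ms weight-read floor up to a few hundred concurrent requests, so per-request latency barely moves while aggregate throughput grows with the batch, which is exactly the amortization described above.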
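A toy sketch of prefix/KV caching, with `expensive_prefill` as a hypothetical stand-in for the real prefill pass: requests that share a system prompt reuse its cached key/value state, so only each user's suffix is new work.

```python
# Toy prefix cache: requests that share a prompt prefix (e.g. the same system
# prompt) reuse the cached key/value state instead of re-running prefill over
# those tokens. `expensive_prefill` is a hypothetical stand-in for that pass.

from functools import lru_cache
from typing import Tuple

def expensive_prefill(tokens: Tuple[int, ...]) -> str:
    print(f"  prefill over {len(tokens)} tokens")      # the costly attention pass
    return f"kv[{len(tokens)} tokens]"

@lru_cache(maxsize=1024)
def cached_prefill(prefix: Tuple[int, ...]) -> str:
    return expensive_prefill(prefix)                   # computed once per distinct prefix

SYSTEM_PROMPT = tuple(range(500))                      # 500 shared system-prompt tokens

def handle_request(user_tokens: Tuple[int, ...]) -> None:
    kv = cached_prefill(SYSTEM_PROMPT)                 # cache hit after the first request
    expensive_prefill(user_tokens)                     # only the user suffix is new work
    print(f"  reused {kv} + {len(user_tokens)} fresh tokens")

handle_request((1, 2, 3))   # prefills 500 + 3 tokens
handle_request((4, 5))      # prefills only 2 tokens; the system-prompt KV is reused
```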
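And a minimal NumPy sketch of top-k expert routing (toy shapes and random weights, not any real model's): the router scores all E experts but runs only K of the E expert MLPs for each token, which is where the per-token compute saving comes from.

```python
# Toy Mixture-of-Experts layer: the router picks the top-K experts per token,
# so only K/E of the expert weights do work for any given token.
import numpy as np

rng = np.random.default_rng(0)
D, E, K = 64, 8, 2                                   # hidden size, experts, experts per token
router_w = rng.standard_normal((D, E))
experts = [rng.standard_normal((D, D)) for _ in range(E)]   # one tiny "MLP" per expert

def moe_forward(x: np.ndarray) -> np.ndarray:
    """x: (tokens, D) -> (tokens, D), touching only K experts per token."""
    logits = x @ router_w                            # (tokens, E) routing scores
    topk = np.argsort(logits, axis=-1)[:, -K:]       # indices of the K best experts
    gates = np.take_along_axis(logits, topk, axis=-1)
    gates = np.exp(gates) / np.exp(gates).sum(-1, keepdims=True)   # softmax over the K
    out = np.zeros_like(x)
    for t in range(x.shape[0]):                      # per token: run K experts, skip E-K
        for j, e in enumerate(topk[t]):
            out[t] += gates[t, j] * (x[t] @ experts[e])
    return out

tokens = rng.standard_normal((4, D))
print(moe_forward(tokens).shape)                     # (4, 64): 2 of 8 experts used per token
```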
Economies of scale and money
- OpenAI and peers run on massive GPU clusters (H100‑class and custom chips) costing tens of thousands of dollars per card and millions per rack, plus huge power and cooling budgets.
- Multi‑tenancy: most users are idle almost all the time; their few active minutes per day are time‑shared across large farms, yielding high utilization (a rough estimate follows this list).
- Providers can also repurpose capacity for training when user load is low.
- Several comments note OpenAI is burning billions per year and even losing money on Pro subscriptions; current pricing is seen as subsidized, justified as a land‑grab.
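A back-of-envelope version of the multi-tenancy argument; every input below is a made-up but plausible assumption chosen to show the shape of the calculation, not a figure from the thread or from OpenAI.

```python
# Time-sharing mostly-idle users across a shared fleet (all inputs are guesses).
users = 700e6                # headline user count from the title
active_min_per_day = 5       # minutes/day a typical user actually generates tokens (guess)
tokens_per_min = 500         # tokens generated per active minute (guess)
tokens_per_gpu_s = 2_000     # aggregate tokens/s per GPU under heavy batching (guess)

daily_tokens = users * active_min_per_day * tokens_per_min
avg_tokens_per_s = daily_tokens / 86_400
gpus_at_full_util = avg_tokens_per_s / tokens_per_gpu_s

print(f"~{daily_tokens:.2e} tokens/day -> ~{avg_tokens_per_s:,.0f} tokens/s on average")
print(f"-> on the order of {gpus_at_full_util:,.0f} GPUs at full utilization "
      f"(more in practice for peaks, redundancy, and long prompts)")
```

Under these guesses the answer lands in the tens of thousands of GPUs once peak load and overhead are padded in: a fleet-sizing problem, not a per-user one.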
Why local feels hard
- Home GPUs have limited VRAM and no high‑speed interconnect; large models either don’t fit or run with severe offloading penalties (see the arithmetic after this list).
- A single user can’t batch thousands of concurrent requests, so the memory‑bandwidth amortization that big services rely on simply isn’t available at home.
- Local hardware sits idle most of the time, so the cost per useful token is far higher than in a busy datacenter.
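Rough arithmetic behind the “doesn’t fit” point, assuming a 70B-parameter dense model and a 24 GB consumer card: the weights alone overflow VRAM even at 4-bit quantization, before a single KV-cache byte is allocated, which is why CPU/NVMe offloading and its bandwidth penalty come into play.

```python
# Why big models don't fit on one consumer GPU: weight bytes alone.
# Model size and card are assumptions for illustration.
PARAMS = 70e9
VRAM_GB = 24                          # 24 GB consumer card

for name, bytes_per_param in [("fp16", 2), ("int8", 1), ("int4", 0.5)]:
    need_gb = PARAMS * bytes_per_param / 1e9
    verdict = "fits" if need_gb <= VRAM_GB else "does NOT fit"
    print(f"{name}: {need_gb:5.1f} GB of weights -> {verdict} in {VRAM_GB} GB "
          f"(and the KV cache still needs room)")
```

Flip the numbers and a ~30B model at 4-bit (~15 GB of weights) does fit, which is why local setups gravitate toward smaller, heavily quantized models.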
Competition, centralization, and skepticism
- Discussion of Google’s TPUs, AWS Inferentia, and Nvidia‑based clouds: some think Google could “win” via integrated hardware+ads; others point to its poor enterprise execution.
- Some see expanding AI datacenters as a wasteful bubble leading to e‑waste and huge energy/water use; others argue LLMs significantly boost productivity and will justify the build‑out.
- Several worry that heavy batching and centralized infra make powerful models structurally hard to self‑host, reinforcing SaaS lock‑in despite open‑source progress.