Ask HN: How can ChatGPT serve 700M users when I can't run one GPT-4 locally?

Core technical reasons large providers scale

  • Inference is heavily parallelized and batched: many independent user requests are run together through the same layers, so model weights are read from VRAM once per batched forward pass and reused across hundreds or thousands of queries (see the batching sketch after this list).
  • Large models are sharded across many GPUs (tensor/pipeline/expert parallelism). Once the weights are resident in pooled VRAM, per‑token compute is relatively cheap.
  • Modern systems exploit KV caches, prefix/context caching, prompt deduplication, and structured decoding to avoid recomputing repeated work; even small percentage gains add up to huge GPU savings (a toy KV‑cache decode step is sketched below).
  • Mixture‑of‑Experts models activate only a subset of their weights for each token, cutting per‑token compute; speculative decoding with smaller “draft” models can add 2–4× speedups when tuned well (see the expert‑routing sketch below).
  • Specialized inference stacks (e.g., vLLM‑style engines) do continuous batching, smart routing, and autoscaling; the “secret sauce” is largely in scheduling, caching, and GPU utilization (a toy continuous‑batching loop closes out the sketches below).
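
To make the batching point concrete, here is a minimal NumPy sketch of how one weight read is amortized across a whole batch; the layer size, batch sizes, and repetition count are illustrative assumptions, not anything a provider has disclosed.

```python
import time
import numpy as np

# Toy "layer": one weight matrix standing in for a transformer block.
# Sizes are illustrative, not a real model configuration.
d_model = 4096
W = np.random.randn(d_model, d_model).astype(np.float32)  # ~64 MB of "weights"

def forward(batch):
    # One matmul streams W from memory once, no matter how many rows
    # (i.e., independent requests) are stacked in the batch.
    return batch @ W

for batch_size in (1, 32, 256):
    x = np.random.randn(batch_size, d_model).astype(np.float32)
    t0 = time.perf_counter()
    for _ in range(20):
        forward(x)
    elapsed = time.perf_counter() - t0
    per_request = elapsed / (20 * batch_size)
    print(f"batch={batch_size:4d}  time per request ≈ {per_request * 1e6:8.1f} µs")

# On memory-bandwidth-bound hardware the per-request time drops sharply as the
# batch grows, because the cost of reading W is shared by every request in it.
```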
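
The KV‑cache bullet can be illustrated the same way: a toy single‑head attention decode step that only computes the new token's query and appends one key/value row, instead of reprocessing the whole prompt. Dimensions and weights are made up for illustration.

```python
import numpy as np

d = 64  # head dimension (illustrative)
rng = np.random.default_rng(0)
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))

k_cache, v_cache = [], []  # grows by one row per generated token

def decode_step(x_new):
    """Attend the newest token against all previously cached keys/values."""
    q = x_new @ Wq                      # only the new token's query is computed
    k_cache.append(x_new @ Wk)          # append to the cache, don't recompute history
    v_cache.append(x_new @ Wv)
    K = np.stack(k_cache)               # (seq_len, d)
    V = np.stack(v_cache)
    scores = K @ q / np.sqrt(d)         # (seq_len,)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V                  # attention output for the new token

for t in range(5):
    out = decode_step(rng.standard_normal(d))
    print(f"step {t}: cache holds {len(k_cache)} keys, output norm {np.linalg.norm(out):.2f}")
```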
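
Next, a sketch of Mixture‑of‑Experts routing: a gate scores the experts and only the top‑k are run for a given token, so most expert weights are never touched. Expert count, hidden size, and top‑k are illustrative; real MoE layers add load balancing, capacity limits, and more.

```python
import numpy as np

d, n_experts, top_k = 128, 8, 2   # illustrative sizes, not a real model
rng = np.random.default_rng(1)
gate = rng.standard_normal((d, n_experts))
experts = [rng.standard_normal((d, d)) for _ in range(n_experts)]

def moe_layer(token):
    logits = token @ gate
    chosen = np.argsort(logits)[-top_k:]                    # indices of the top-k experts
    probs = np.exp(logits[chosen] - logits[chosen].max())   # softmax over the chosen experts
    probs /= probs.sum()
    # Only top_k of the n_experts weight matrices are read for this token.
    return sum(p * (token @ experts[i]) for p, i in zip(probs, chosen))

out = moe_layer(rng.standard_normal(d))
print(f"touched {top_k}/{n_experts} experts ≈ {top_k / n_experts:.0%} of expert weights per token")
```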
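
Finally, the scheduling side: a toy continuous‑batching loop in plain Python, where new requests join the running batch as soon as slots free up instead of waiting for the whole batch to drain. Slot count, queue length, and output lengths are arbitrary.

```python
import random

random.seed(0)
MAX_BATCH = 4   # illustrative number of batch slots
# Each fake request just needs some number of decode steps to finish.
waiting = [{"id": i, "remaining": random.randint(2, 6)} for i in range(10)]
running, step = [], 0

while waiting or running:
    # Refill free slots from the queue at every step (the "continuous" part).
    while waiting and len(running) < MAX_BATCH:
        running.append(waiting.pop(0))
    for req in running:          # one decode step for everything in the batch
        req["remaining"] -= 1
    finished = [r["id"] for r in running if r["remaining"] == 0]
    running = [r for r in running if r["remaining"] > 0]
    step += 1
    print(f"step {step:2d}: in batch={len(running) + len(finished)}  finished={finished}")
```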

Economies of scale and money

  • OpenAI and peers run on massive GPU clusters (H100‑class and custom chips) costing tens of thousands of dollars per card and millions per rack, plus huge power and cooling budgets.
  • Multi‑tenancy: most users are idle almost all the time; their few active minutes per day are time‑shared across large farms, yielding high utilization (rough numbers are sketched after this list).
  • Providers can also repurpose capacity for training when user load is low.
  • Several comments note OpenAI is burning billions per year and even losing money on Pro subscriptions; current pricing is widely seen as subsidized and justified as a land‑grab.
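
To make the multi‑tenancy point concrete, a back‑of‑envelope Python sketch; every input below is a guessed assumption, not a disclosed figure.

```python
# Rough multi-tenancy arithmetic. All inputs are illustrative guesses,
# not figures disclosed by OpenAI or anyone else.
users = 700e6                  # headline user count from the thread title
active_min_per_day = 5         # assumed average active minutes per user per day
avg_concurrent = users * active_min_per_day / (24 * 60)
print(f"average concurrent sessions ≈ {avg_concurrent / 1e6:.1f} M")

requests_per_gpu = 50          # assumed concurrent requests one batched GPU can serve
gpus_needed = avg_concurrent / requests_per_gpu
print(f"GPUs needed at that average load ≈ {gpus_needed:,.0f}")

# Even with crude assumptions, time-sharing mostly idle users across a farm is
# a very different proposition from one dedicated GPU per user.
```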

Why local feels hard

  • Home GPUs have limited VRAM and no high‑speed interconnect; large models either don’t fit or run with severe offloading penalties.
  • A single user can’t batch thousands of concurrent requests, so they can’t exploit the same memory‑bandwidth amortization that big services do (see the bandwidth sketch after this list).
  • Local hardware sits idle most of the time, so the cost per useful token is far higher than in a busy datacenter.
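
A rough sketch of that local bottleneck: at batch size 1, each generated token has to stream essentially all of the (active) weights through memory once, so tokens/second is bounded by memory bandwidth divided by model size. Model size, precision, and bandwidth figures below are illustrative assumptions.

```python
# Why single-user decoding is memory-bandwidth bound. All numbers are
# illustrative assumptions, not measurements of any specific model or GPU.
params_billion = 70        # a dense 70B-parameter model, for illustration
bytes_per_param = 2        # fp16/bf16; 4-bit quantization would be roughly 0.5
weight_gb = params_billion * bytes_per_param
print(f"weights ≈ {weight_gb} GB -> far beyond a 24 GB consumer GPU")

for setup, bandwidth_gb_s in [("consumer GPU (~1 TB/s VRAM)", 1000),
                              ("CPU + system RAM (~80 GB/s)", 80)]:
    max_tokens_per_s = bandwidth_gb_s / weight_gb   # upper bound: one full weight read per token
    print(f"{setup}: ≤ {max_tokens_per_s:.1f} tokens/s at batch size 1")

# A busy datacenter reuses each weight read for a whole batch of users, which is
# exactly the amortization a single local user cannot get.
```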

Competition, centralization, and skepticism

  • Discussion of Google’s TPUs, AWS Inferentia, and Nvidia‑based clouds: some think Google could “win” via its integrated hardware and ads business; others point to its poor enterprise execution.
  • Some see expanding AI datacenters as a wasteful bubble leading to e‑waste and huge energy/water use; others argue LLMs significantly boost productivity and will justify the build‑out.
  • Several worry that heavy batching and centralized infra make powerful models structurally hard to self‑host, reinforcing SaaS lock‑in despite open‑source progress.