Ask HN: How can ChatGPT serve 700M users when I can't run one GPT-4 locally?

Core technical reasons large providers scale

  • Inference is heavily parallelized and batched: many independent user requests are run together through the same layers, so model weights are read from VRAM once per batched forward pass and reused across hundreds or thousands of queries (see the batching sketch after this list).
  • Large models are sharded across many GPUs (tensor/pipeline/expert parallelism). Once the weights are resident in pooled VRAM, per‑token compute is relatively cheap.
  • Modern systems exploit KV caches, prefix/context caching, prompt deduplication, and structured decoding to avoid recomputing repeated work; even small percentage gains add up to huge GPU savings (a toy KV‑cache decode step is sketched below).
  • Mixture‑of‑Experts models activate only a subset of their weights for each token, cutting per‑token compute; speculative decoding with smaller “draft” models can add 2–4× speedups when tuned well (see the expert‑routing sketch below).
  • Specialized inference stacks (e.g., vLLM‑style engines) do continuous batching, smart routing, and autoscaling; the “secret sauce” is largely in scheduling, caching, and GPU utilization (a toy continuous‑batching loop closes out the sketches below).
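
To make the batching point concrete, here is a minimal NumPy sketch of how one weight read is amortized across a whole batch; the layer size, batch sizes, and repetition count are illustrative assumptions, not anything a provider has disclosed.

```python
import time
import numpy as np

# Toy "layer": one weight matrix standing in for a transformer block.
# Sizes are illustrative, not a real model configuration.
d_model = 4096
W = np.random.randn(d_model, d_model).astype(np.float32)  # ~64 MB of "weights"

def forward(batch):
    # One matmul streams W from memory once, no matter how many rows
    # (i.e., independent requests) are stacked in the batch.
    return batch @ W

for batch_size in (1, 32, 256):
    x = np.random.randn(batch_size, d_model).astype(np.float32)
    t0 = time.perf_counter()
    for _ in range(20):
        forward(x)
    elapsed = time.perf_counter() - t0
    per_request = elapsed / (20 * batch_size)
    print(f"batch={batch_size:4d}  time per request ≈ {per_request * 1e6:8.1f} µs")

# On memory-bandwidth-bound hardware the per-request time drops sharply as the
# batch grows, because the cost of reading W is shared by every request in it.
```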
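
The KV‑cache bullet can be illustrated the same way: a toy single‑head attention decode step that only computes the new token's query and appends one key/value row, instead of reprocessing the whole prompt. Dimensions and weights are made up for illustration.

```python
import numpy as np

d = 64  # head dimension (illustrative)
rng = np.random.default_rng(0)
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))

k_cache, v_cache = [], []  # grows by one row per generated token

def decode_step(x_new):
    """Attend the newest token against all previously cached keys/values."""
    q = x_new @ Wq                      # only the new token's query is computed
    k_cache.append(x_new @ Wk)          # append to the cache, don't recompute history
    v_cache.append(x_new @ Wv)
    K = np.stack(k_cache)               # (seq_len, d)
    V = np.stack(v_cache)
    scores = K @ q / np.sqrt(d)         # (seq_len,)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V                  # attention output for the new token

for t in range(5):
    out = decode_step(rng.standard_normal(d))
    print(f"step {t}: cache holds {len(k_cache)} keys, output norm {np.linalg.norm(out):.2f}")
```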
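
Next, a sketch of Mixture‑of‑Experts routing: a gate scores the experts and only the top‑k are run for a given token, so most expert weights are never touched. Expert count, hidden size, and top‑k are illustrative; real MoE layers add load balancing, capacity limits, and more.

```python
import numpy as np

d, n_experts, top_k = 128, 8, 2   # illustrative sizes, not a real model
rng = np.random.default_rng(1)
gate = rng.standard_normal((d, n_experts))
experts = [rng.standard_normal((d, d)) for _ in range(n_experts)]

def moe_layer(token):
    logits = token @ gate
    chosen = np.argsort(logits)[-top_k:]                    # indices of the top-k experts
    probs = np.exp(logits[chosen] - logits[chosen].max())   # softmax over the chosen experts
    probs /= probs.sum()
    # Only top_k of the n_experts weight matrices are read for this token.
    return sum(p * (token @ experts[i]) for p, i in zip(probs, chosen))

out = moe_layer(rng.standard_normal(d))
print(f"touched {top_k}/{n_experts} experts ≈ {top_k / n_experts:.0%} of expert weights per token")
```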
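
Finally, the scheduling side: a toy continuous‑batching loop in plain Python, where new requests join the running batch as soon as slots free up instead of waiting for the whole batch to drain. Slot count, queue length, and output lengths are arbitrary.

```python
import random

random.seed(0)
MAX_BATCH = 4   # illustrative number of batch slots
# Each fake request just needs some number of decode steps to finish.
waiting = [{"id": i, "remaining": random.randint(2, 6)} for i in range(10)]
running, step = [], 0

while waiting or running:
    # Refill free slots from the queue at every step (the "continuous" part).
    while waiting and len(running) < MAX_BATCH:
        running.append(waiting.pop(0))
    for req in running:          # one decode step for everything in the batch
        req["remaining"] -= 1
    finished = [r["id"] for r in running if r["remaining"] == 0]
    running = [r for r in running if r["remaining"] > 0]
    step += 1
    print(f"step {step:2d}: in batch={len(running) + len(finished)}  finished={finished}")
```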

Economies of scale and money

  • OpenAI and peers run on massive GPU clusters (H100‑class and custom chips) costing tens of thousands of dollars per card and millions per rack, plus huge power and cooling budgets.
  • Multi‑tenancy: most users are idle almost all the time; their few active minutes per day are time‑shared across large farms, yielding high utilization (rough numbers are sketched after this list).
  • Providers can also repurpose capacity for training when user load is low.
  • Several comments note OpenAI is burning billions per year and even losing money on Pro subscriptions; current pricing is widely seen as subsidized and justified as a land‑grab.
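
To make the multi‑tenancy point concrete, a back‑of‑envelope Python sketch; every input below is a guessed assumption, not a disclosed figure.

```python
# Rough multi-tenancy arithmetic. All inputs are illustrative guesses,
# not figures disclosed by OpenAI or anyone else.
users = 700e6                  # headline user count from the thread title
active_min_per_day = 5         # assumed average active minutes per user per day
avg_concurrent = users * active_min_per_day / (24 * 60)
print(f"average concurrent sessions ≈ {avg_concurrent / 1e6:.1f} M")

requests_per_gpu = 50          # assumed concurrent requests one batched GPU can serve
gpus_needed = avg_concurrent / requests_per_gpu
print(f"GPUs needed at that average load ≈ {gpus_needed:,.0f}")

# Even with crude assumptions, time-sharing mostly idle users across a farm is
# a very different proposition from one dedicated GPU per user.
```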

Why local feels hard

  • Home GPUs have limited VRAM and no high‑speed interconnect; large models either don’t fit or run with severe offloading penalties.
  • A single user can’t batch thousands of concurrent requests, so they can’t exploit the same memory‑bandwidth amortization that big services do (see the bandwidth sketch after this list).
  • Local hardware sits idle most of the time, so the cost per useful token is far higher than in a busy datacenter.
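
A rough sketch of that local bottleneck: at batch size 1, each generated token has to stream essentially all of the (active) weights through memory once, so tokens/second is bounded by memory bandwidth divided by model size. Model size, precision, and bandwidth figures below are illustrative assumptions.

```python
# Why single-user decoding is memory-bandwidth bound. All numbers are
# illustrative assumptions, not measurements of any specific model or GPU.
params_billion = 70        # a dense 70B-parameter model, for illustration
bytes_per_param = 2        # fp16/bf16; 4-bit quantization would be roughly 0.5
weight_gb = params_billion * bytes_per_param
print(f"weights ≈ {weight_gb} GB -> far beyond a 24 GB consumer GPU")

for setup, bandwidth_gb_s in [("consumer GPU (~1 TB/s VRAM)", 1000),
                              ("CPU + system RAM (~80 GB/s)", 80)]:
    max_tokens_per_s = bandwidth_gb_s / weight_gb   # upper bound: one full weight read per token
    print(f"{setup}: ≤ {max_tokens_per_s:.1f} tokens/s at batch size 1")

# A busy datacenter reuses each weight read for a whole batch of users, which is
# exactly the amortization a single local user cannot get.
```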

Competition, centralization, and skepticism

  • Discussion of Google’s TPUs, AWS Inferentia, and Nvidia‑based clouds: some think Google could “win” via its integrated hardware and ads business; others point to its poor enterprise execution.
  • Some see expanding AI datacenters as a wasteful bubble leading to e‑waste and huge energy/water use; others argue LLMs significantly boost productivity and will justify the build‑out.
  • Several worry that heavy batching and centralized infra make powerful models structurally hard to self‑host, reinforcing SaaS lock‑in despite open‑source progress.