Reproducing GPT-2 in llm.c

Hardware, Training Speed, and Cost

  • Reproducing GPT‑2 (124M) was done on multi‑GPU A100 and also on 4× AMD 7900 XTX in ~8.75 hours, with ~55% of theoretical FLOPs utilized.
  • A single 7900 XTX is estimated to do the same run in under 24 hours for a few dollars of electricity.
  • Rough extrapolation: a 350M‑parameter GPT‑3‑style model trained on 300B tokens might cost on the order of a few thousand dollars and ~140 hours on one box, less with faster GPUs (e.g., H100).
  • A 4090 has enough VRAM for 124M‑parameter training; it would mainly be slower than an 8×A100 setup rather than impossible.

Inference on Consumer and CPU Hardware

  • Older CPUs previously managed ~0.2 tokens/s with GPT‑2, but modern DDR5 systems and optimized code can exceed 1 token/s on CPU for LLaMA‑class models.
  • Users report decent CPU‑only inference (e.g., 7B LLaMA variants) with modest RAM, though at “space heater” power usage.
  • There is interest in large, GPU‑free, many‑core ARM/RISC‑V systems to escape proprietary CUDA stacks.

Datasets, Access, and Copyright

  • FineWeb offers 15T cleaned web tokens; practical concern is downloading tens of TB, with Cloudflare egress economics discussed.
  • Some are willing to pay for curated datasets; others expect to rely on torrents plus targeted paid components (e.g., code).
  • Debate over copyright: some argue for compensating creators; others note that enforcing copyright at dataset level mainly benefits large players and intermediaries.

llm.c Goals vs Existing Stacks

  • Project aims for a small, dependency‑light C/CUDA training stack, both for aesthetics and education.
  • Compared to PyTorch, current implementation reports a modest speedup (single‑digit percent), largely from hand‑fused kernels.
  • Author intends to minimize reliance on Python (currently mainly for tokenization) and drastically shrink the required binary/tooling footprint.

Architecture and Model Scaling

  • LLaMA‑style tweaks over GPT‑2 (RoPE, bias removal, RMSNorm, SwiGLU, longer context, hyperparameter changes) are seen as helpful but not transformational if you train long enough.
  • Some expect future GPT‑4‑level performance on consumer GPUs as both hardware and training efficiency improve; others doubt such capability will ever fit comfortably into 24GB VRAM.

Education, Language Choices, and Broader Impact

  • Many commenters request video series and course material built around llm.c.
  • There is disagreement on whether ML engineers “need” C; consensus leans toward Python for most, with C/CUDA relevant for low‑level infrastructure.
  • Transformer dominance is linked to quadratic attention enabling rich token interactions; alternatives often trade expressivity for linear scaling.