Reproducing GPT-2 in llm.c
Hardware, Training Speed, and Cost
- GPT‑2 (124M) was reproduced on multi‑GPU A100 hardware and also on 4× AMD 7900 XTX; the 7900 XTX run took ~8.75 hours at ~55% of theoretical FLOPs.
- A single 7900 XTX is estimated to do the same run in under 24 hours for a few dollars of electricity.
- Rough extrapolation: a 350M‑parameter GPT‑3‑style model trained on 300B tokens might take ~140 hours on one box and cost on the order of a few thousand dollars, less with faster GPUs (e.g., H100).
- A 4090 has enough VRAM for 124M‑parameter training; it would mainly be slower than an 8×A100 setup rather than impossible.
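The ~140 hour extrapolation above can be sanity-checked with a back-of-the-envelope estimate. The sketch below assumes the common rule of thumb that training compute is roughly 6 × parameters × tokens. It also assumes "one box" means 8 GPUs at the A100's 312 TFLOP/s dense BF16 peak, running at 50% utilization. None of these figures come from the discussion itself, and `training_hours` is a hypothetical helper.

```python
# Back-of-the-envelope training-time estimate (a sketch, not a benchmark).
# Assumption: total training compute ~= 6 * parameters * tokens.

def training_hours(params, tokens, num_gpus, peak_flops_per_gpu, utilization):
    """Estimated wall-clock hours for one training run."""
    total_flops = 6.0 * params * tokens
    sustained_flops_per_s = num_gpus * peak_flops_per_gpu * utilization
    return total_flops / sustained_flops_per_s / 3600.0

# 350M parameters on 300B tokens, as in the extrapolation above.
# Assumed hardware: 8 GPUs, 312e12 FLOP/s peak each, 50% utilization.
hours_350m = training_hours(
    params=350e6,
    tokens=300e9,
    num_gpus=8,
    peak_flops_per_gpu=312e12,
    utilization=0.5,
)
print(f"350M params / 300B tokens: ~{hours_350m:.0f} hours")
```

Under these assumptions the estimate lands at roughly 140 hours. The result scales inversely with GPU count, peak throughput, and utilization, so a faster GPU or better kernel efficiency shortens the run proportionally.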
Inference on Consumer and CPU Hardware
- Older CPUs managed ~0.2 tokens/s with GPT‑2, but modern DDR5 systems with optimized code can exceed 1 token/s on CPU for LLaMA‑class models.
- Users report decent CPU‑only inference (e.g., 7B LLaMA variants) with modest RAM, though at “space heater” power usage.
- There is interest in large, GPU‑free, many‑core ARM/RISC‑V systems to escape proprietary CUDA stacks.
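One way to see why memory speed matters for CPU inference: generating a token typically requires reading every weight once, so memory bandwidth sets a rough ceiling on tokens per second. The sketch below is a simplified upper bound that ignores compute and cache effects. The 80 GB/s bandwidth figure and the byte widths are illustrative assumptions, not measurements from the discussion.

```python
# Rough upper bound on decode speed for a memory-bandwidth-bound model.
# Assumption: each generated token reads every weight exactly once.

def max_tokens_per_second(num_params, bytes_per_param, mem_bandwidth_bytes_per_s):
    """Bandwidth-limited ceiling on tokens generated per second."""
    model_bytes = num_params * bytes_per_param
    return mem_bandwidth_bytes_per_s / model_bytes

# Hypothetical system: ~80 GB/s of usable memory bandwidth.
assumed_bandwidth = 80e9

# 7B-parameter model stored at 16 bits vs. 4 bits per weight.
fp16_rate = max_tokens_per_second(7e9, 2.0, assumed_bandwidth)
q4_rate = max_tokens_per_second(7e9, 0.5, assumed_bandwidth)

print(f"7B @ 16-bit: at most ~{fp16_rate:.1f} tokens/s")
print(f"7B @ 4-bit:  at most ~{q4_rate:.1f} tokens/s")
```

Real throughput will sit below these ceilings, but the ratio shows why quantization and faster memory both help on CPU.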
Datasets, Access, and Copyright
- FineWeb offers 15T cleaned web tokens; the practical concern is downloading tens of TB, and the economics of Cloudflare egress were discussed.
- Some are willing to pay for curated datasets; others expect to rely on torrents plus targeted paid components (e.g., code).
- Debate over copyright: some argue for compensating creators; others note that enforcing copyright at dataset level mainly benefits large players and intermediaries.
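To put "tens of TB" in perspective, the transfer time at a given link speed is simple arithmetic. The sketch below uses a placeholder size of 40 TB and two hypothetical link speeds. The exact dataset size and any egress pricing are not specified in the discussion and are left out.

```python
# Transfer-time arithmetic for a large dataset download (illustrative only).

def download_hours(size_bytes, link_bits_per_s):
    """Hours needed to move size_bytes over a fully saturated link."""
    return size_bytes * 8 / link_bits_per_s / 3600.0

assumed_size = 40e12  # placeholder for "tens of TB"

hours_1g = download_hours(assumed_size, 1e9)    # hypothetical 1 Gbit/s link
hours_10g = download_hours(assumed_size, 10e9)  # hypothetical 10 Gbit/s link

print(f"1 Gbit/s:  ~{hours_1g:.0f} hours")
print(f"10 Gbit/s: ~{hours_10g:.0f} hours")
```

These figures assume the link stays saturated for the whole transfer, so real downloads would take longer.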
llm.c Goals vs Existing Stacks
- Project aims for a small, dependency‑light C/CUDA training stack, both for aesthetics and education.
- Compared to PyTorch, the current implementation reportedly achieves a modest speedup (single‑digit percent), largely from hand‑fused kernels.
- Author intends to minimize reliance on Python (currently mainly for tokenization) and drastically shrink the required binary/tooling footprint.
Architecture and Model Scaling
- LLaMA‑style tweaks over GPT‑2 (RoPE, bias removal, RMSNorm, SwiGLU, longer context, hyperparameter changes) are seen as helpful but not transformational if you train long enough.
- Some expect future GPT‑4‑level performance on consumer GPUs as both hardware and training efficiency improve; others doubt such capability will ever fit comfortably into 24GB VRAM.
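As an illustration of one of the tweaks listed above, the sketch below contrasts LayerNorm (used in GPT‑2) with RMSNorm in plain Python. It is a minimal version that omits the learned gain and bias parameters real implementations carry, and it is not taken from llm.c or any particular library.

```python
import math

def layer_norm(x, eps=1e-5):
    """LayerNorm without learned gain/bias: center, then scale to unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def rms_norm(x, eps=1e-5):
    """RMSNorm without learned gain: scale by the root mean square, no centering."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]

activations = [1.0, 2.0, 3.0, 4.0]
print("LayerNorm:", [round(v, 3) for v in layer_norm(activations)])
print("RMSNorm:  ", [round(v, 3) for v in rms_norm(activations)])
```

RMSNorm skips the mean subtraction, which makes it slightly cheaper. Its output keeps the sign and relative offset of the input instead of being centered at zero.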
Education, Language Choices, and Broader Impact
- Many commenters request video series and course material built around llm.c.
- There is disagreement on whether ML engineers “need” C; consensus leans toward Python for most, with C/CUDA relevant for low‑level infrastructure.
- Transformer dominance is linked to quadratic attention enabling rich token interactions; alternatives often trade expressivity for linear scaling.
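The quadratic cost mentioned above comes from scoring every query against every key. The sketch below is a naive single-head softmax attention in plain Python that also counts how many query-key scores it computes. It omits the causal mask GPT‑2 uses, along with the learned projections, and is meant only to show the scaling.

```python
import math

def attention(queries, keys, values):
    """Naive softmax attention; returns (outputs, number of query-key scores)."""
    scale = 1.0 / math.sqrt(len(queries[0]))
    outputs = []
    num_scores = 0
    for q in queries:
        # One score per key: this inner loop is what makes the cost n * n.
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
        num_scores += len(scores)
        peak = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        dim = len(values[0])
        out = [sum(w * v[d] for w, v in zip(weights, values)) / total
               for d in range(dim)]
        outputs.append(out)
    return outputs, num_scores

for n in (4, 8, 16):
    tokens = [[float(i), 1.0] for i in range(n)]
    _, count = attention(tokens, tokens, tokens)
    print(f"sequence length {n}: {count} scores")
```

Doubling the sequence length quadruples the number of scores. Linear-scaling alternatives avoid this cost by not scoring every pair directly.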