2024-05-28

Reproducing GPT-2 in llm.c

Hardware, Training Speed, and Cost

The GPT‑2 (124M) reproduction ran on multi‑GPU A100 hardware and also on 4× AMD 7900 XTX in ~8.75 hours, with ~55% of theoretical FLOPs utilized.
A single 7900 XTX is estimated to do the same run in under 24 hours for a few dollars of electricity.
Rough extrapolation: a 350M‑parameter GPT‑3‑style model trained on 300B tokens might cost on the order of a few thousand dollars and take ~140 hours on one box, or less with faster GPUs (e.g., H100).
A 4090 has enough VRAM for 124M‑parameter training; it would mainly be slower than an 8×A100 setup rather than impossible.
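
The ~140 hour extrapolation above can be sanity-checked with a back-of-the-envelope calculation. This is a minimal sketch, not the method used in the discussion: it assumes the common "6 × parameters × tokens" rule of thumb for training FLOPs, assumes "one box" means 8× A100 at 312 TFLOP/s dense BF16 peak each, and assumes 50% utilization. The function names are hypothetical.

```python
# Rough training-time estimate from the ~6 * N * D FLOPs rule of thumb.
# Assumptions (not stated in the discussion): the 6ND approximation,
# an 8x A100 box at 312 TFLOP/s dense BF16 peak per GPU, 50% utilization.

A100_BF16_PEAK_FLOPS = 312e12  # per-GPU dense BF16 peak, FLOP/s


def training_flops(n_params: float, n_tokens: float) -> float:
    """Approximate total training FLOPs (forward + backward) as 6 * N * D."""
    return 6.0 * n_params * n_tokens


def training_hours(n_params: float, n_tokens: float, n_gpus: int,
                   peak_flops_per_gpu: float, utilization: float) -> float:
    """Wall-clock hours at a given sustained fraction of peak throughput."""
    sustained_flops_per_s = n_gpus * peak_flops_per_gpu * utilization
    return training_flops(n_params, n_tokens) / sustained_flops_per_s / 3600.0


# 350M parameters, 300B tokens, hypothetical 8x A100 box at 50% utilization.
hours = training_hours(350e6, 300e9, 8, A100_BF16_PEAK_FLOPS, 0.5)
print(f"{hours:.0f} hours")  # prints "140 hours"
```

Under these assumptions the estimate lands close to the ~140 hours quoted above. Changing the utilization or the per-GPU peak (for example, for H100) shifts the result proportionally.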

Inference on Consumer and CPU Hardware

Older CPUs managed ~0.2 tokens/s with GPT‑2, but modern DDR5 systems and optimized code can exceed 1 token/s on CPU for LLaMA‑class models.
Users report decent CPU‑only inference (e.g., 7B LLaMA variants) with modest RAM, though at “space heater” power usage.
There is interest in large, GPU‑free, many‑core ARM/RISC‑V systems to escape proprietary CUDA stacks.
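
One way to see why memory speed matters for CPU inference: generating each token requires reading roughly all model weights once, so memory bandwidth sets an upper bound on tokens per second. The sketch below is only illustrative. The bandwidth-bound model, the 4-bit quantization, and the dual-channel DDR5-5600 figure (5600 MT/s × 8 bytes × 2 channels = 89.6 GB/s peak) are assumptions, not numbers from the discussion, and real throughput will be lower.

```python
# Upper bound on decode speed when generation is memory-bandwidth bound.
# Assumption: every generated token streams all weights from RAM once.

def max_tokens_per_second(n_params: float, bytes_per_param: float,
                          mem_bandwidth_bytes_per_s: float) -> float:
    """Bandwidth divided by model size in bytes gives a ceiling on tokens/s."""
    model_bytes = n_params * bytes_per_param
    return mem_bandwidth_bytes_per_s / model_bytes


# Hypothetical setup: dual-channel DDR5-5600, theoretical peak bandwidth.
DDR5_5600_DUAL_CHANNEL = 5600e6 * 8 * 2  # bytes per second

bound_4bit = max_tokens_per_second(7e9, 0.5, DDR5_5600_DUAL_CHANNEL)
bound_fp16 = max_tokens_per_second(7e9, 2.0, DDR5_5600_DUAL_CHANNEL)
print(f"7B at 4-bit: at most {bound_4bit:.1f} tokens/s")
print(f"7B at fp16:  at most {bound_fp16:.1f} tokens/s")
```

The bound scales inversely with bytes per parameter, which is why quantized 7B models are the usual choice for CPU-only setups.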

Datasets, Access, and Copyright

FineWeb offers 15T cleaned web tokens; the practical concern is downloading tens of TB, and commenters discuss Cloudflare egress economics.
Some are willing to pay for curated datasets; others expect to rely on torrents plus targeted paid components (e.g., code).
Debate over copyright: some argue for compensating creators; others note that enforcing copyright at dataset level mainly benefits large players and intermediaries.

llm.c Goals vs Existing Stacks

Project aims for a small, dependency‑light C/CUDA training stack, both for aesthetics and education.
Compared to PyTorch, current implementation reports a modest speedup (single‑digit percent), largely from hand‑fused kernels.
Author intends to minimize reliance on Python (currently mainly for tokenization) and drastically shrink the required binary/tooling footprint.

Architecture and Model Scaling

LLaMA‑style tweaks over GPT‑2 (RoPE, bias removal, RMSNorm, SwiGLU, longer context, hyperparameter changes) are seen as helpful but not transformational if you train long enough.
Some expect future GPT‑4‑level performance on consumer GPUs as both hardware and training efficiency improve; others doubt such capability will ever fit comfortably into 24GB VRAM.
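
The 24GB question is partly plain arithmetic: weights alone take parameters × bytes per parameter. A minimal sketch follows, assuming weights-only inference with a hypothetical 20% of VRAM reserved for activations and KV cache. That reserve and the precision list are illustrative assumptions, not figures from the thread.

```python
# How many parameters fit in a given amount of VRAM at different precisions.
# Assumption: weights dominate; a fixed fraction is reserved for everything else.

GIB = 1024 ** 3


def max_params_for_vram(vram_bytes: float, bytes_per_param: float,
                        reserved_fraction: float = 0.2) -> float:
    """Largest parameter count whose weights fit in the unreserved VRAM."""
    usable_bytes = vram_bytes * (1.0 - reserved_fraction)
    return usable_bytes / bytes_per_param


BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}

fits = {name: max_params_for_vram(24 * GIB, size)
        for name, size in BYTES_PER_PARAM.items()}
for name, n_params in fits.items():
    print(f"{name}: ~{n_params / 1e9:.1f}B parameters")
```

Under these assumptions a 24GB card holds on the order of 10B parameters at fp16 and roughly four times that at 4-bit, which frames how much efficiency gain the optimistic view depends on.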

Education, Language Choices, and Broader Impact

Many commenters request video series and course material built around llm.c.
There is disagreement on whether ML engineers “need” C; consensus leans toward Python for most, with C/CUDA relevant for low‑level infrastructure.
Transformer dominance is linked to quadratic attention enabling rich token interactions; alternatives often trade expressivity for linear scaling.
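
The quadratic versus linear trade-off can be made concrete by counting how many token pairs full attention scores as context grows. This is a simplified illustration with a hypothetical helper name: it counts query-key pairs only and ignores per-pair cost, heads, and layers.

```python
# Count the query-key pairs scored by full self-attention.
# With a causal mask, token i attends to tokens 0..i, giving n * (n + 1) / 2 pairs.

def attention_pairs(seq_len: int, causal: bool = True) -> int:
    """Number of token-to-token interactions computed for one sequence."""
    if causal:
        return seq_len * (seq_len + 1) // 2
    return seq_len * seq_len


for n in (1024, 8192, 65536):
    pairs = attention_pairs(n)
    # A linear-time alternative does work proportional to n instead.
    print(f"n={n}: {pairs} attention pairs vs {n} linear steps "
          f"(ratio ~{pairs / n:.0f}x)")
```

Every pair is a direct interaction between two tokens, which is the expressivity that linear-scaling alternatives give up in exchange for cost that grows with n rather than n².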
