LLM from scratch, part 28 – training a base model from scratch on an RTX 3090
Money vs. skills and who can build “real” LLMs
- Several commenters praise the project as a great learning exercise but note that modern frontier‑scale LLMs are primarily constrained by capital and hardware, not individual skill.
- Others push back: skill still comes first; large budgets mostly buy scale and throughput once you understand what you’re doing.
- There’s frustration that mediocre teams with strong branding and funding can outperform more talented but under‑resourced groups, a situation likened to fancy tourist‑trap restaurants vs. unknown great chefs.
What a single RTX 3090 (or similar) is good for
- Consensus: a single consumer GPU is valuable for prototyping, debugging, and small‑scale research (e.g., checking whether an idea is obviously bad, fine‑tuning, LoRA adapters, local inference; see the LoRA sketch after this list).
- The model in the article is described as GPT‑2‑class (~hundreds of millions of params), educational but not a “useful” general‑purpose LLM by today’s standards.
- Cloud compute is often more economical for heavy training, especially when the alternative is buying high‑VRAM cards (A100/H100/B200/5090), once purchase and power costs are amortized.
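As an illustration of the kind of small‑scale work a single 24 GB card handles well, here is a minimal from‑scratch LoRA adapter in PyTorch. It is a sketch only: the rank, scaling factor, and the choice of which linear layers to wrap are illustrative assumptions, not details from the article.

```python
# Minimal LoRA adapter sketch in PyTorch. Rank/alpha and the choice of which
# linear layers to wrap are illustrative assumptions, not from the article.
import math
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wraps a frozen nn.Linear and adds a trainable low-rank update: W x + (alpha/r) * B A x."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False          # freeze the pretrained weight
        self.r, self.alpha = r, alpha
        self.lora_A = nn.Parameter(torch.empty(r, base.in_features))
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, r))
        nn.init.kaiming_uniform_(self.lora_A, a=math.sqrt(5))  # B starts at zero, so the update starts at zero

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        scale = self.alpha / self.r
        return self.base(x) + scale * (x @ self.lora_A.T) @ self.lora_B.T

# Usage: wrap, say, the attention projections of a small GPT-style model and
# train only the LoRA parameters.
layer = LoRALinear(nn.Linear(768, 768))
out = layer(torch.randn(2, 16, 768))
trainable = [p for p in layer.parameters() if p.requires_grad]  # only lora_A and lora_B
```

Because only the low‑rank A/B matrices receive gradients, optimizer state stays tiny, which is the main memory win on consumer VRAM.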
Data quality, curation, and curriculum
- People note that large pretraining corpora are full of noisy “slop,” yet models still work; nonetheless, data filtering and curation are real levers of improvement, especially for smaller models (see the toy curation sketch after this list).
- Curriculum learning is viewed as helpful in principle, but ordering trillions of tokens is seen as a major logistical burden; some doubt how much it is actually used in frontier training.
- Tiny curated datasets (e.g., children’s‑story corpora) are cited as especially effective for small models; the article’s result that a more “educational” subset underperformed the raw web data surprises some.
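As a toy version of the filtering and ordering ideas above, the sketch below scores documents with two crude heuristics (stop‑word ratio as a quality proxy, mean sentence length as a difficulty proxy), drops the lowest‑quality documents, and sorts the remainder easiest‑first. The heuristics and threshold are placeholders, not anything the article or commenters actually use.

```python
# Toy data-curation sketch: a crude quality filter plus a curriculum ordering.
# The heuristics and the threshold are illustrative assumptions only.
import re

STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "that", "it"}

def quality_score(doc: str) -> float:
    """Fraction of tokens that are common stop words -- a rough 'is this English prose?' proxy."""
    tokens = re.findall(r"[a-z']+", doc.lower())
    if not tokens:
        return 0.0
    return sum(t in STOPWORDS for t in tokens) / len(tokens)

def difficulty_score(doc: str) -> float:
    """Mean sentence length in words -- a rough difficulty proxy for curriculum ordering."""
    sentences = [s for s in re.split(r"[.!?]+", doc) if s.strip()]
    if not sentences:
        return 0.0
    return sum(len(s.split()) for s in sentences) / len(sentences)

def curate(corpus: list[str], min_quality: float = 0.05) -> list[str]:
    """Drop documents below the quality threshold, then order the rest easiest-first."""
    kept = [d for d in corpus if quality_score(d) >= min_quality]
    return sorted(kept, key=difficulty_score)

docs = [
    "The cat sat on the mat. It was warm.",
    "asdf qwerty 12345 zxcv",   # no stop words -> filtered out
    "The mitochondrion is the organelle in which oxidative phosphorylation takes place.",
]
print(curate(docs))  # keeps two documents, short-sentence story first
```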
Training details: batch size, optimizers, precision
- Commenters emphasize that the article’s very small batch sizes and short training runs are major reasons its performance lags behind OpenAI’s GPT‑2; modern runs reach effective batch sizes of millions of tokens via data parallelism and gradient accumulation (see the loop sketch after this list).
- Discussion covers gradient accumulation, learning‑rate warmup/cooldown, dropout vs. weight decay, and Adam hyperparameters as important but subtle knobs.
- Mixed precision (FP16/BF16/TF32) is broadly considered safe and standard.
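To make these knobs concrete, here is a compressed PyTorch training‑loop sketch combining gradient accumulation (to build a larger effective batch on one GPU), BF16 autocast, AdamW with weight decay, gradient clipping, and linear warmup followed by cosine decay. All sizes and hyperparameters are illustrative placeholders, not the article’s settings.

```python
# Training-loop sketch showing the knobs discussed above: gradient accumulation,
# BF16 mixed precision, AdamW with weight decay, clipping, and warmup + cosine decay.
# All numbers are illustrative placeholders, not the article's settings.
import math
import torch

def lr_lambda(step: int, warmup: int = 200, total: int = 10_000) -> float:
    """Linear warmup to the peak LR, then cosine decay to zero."""
    if step < warmup:
        return step / max(1, warmup)
    progress = (step - warmup) / max(1, total - warmup)
    return 0.5 * (1.0 + math.cos(math.pi * min(1.0, progress)))

def train(model, loader, device="cuda", accum_steps=16, peak_lr=3e-4):
    # e.g. micro-batches of 8 sequences x 1024 tokens, accumulated 16x
    # => effective batch of ~131k tokens per optimizer step on one GPU.
    optimizer = torch.optim.AdamW(model.parameters(), lr=peak_lr,
                                  betas=(0.9, 0.95), weight_decay=0.1)
    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
    model.train()
    optimizer.zero_grad(set_to_none=True)
    for step, (inputs, targets) in enumerate(loader):
        inputs, targets = inputs.to(device), targets.to(device)
        # BF16 autocast: unlike FP16, no loss scaling is needed.
        with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
            logits = model(inputs)
            loss = torch.nn.functional.cross_entropy(
                logits.view(-1, logits.size(-1)), targets.view(-1))
        (loss / accum_steps).backward()      # scale so accumulated gradients average correctly
        if (step + 1) % accum_steps == 0:
            torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
            optimizer.step()
            scheduler.step()
            optimizer.zero_grad(set_to_none=True)
```

Running the same loop data‑parallel across many GPUs is how larger runs push the effective batch into the millions of tokens per step.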
Learning, pedagogy, and prerequisites
- The series is praised for depth and transparency, showing real experiments and limitations absent from polished papers.
- There’s debate over how much math you need: some argue it takes 12–18 months of linear algebra; others say a few hours of matrix basics plus practice are enough to follow most modern LLM work.