LLM from scratch, part 28 – training a base model from scratch on an RTX 3090
Money vs. skills and who can build “real” LLMs
- Several commenters praise the project as a great learning exercise but note that modern frontier‑scale LLMs are primarily constrained by capital and hardware, not individual skill.
- Others push back: skill still comes first; large budgets mostly buy scale and throughput once you understand what you’re doing.
- There’s frustration that mediocre teams with strong branding and funding can outperform more talented but under‑resourced groups, a situation likened to fancy tourist‑trap restaurants vs. unknown great chefs.
What a single RTX 3090 (or similar) is good for
- Consensus: a single consumer GPU is valuable for prototyping, debugging, and small‑scale research (e.g., checking whether an idea is obviously bad, fine‑tuning, LoRA adapters, local inference; see the LoRA sketch after this list).
- The model in the article is described as GPT‑2‑class (~hundreds of millions of params), educational but not a “useful” general‑purpose LLM by today’s standards.
- Cloud compute is often more economical for heavy training, especially when the alternative is buying high‑VRAM cards (A100/H100/B200/5090), once purchase and power costs are amortized.
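As an illustration of the kind of small‑scale work a single 24 GB card handles well, here is a minimal from‑scratch LoRA adapter in PyTorch. It is a sketch only: the rank, scaling factor, and the choice of which linear layers to wrap are illustrative assumptions, not details from the article.

```python
# Minimal LoRA adapter sketch in PyTorch. Rank/alpha and the choice of which
# linear layers to wrap are illustrative assumptions, not from the article.
import math
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wraps a frozen nn.Linear and adds a trainable low-rank update: W x + (alpha/r) * B A x."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False          # freeze the pretrained weight
        self.r, self.alpha = r, alpha
        self.lora_A = nn.Parameter(torch.empty(r, base.in_features))
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, r))
        nn.init.kaiming_uniform_(self.lora_A, a=math.sqrt(5))  # B starts at zero, so the update starts at zero

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        scale = self.alpha / self.r
        return self.base(x) + scale * (x @ self.lora_A.T) @ self.lora_B.T

# Usage: wrap, say, the attention projections of a small GPT-style model and
# train only the LoRA parameters.
layer = LoRALinear(nn.Linear(768, 768))
out = layer(torch.randn(2, 16, 768))
trainable = [p for p in layer.parameters() if p.requires_grad]  # only lora_A and lora_B
```

Because only the low‑rank A/B matrices receive gradients, optimizer state stays tiny, which is the main memory win on consumer VRAM.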
Data quality, curation, and curriculum
- People note that large pretraining corpora are full of noisy “slop,” yet models still work; nonetheless, data filtering and curation are real levers of improvement, especially for smaller models (see the toy curation sketch after this list).
- Curriculum learning is viewed as helpful in principle, but ordering trillions of tokens is seen as a major logistical burden; some doubt how much it is actually used in frontier training.
- Tiny curated datasets (e.g., children’s‑story corpora) are cited as especially effective for small models; the article’s result that a more “educational” subset underperformed the raw web data surprises some.
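As a toy version of the filtering and ordering ideas above, the sketch below scores documents with two crude heuristics (stop‑word ratio as a quality proxy, mean sentence length as a difficulty proxy), drops the lowest‑quality documents, and sorts the remainder easiest‑first. The heuristics and threshold are placeholders, not anything the article or commenters actually use.

```python
# Toy data-curation sketch: a crude quality filter plus a curriculum ordering.
# The heuristics and the threshold are illustrative assumptions only.
import re

STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "that", "it"}

def quality_score(doc: str) -> float:
    """Fraction of tokens that are common stop words -- a rough 'is this English prose?' proxy."""
    tokens = re.findall(r"[a-z']+", doc.lower())
    if not tokens:
        return 0.0
    return sum(t in STOPWORDS for t in tokens) / len(tokens)

def difficulty_score(doc: str) -> float:
    """Mean sentence length in words -- a rough difficulty proxy for curriculum ordering."""
    sentences = [s for s in re.split(r"[.!?]+", doc) if s.strip()]
    if not sentences:
        return 0.0
    return sum(len(s.split()) for s in sentences) / len(sentences)

def curate(corpus: list[str], min_quality: float = 0.05) -> list[str]:
    """Drop documents below the quality threshold, then order the rest easiest-first."""
    kept = [d for d in corpus if quality_score(d) >= min_quality]
    return sorted(kept, key=difficulty_score)

docs = [
    "The cat sat on the mat. It was warm.",
    "asdf qwerty 12345 zxcv",   # no stop words -> filtered out
    "The mitochondrion is the organelle in which oxidative phosphorylation takes place.",
]
print(curate(docs))  # keeps two documents, short-sentence story first
```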
Training details: batch size, optimizers, precision
- Commenters emphasize that the article’s very small batch sizes and short training runs are major reasons its performance lags behind OpenAI’s GPT‑2; modern runs reach effective batch sizes of millions of tokens via data parallelism and gradient accumulation (see the loop sketch after this list).
- Discussion covers gradient accumulation, learning‑rate warmup/cooldown, dropout vs. weight decay, and Adam hyperparameters as important but subtle knobs.
- Mixed precision (FP16/BF16/TF32) is broadly considered safe and standard.
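To make these knobs concrete, here is a compressed PyTorch training‑loop sketch combining gradient accumulation (to build a larger effective batch on one GPU), BF16 autocast, AdamW with weight decay, gradient clipping, and linear warmup followed by cosine decay. All sizes and hyperparameters are illustrative placeholders, not the article’s settings.

```python
# Training-loop sketch showing the knobs discussed above: gradient accumulation,
# BF16 mixed precision, AdamW with weight decay, clipping, and warmup + cosine decay.
# All numbers are illustrative placeholders, not the article's settings.
import math
import torch

def lr_lambda(step: int, warmup: int = 200, total: int = 10_000) -> float:
    """Linear warmup to the peak LR, then cosine decay to zero."""
    if step < warmup:
        return step / max(1, warmup)
    progress = (step - warmup) / max(1, total - warmup)
    return 0.5 * (1.0 + math.cos(math.pi * min(1.0, progress)))

def train(model, loader, device="cuda", accum_steps=16, peak_lr=3e-4):
    # e.g. micro-batches of 8 sequences x 1024 tokens, accumulated 16x
    # => effective batch of ~131k tokens per optimizer step on one GPU.
    optimizer = torch.optim.AdamW(model.parameters(), lr=peak_lr,
                                  betas=(0.9, 0.95), weight_decay=0.1)
    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
    model.train()
    optimizer.zero_grad(set_to_none=True)
    for step, (inputs, targets) in enumerate(loader):
        inputs, targets = inputs.to(device), targets.to(device)
        # BF16 autocast: unlike FP16, no loss scaling is needed.
        with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
            logits = model(inputs)
            loss = torch.nn.functional.cross_entropy(
                logits.view(-1, logits.size(-1)), targets.view(-1))
        (loss / accum_steps).backward()      # scale so accumulated gradients average correctly
        if (step + 1) % accum_steps == 0:
            torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
            optimizer.step()
            scheduler.step()
            optimizer.zero_grad(set_to_none=True)
```

Running the same loop data‑parallel across many GPUs is how larger runs push the effective batch into the millions of tokens per step.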
Learning, pedagogy, and prerequisites
- The series is praised for depth and transparency, showing real experiments and limitations absent from polished papers.
- There’s debate over how much math you need: some argue it takes 12–18 months of linear algebra; others say a few hours of matrix basics plus practice are enough to follow most modern LLM work.