LLM from scratch, part 28 – training a base model from scratch on an RTX 3090

Money vs. skills and who can build “real” LLMs

  • Several commenters praise the project as a great learning exercise but note that modern frontier‑scale LLMs are primarily constrained by capital and hardware, not individual skill.
  • Others push back: skills still matter first; large budgets mostly buy scale and throughput once you understand what you’re doing.
  • There’s frustration that mediocre teams with strong branding and funding can outperform more talented but under‑resourced groups; one analogy likens it to fancy tourist‑trap restaurants beating unknown great chefs.

What a single RTX 3090 (or similar) is good for

  • Consensus: a single consumer GPU is valuable for prototyping, debugging, and small‑scale research (e.g., checking whether an idea is obviously bad, fine‑tuning, LoRA, local inference); a LoRA sketch follows this list.
  • The model in the article is described as GPT‑2‑class (~hundreds of millions of params), educational but not a “useful” general‑purpose LLM by today’s standards.
  • Cloud compute is often more economical for heavy training, especially for high‑VRAM cards (A100/H100/B200/5090), once purchase and power costs are amortized.
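
For the fine‑tuning/LoRA use case mentioned above, here is a minimal sketch of what a single‑GPU LoRA run might look like, assuming the Hugging Face transformers and peft libraries; the base model, rank, and target modules are illustrative choices, not the article's setup.

```python
# Minimal single-GPU LoRA fine-tuning sketch (illustrative; not the article's code).
# Assumes: pip install torch transformers peft
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

device = "cuda"  # e.g., a single RTX 3090

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").to(device)

# LoRA freezes the base model and trains small low-rank adapters, so only a tiny
# fraction of parameters needs gradients/optimizer state -- this is what makes
# 24 GB of VRAM go a long way.
lora_cfg = LoraConfig(
    r=8,                        # adapter rank (illustrative)
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["c_attn"],  # GPT-2's fused QKV projection
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()  # prints trainable vs. total parameter counts

# One illustrative training step on a dummy batch.
optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=2e-4)
batch = tokenizer("hello world", return_tensors="pt").to(device)
out = model(**batch, labels=batch["input_ids"])
out.loss.backward()
optimizer.step()
optimizer.zero_grad()
```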

Data quality, curation, and curriculum

  • People note that large pretraining corpora are full of noisy “slop,” yet models still work; nonetheless, data filtering and curation are real levers of improvement, especially for smaller models (a toy filtering sketch follows this list).
  • Curriculum learning is viewed as helpful in principle, but ordering trillions of tokens is seen as logistically huge; some doubt how much it’s used in frontier training.
  • Tiny curated datasets (e.g., children’s‑story corpora) are cited as especially effective for small models; the article’s result that a more “educational” subset underperformed the raw web data surprises some.
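
To make the filtering‑and‑curation point concrete, here is a toy sketch of the kind of heuristic document filters often applied to web‑scraped pretraining text. The rules and thresholds are illustrative, in the spirit of published pipelines, and not anything specified in the thread.

```python
# Toy heuristic quality filter for pretraining documents (illustrative thresholds).
def keep_document(text: str) -> bool:
    words = text.split()
    if not (50 <= len(words) <= 100_000):        # drop very short / very long docs
        return False
    mean_word_len = sum(len(w) for w in words) / len(words)
    if not (3 <= mean_word_len <= 10):           # gibberish or markup-heavy text
        return False
    alpha_frac = sum(c.isalpha() for c in text) / max(len(text), 1)
    if alpha_frac < 0.6:                         # too many symbols/numbers
        return False
    return True

# Apply the filter, then (optionally) sort by a crude "average sentence length"
# proxy as a stand-in for easy-to-hard curriculum ordering.
def curate(docs):
    kept = [d for d in docs if keep_document(d)]
    kept.sort(key=lambda d: len(d) / max(d.count("."), 1))
    return kept
```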

Training details: batch size, optimizers, precision

  • Commenters emphasize that the article’s very small batch sizes and short training are major reasons performance lags behind OpenAI’s GPT‑2; modern runs use effective batch sizes of millions of tokens via data/gradient parallelism.
  • Discussion covers gradient accumulation, learning‑rate warmup/cooldown, dropout vs. weight decay, and Adam hyperparameters as important but subtle knobs; a combined sketch follows this list.
  • Mixed precision (FP16/BF16/TF32) is broadly considered safe and standard.
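
A minimal sketch of how gradient accumulation, LR warmup with cosine decay, AdamW, and bf16 mixed precision typically fit together in a PyTorch training loop; every number here (batch sizes, schedule, hyperparameters) is a placeholder, not the article's configuration.

```python
# Sketch of a PyTorch training loop with gradient accumulation, LR warmup + cosine
# decay, AdamW, and bf16 autocast. All numbers are illustrative placeholders.
import math
import torch

micro_batch = 8      # sequences per forward pass (what fits in VRAM)
accum_steps = 64     # gradient accumulation steps per optimizer step
seq_len = 1024
# Effective batch = 8 * 64 * 1024 = 524,288 tokens per optimizer step on one GPU;
# data parallelism across N GPUs multiplies this by N, which is how frontier runs
# reach millions of tokens per step.

warmup_steps, total_steps, peak_lr = 1_000, 100_000, 3e-4

def lr_at(step: int) -> float:
    """Linear warmup followed by cosine decay to zero."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return peak_lr * 0.5 * (1 + math.cos(math.pi * progress))

def train(model, data_loader):
    optimizer = torch.optim.AdamW(model.parameters(), lr=peak_lr,
                                  betas=(0.9, 0.95), weight_decay=0.1)
    model.train()
    step = 0
    optimizer.zero_grad(set_to_none=True)
    for i, (inputs, targets) in enumerate(data_loader):
        # bf16 autocast: activations in bf16, master weights stay in fp32.
        with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
            logits = model(inputs)
            loss = torch.nn.functional.cross_entropy(
                logits.view(-1, logits.size(-1)), targets.view(-1))
        (loss / accum_steps).backward()  # scale so accumulated grads average out
        if (i + 1) % accum_steps == 0:
            torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
            for group in optimizer.param_groups:
                group["lr"] = lr_at(step)
            optimizer.step()
            optimizer.zero_grad(set_to_none=True)
            step += 1
```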

Learning, pedagogy, and prerequisites

  • The series is praised for depth and transparency, showing real experiments and limitations absent from polished papers.
  • There’s debate over how much math you need: some argue it takes 12–18 months of linear algebra; others say a few hours of matrix basics plus hands‑on practice are enough to follow most modern LLM work.