How large are large language models?
Model Size and Hardware Requirements
- Several rules of thumb were discussed:
- 1B parameters ≈ 2 GB in FP16 (2 bytes/weight) or ≈ 1 GB at 8-bit quantization.
- A rough “VRAM budget” is often ~4 GB per billion parameters once overhead is counted, so 2B ≈ 8 GB VRAM, 7B ≈ ~28 GB, 70B ≈ ~280 GB, unless heavily quantized (see the sizing sketch after this list).
- Inference is typically bandwidth-bound; high-bandwidth memory (GPU VRAM, Apple M-series unified memory, unified-memory APUs) matters more than large but slow system RAM.
- Quantization (8-bit, 5-bit, 4-bit) can cut memory 2–4× with modest or task-dependent quality loss; models trained natively at low bit-width may outperform post-quantized ones.
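A minimal sketch of these sizing rules, assuming 2 bytes per weight at FP16 and treating the “~4 GB of VRAM per billion parameters” heuristic as a 2× overhead factor on top of raw FP16 weights; the overhead factor is the thread's rule of thumb, not a measured constant:
```python
def weight_size_gb(params_billion: float, bits_per_weight: int = 16) -> float:
    """Raw weight storage in GB: parameters * (bits per weight / 8) bytes."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

def vram_budget_gb(params_billion: float, bits_per_weight: int = 16,
                   overhead: float = 2.0) -> float:
    """Rough serving budget: weights times an overhead factor for KV cache,
    activations, and runtime buffers (2x is the thread's rule of thumb)."""
    return weight_size_gb(params_billion, bits_per_weight) * overhead

for n_billion in (2, 7, 70):
    print(f"{n_billion}B: FP16 weights ~{weight_size_gb(n_billion):.0f} GB, "
          f"VRAM budget ~{vram_budget_gb(n_billion):.0f} GB, "
          f"4-bit weights ~{weight_size_gb(n_billion, 4):.1f} GB")
```
Running this reproduces the thread's figures (2B ≈ 8 GB, 7B ≈ 28 GB, 70B ≈ 280 GB) and shows how 4-bit quantization shrinks the weight portion roughly 4×.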
Data Scale and “Size of the Internet”
- One thread compares model sizes (hundreds of billions of parameters, i.e. on the order of 1 TB of FP16 weights) to the volume of human-written text:
- Back-of-envelope estimates for “all digitized books” cluster around a few to a few tens of TB; one concrete calculation (using Anna’s Archive stats and compression) gives ~30 TB raw and ~5.5 TB compressed (the arithmetic is sketched after this list).
- There is strong disagreement with a claim that “the public web is ~50 TB”; others point to zettabyte-scale web estimates and Common Crawl adding ~250 TB/month. It’s unclear what exact definition (text-only, deduped, etc.) the smaller figures use.
- Some argue LLMs are already trained on ~1–10% of “all available English text” and that returns from more training data may be saturating, pushing advances toward inference-time “reasoning” and tools/agents.
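The book-corpus figure is simple back-of-envelope arithmetic; the sketch below shows its shape with placeholder inputs chosen to land near the thread's numbers, not the actual Anna’s Archive statistics:
```python
# Illustrative back-of-envelope: corpus size = titles * average file size.
# The inputs are hypothetical placeholders, not Anna's Archive's real stats.
num_titles = 30_000_000      # hypothetical count of distinct digitized books
avg_size_mb = 1.0            # hypothetical average size per book file
compression_ratio = 5.5      # assumed further compression of the raw files

raw_tb = num_titles * avg_size_mb / 1e6          # MB -> TB
compressed_tb = raw_tb / compression_ratio
print(f"raw ~{raw_tb:.0f} TB, compressed ~{compressed_tb:.1f} TB")
```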
LLMs as Compression (and Its Limits)
- Many commenters like the metaphor of LLMs as lossy compression of human knowledge (“blurry JPEG of the web”); they highlight:
- Astonishment at what an 8 GB local model can do (history, games, animal facts), and comparisons to compressed Wikipedia (24 GB).
- Information-theoretic work showing language modeling is closely tied to compression, and evaluations that treat modeling as a compression task.
- Others caution that calling LLMs “compression” is misleading:
- Traditional compression is predictably lossy or lossless and verifiable; LLM output is unpredictably wrong and requires human checking.
- For most classic compression use-cases (archives, legal docs), LLM-style “compression” is unacceptable.
- A more technical thread notes that:
- Given weights shared between sender and receiver, an LLM plus arithmetic coding implements lossless compression whose output size approaches the negative log-likelihood the model assigns to the text (see the sketch after this list).
- Training itself can be viewed as lossless compression in the minimum-description-length sense: the training loss is a description length of the training data, and that quantity, not the size of the final weights, is what counts.
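A sketch of the arithmetic-coding point: with the model shared by both sides, the achievable lossless code length for a text is roughly the sum of −log2 p(token | context) over its tokens. The snippet only computes that Shannon bound from per-token probabilities; the probabilities are a toy stand-in rather than output from a real LLM, and no actual arithmetic coder is implemented:
```python
import math

def code_length_bits(token_probs):
    """Shannon code length achievable with arithmetic coding when both
    sides share the model: sum of -log2 p(token | context)."""
    return sum(-math.log2(p) for p in token_probs)

# Toy stand-in for the model's probability of each successive token;
# a real setup would read these off an LLM's softmax at each position.
probs = [0.9, 0.5, 0.25, 0.8, 0.05]
bits = code_length_bits(probs)
print(f"~{bits:.1f} bits (~{bits / 8:.1f} bytes) to encode these 5 tokens")
```
The better the model predicts the text, the closer each probability is to 1 and the fewer bits the encoding needs, which is the sense in which a stronger language model is a stronger compressor.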
Model Scale, Capability, and Synthetic Data
- Commenters note that open models only approached GPT-4-level reasoning when they crossed into very large dense (≈400B+) or high-activation MoE ranges, after years of 30–70B attempts failing to match GPT-4.
- Some speculate that even larger frontier models were tried and quietly abandoned due to disappointing returns, suggesting optimal “frontier” sizes may now be smaller than the largest public models.
- Debate on synthetic data:
- One side warns about “model collapse” when models are trained on their own outputs.
- Others counter that, in practice, carefully designed synthetic data (especially teacher–student distillation or code with executable tests) reliably improves performance; labs wouldn’t use it otherwise.
Critique of the Article and Model Coverage
- Multiple factual and contextual issues are raised:
- Confusion between different Meta models/variants and misstatements about training tokens.
- Overstated claims about MoE enabling training without large GPU clusters.
- Lack of discussion of quantized sizes despite a “how big are they?” framing.
- Omission of notable families (Gemma, Gemini, T5, Mistral Large) while including smaller or less central models.
- The author acknowledges some errors and clarifies specific points, but several commenters still characterize the article as incomplete or “sloppy” and overly focused on token counts rather than practical size and usage.
Reasoning, Intelligence, and Future Directions
- Long subthreads debate:
- Whether LLM “reasoning” is fundamentally weaker than human reasoning despite vastly larger “working memory.”
- Claims that humans learn from far less data vs. counters that human sensory input from birth (especially vision) is enormous.
- Whether we are “out of training data” (for text) vs. large untapped sources (video, robotics, specialized interaction logs).
- Some see intelligence as fundamentally related to compression/prediction; others emphasize novelty and idea generation beyond seen data.
- There is speculation that:
- Architecture and training-method improvements could reduce required model sizes for a given capability.
- Consumer-grade hardware (high-end PCs or even phones) may eventually suffice for extremely capable models, with the internet serving as factual backing via tools and retrieval rather than being fully “baked into” the weights.