TinyStories: How Small Can Language Models Be and Still Speak Coherent English? (2023)

Tiny Models vs Older Small LMs

  • The thread notes that older ~125M-parameter models (GPT-2 small, GPT-Neo small) were quite weak, whereas newer tiny architectures (e.g., RWKV, SmolLM, and others) are perceived as much better at similar sizes.
  • Some users test RWKV and conclude it’s still frequently incoherent, especially on basic Q&A and consistency; others are impressed by its capabilities for its size.
  • Multiple comments emphasize that sub‑1B models often feel like Markov chains; around 3B parameters is where coherent, controllable behavior and RAG start to work reliably.

Weird Failure Modes & “Attention” Debate

  • Tiny models often produce “fever dream” or very dark, off‑topic stories, including on seemingly benign prompts.
  • Several comments trace this to limited internal state and morbid content in synthetic training data, not to any human‑like psychology.
  • Extended discussion argues that comparing LLM “attention” to ADHD is misleading:
    • ADHD is a complex neuropsychiatric condition, not just “lack of attention.”
    • Transformer attention is a mathematical mechanism; the shared word “attention” is an accident of terminology.
    • Metaphors can help thinking but can also confuse when they suggest incorrect parallels to human disorders.

RAG and Model Size / Architecture

  • Several participants state that only a handful of models, all ≥3B parameters, handle retrieval‑augmented generation (RAG) well.
  • Common failure modes: models ignore instructions, “continue” the retrieved text instead of answering, or get lost in long context.
  • Ideas to “distribute” RAG across many small models:
    • Classifier model routes queries to domain‑specific submodels.
    • Richer indexing and metadata in vector stores to pick the right model per chunk.
    • Scoring‑function approaches (e.g., ColBERT‑style) and MoE‑like designs.
  • Benefits seen mainly in privacy/control, not obviously in raw capability.
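
The routing idea above can be sketched with a trivial keyword classifier. This is a minimal illustration, not anything proposed verbatim in the thread: the domain names, keyword sets, and the `route` function are all hypothetical placeholders standing in for a real classifier model and real domain‑specific submodels.

```python
# Hypothetical sketch: route a query to a domain-specific submodel
# based on keyword overlap. In practice the router would be a small
# trained classifier, not a keyword match.

DOMAIN_KEYWORDS = {
    "medical": {"symptom", "dose", "diagnosis"},
    "legal": {"contract", "liability", "clause"},
    "general": set(),  # fallback domain when nothing matches
}

def route(query: str) -> str:
    """Pick the domain whose keyword set overlaps the query most."""
    tokens = set(query.lower().split())
    best_domain, best_score = "general", 0
    for domain, keywords in DOMAIN_KEYWORDS.items():
        score = len(tokens & keywords)
        if score > best_score:
            best_domain, best_score = domain, score
    return best_domain

print(route("What dose of ibuprofen is safe?"))  # -> medical
print(route("hello there"))                      # -> general
```

A ColBERT‑style variant would replace the keyword overlap with a learned scoring function over query and document embeddings, but the control flow (score each candidate, pick the best, fall back to a default) stays the same.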

Training Tricks, Synthetic Data, and New Datasets

  • Commenters cite “sacrificial training” and aggressive quantization as evidence that current models are over‑parameterized, and hope for strong 0.1–1B models that are easy to fine‑tune locally.
  • TinyStories is seen as an early, influential synthetic dataset; successors like SimpleStories and small‑LM training toolkits are shared.
  • Comments highlight that LLM‑generated text is structurally easier for LMs to learn; concerns raised that models trained only on synthetic data may be less robust.

Use Cases for Tiny Models

  • Suggested niches: voice/home‑automation commands (“lights on/off”), better phone spell‑checking, small on‑device assistants, IDE completion, interactive toys.
  • Debate over whether LLMs are overkill compared with simple intent/keyword systems, and the importance of reliable “I don’t know → escalate to a bigger model” behavior.