TinyStories: How Small Can Language Models Be and Still Speak Coherent English? (2023)

Tiny Models vs Older Small LMs

  • The thread notes that older ~125M-parameter models (GPT-2 small, GPT-Neo small) were quite weak, whereas newer tiny architectures (e.g., RWKV, SmolLM, and others) are perceived as much better at similar sizes.
  • Some users test RWKV and conclude it’s still frequently incoherent, especially on basic Q&A and consistency; others are impressed by its capabilities for its size.
  • Multiple comments emphasize that sub‑1B models often feel like Markov chains; around 3B parameters is where coherent, controllable behavior and RAG start to work reliably.

Weird Failure Modes & “Attention” Debate

  • Tiny models often produce “fever dream” or very dark, off‑topic stories, including on seemingly benign prompts.
  • Several comments trace this to limited internal state and morbid content in synthetic training data, not to any human‑like psychology.
  • Extended discussion argues that comparing LLM “attention” to ADHD is misleading:
    • ADHD is a complex neuropsychiatric condition, not just “lack of attention.”
    • Transformer attention is a mathematical mechanism; the shared word “attention” is an accident of terminology.
    • Metaphors can help thinking but can also confuse when they suggest incorrect parallels to human disorders.

RAG and Model Size / Architecture

  • Several participants state that only a handful of models, all ≥3B parameters, handle retrieval‑augmented generation (RAG) well.
  • Common failure modes: models ignore instructions, “continue” the retrieved text instead of answering, or get lost in long context.
  • Ideas to “distribute” RAG across many small models:
    • Classifier model routes queries to domain‑specific submodels.
    • Richer indexing and metadata in vector stores to pick the right model per chunk.
    • Scoring‑function approaches (e.g., ColBERT‑style) and MoE‑like designs.
  • Benefits seen mainly in privacy/control, not obviously in raw capability.
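
The routing idea above can be sketched with a trivial keyword classifier. This is a minimal illustration, not anything proposed verbatim in the thread: the domain names, keyword sets, and the `route` function are all hypothetical placeholders standing in for a real classifier model and real domain‑specific submodels.

```python
# Hypothetical sketch: route a query to a domain-specific submodel
# based on keyword overlap. In practice the router would be a small
# trained classifier, not a keyword match.

DOMAIN_KEYWORDS = {
    "medical": {"symptom", "dose", "diagnosis"},
    "legal": {"contract", "liability", "clause"},
    "general": set(),  # fallback domain when nothing matches
}

def route(query: str) -> str:
    """Pick the domain whose keyword set overlaps the query most."""
    tokens = set(query.lower().split())
    best_domain, best_score = "general", 0
    for domain, keywords in DOMAIN_KEYWORDS.items():
        score = len(tokens & keywords)
        if score > best_score:
            best_domain, best_score = domain, score
    return best_domain

print(route("What dose of ibuprofen is safe?"))  # -> medical
print(route("hello there"))                      # -> general
```

A ColBERT‑style variant would replace the keyword overlap with a learned scoring function over query and document embeddings, but the control flow (score each candidate, pick the best, fall back to a default) stays the same.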

Training Tricks, Synthetic Data, and New Datasets

  • Commenters cite “sacrificial training” and aggressive quantization as evidence that current models are over‑parameterized, and hope for strong 0.1–1B models that are easy to fine‑tune locally.
  • TinyStories is seen as an early, influential synthetic dataset; successors like SimpleStories and small‑LM training toolkits are shared.
  • Comments highlight that LLM‑generated text is structurally easier for LMs to learn; concerns raised that models trained only on synthetic data may be less robust.

Use Cases for Tiny Models

  • Suggested niches: voice/home‑automation commands (“lights on/off”), better phone spell‑checking, small on‑device assistants, IDE completion, interactive toys.
  • Debate over whether LLMs are overkill compared with simple intent/keyword systems, and the importance of reliable “I don’t know → escalate to a bigger model” behavior.