Ask HN: How are Markov chains so different from tiny LLMs?
Conceptual Relationship Between Markov Chains and LLMs
- Several commenters note that autoregressive LLMs can be viewed as Markov processes if you treat the full internal state (context window + KV cache) as the “state.”
- Others argue this definition is technically correct but useless: it lumps n‑gram tables, transformers, and even humans into one category, obscuring important structural differences.
- A practical distinction: classic Markov text models = explicit n‑gram lookup tables; LLMs = learned continuous functions that implicitly encode those next‑token distributions.
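To make the lookup-table half of that distinction concrete, a word-level bigram model really is just a dictionary from context to next-word counts. Below is a minimal sketch (the toy corpus and function names are illustrative, not from the thread); an LLM replaces this explicit table with a learned function mapping the whole context to a next-token distribution.

```python
import random
from collections import defaultdict, Counter

def train_bigram(tokens):
    """Explicit lookup table: previous word -> Counter of next words."""
    table = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        table[prev][nxt] += 1
    return table

def sample_next(table, context):
    """Sample the next word in proportion to observed counts."""
    counts = table[context]
    if not counts:                     # dead end: context never seen with a successor
        return None
    words, weights = zip(*counts.items())
    return random.choices(words, weights=weights, k=1)[0]

corpus = "the cat sat on the mat and the dog sat on the rug".split()
table = train_bigram(corpus)
word, out = "the", ["the"]
for _ in range(8):
    word = sample_next(table, word)
    if word is None:
        break
    out.append(word)
print(" ".join(out))
```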
Long-Range Dependencies and State Size
- Core technical gap: finite-order Markov / n‑gram models have exponentially decaying ability to model long-range correlations; language needs very long-range structure.
- Attention in transformers can dynamically focus on any earlier token in the context window, approximating an “infinite-order” model (up to the window length) without enumerating all contexts.
- High‑order Markov models or HMM/Markov random fields could, in principle, match this, but state and transition tables explode combinatorially and are intractable to train at modern scales.
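Rough arithmetic behind the table-explosion point (the vocabulary size and orders below are illustrative, not from the thread):

```python
# Contexts an order-n lookup table must index in principle: |V| ** n.
vocab_size = 50_000  # illustrative, roughly a modern subword vocabulary
for order in (1, 2, 3, 5, 10):
    print(f"order {order:2d}: {vocab_size ** order:.3e} possible contexts")
# order  1: 5.000e+04, order  2: 2.500e+09, order  3: 1.250e+14,
# order  5: 3.125e+23, order 10: 9.766e+46 -- vastly more than can be stored,
# while a transformer's parameter count is fixed no matter how long the context is.
```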
Discrete vs Continuous Representations
- Markov models operate on discrete symbols; unseen n‑grams typically have zero probability unless smoothed or heuristically filled.
- LLMs embed tokens into high‑dimensional vectors; similar meanings cluster, enabling generalization to sequences never seen exactly in training.
- This allows generation of fluent new sequences (e.g., style transfer, recombined concepts) rather than strict regurgitation, though sophisticated Markov systems with smoothing/skip-grams can generate some unseen combinations too.
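A small sketch of the zero-probability problem and one classic fix, add-k smoothing (the thread also mentions skip-grams); the two-sentence corpus is made up for illustration:

```python
from collections import Counter

corpus = "the cat sat . the dog ran .".split()
vocab = sorted(set(corpus))
bigrams = Counter(zip(corpus, corpus[1:]))
unigrams = Counter(corpus)

def prob(prev, nxt, k=0.0):
    """P(nxt | prev) with add-k smoothing; k=0 gives the raw count-based estimate."""
    return (bigrams[(prev, nxt)] + k) / (unigrams[prev] + k * len(vocab))

# "dog sat" never occurs in the corpus:
print(prob("dog", "sat"))         # 0.0 -> impossible under the raw table
print(prob("dog", "sat", k=1.0))  # ~0.14 -> small but nonzero with add-one smoothing
```

An LLM sidesteps the same problem differently: because "dog" and "cat" land near each other in embedding space, the learned function can assign sensible probability to contexts it has never literally seen.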
Creativity, Novelty, and Training Data
- Ongoing disagreement:
- One side: LLMs only sample from the training distribution, so they never create “truly novel” ideas, only recombinations of what they have seen.
- Others counter that humans also remix prior experience; unless brains exceed Turing computability, there is no principled barrier to machines matching human-level creativity.
- Philosophical detours cover intuition, evolution-encoded knowledge, and consciousness; no consensus emerges, but multiple people stress that claims of inherent human–machine gaps lack concrete evidence.
Generalization, Hallucination, and Usefulness
- Commenters note LLMs can both fail on simple in‑distribution tasks (e.g., basic equations) and still solve novel puzzles or riddles, suggesting imperfect but real generalization.
- Markov chains are sometimes incorrectly described as only outputting substrings of the corpus; others correct this, pointing out that smoothing and longer generation lengths already produce output that is “novel” yet incoherent.
- Both Markov models and LLMs can “hallucinate” (produce wrong but fluent text); LLMs do it in a more convincing and thus more problematic way.
- Some highlight niche advantages of engineered Markov models (e.g., tiny corpora, deterministic assistants) and experiments where carefully tuned Markov/PPM models rival or beat tiny LLMs on very small datasets.
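Not the PPM setups commenters actually ran, but a rough sketch of the general idea behind tiny-corpus, variable-order Markov generation: prefer the longest context seen in training and back off to shorter ones when it is unseen (the corpus, function names, and backoff scheme here are illustrative):

```python
import random
from collections import defaultdict, Counter

def train(tokens, max_order=3):
    """Count successors for every context of length 1..max_order."""
    tables = {n: defaultdict(Counter) for n in range(1, max_order + 1)}
    for n in range(1, max_order + 1):
        for i in range(len(tokens) - n):
            ctx = tuple(tokens[i:i + n])
            tables[n][ctx][tokens[i + n]] += 1
    return tables

def generate(tables, seed, length=20, max_order=3):
    """Prefer the longest matching context; back off to shorter ones if unseen."""
    out = list(seed)
    for _ in range(length):
        nxt = None
        for n in range(max_order, 0, -1):
            counts = tables[n].get(tuple(out[-n:]))
            if counts:
                words, weights = zip(*counts.items())
                nxt = random.choices(words, weights=weights, k=1)[0]
                break
        if nxt is None:          # no context of any order matched
            break
        out.append(nxt)
    return " ".join(out)

corpus = ("the quick brown fox jumps over the lazy dog "
          "the quick red fox runs past the sleepy dog").split()
tables = train(corpus)
print(generate(tables, seed=["the", "quick"]))
```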
Hybrids and Alternative Models
- The thread references:
- HMMs, cloned HMMs, Markov random fields, and graph-based variants that can, in theory, match transformer expressivity but are hard to scale.
- Hybrids: Markov chains reranked by BERT, beam search on n‑grams (sketched at the end of this section), huge n‑gram backends (e.g., Google Books n‑grams / infini-gram) combined with neural models.
- Several see transformers as a particularly efficient and scalable implementation of a Markovian next-token process, not something fundamentally outside probabilistic sequence modeling.
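A minimal sketch of the “beam search on n‑grams” hybrid idea referenced above, on a toy corpus; the spot where a neural reranker (e.g., BERT scoring the n-best list) would plug in is marked with a comment rather than implemented.

```python
import math
from collections import defaultdict, Counter

def train_bigram(tokens):
    """Explicit bigram table: previous word -> Counter of next words."""
    table = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        table[prev][nxt] += 1
    return table

def beam_search(table, start, steps=5, beam_width=3):
    """Keep the beam_width highest log-probability continuations at each step."""
    beams = [([start], 0.0)]
    for _ in range(steps):
        candidates = []
        for seq, score in beams:
            counts = table[seq[-1]]
            total = sum(counts.values())
            for word, c in counts.items():
                candidates.append((seq + [word], score + math.log(c / total)))
        if not candidates:
            break
        candidates.sort(key=lambda x: x[1], reverse=True)
        beams = candidates[:beam_width]
    return beams

corpus = ("the cat sat on the mat the cat lay on the rug "
          "the dog sat on the rug").split()
table = train_bigram(corpus)
for seq, score in beam_search(table, "the", steps=4):
    # A hybrid system would rerank this n-best list with a neural scorer
    # (e.g., a masked-LM score) instead of stopping here.
    print(f"{score:7.3f}  {' '.join(seq)}")
```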