Ask HN: How are Markov chains so different from tiny LLMs?
Conceptual Relationship Between Markov Chains and LLMs
- Several commenters note that autoregressive LLMs can be viewed as Markov processes if you treat the full internal state (context window + KV cache) as the “state.”
- Others argue this definition is technically correct but useless: it lumps n‑gram tables, transformers, and even humans into one category, obscuring important structural differences.
- A practical distinction: classic Markov text models = explicit n‑gram lookup tables; LLMs = learned continuous functions that implicitly encode those next‑token distributions.
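To make the lookup-table half of that distinction concrete, a word-level bigram model really is just a dictionary from context to next-word counts. Below is a minimal sketch (the toy corpus and function names are illustrative, not from the thread); an LLM replaces this explicit table with a learned function mapping the whole context to a next-token distribution.

```python
import random
from collections import defaultdict, Counter

def train_bigram(tokens):
    """Explicit lookup table: previous word -> Counter of next words."""
    table = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        table[prev][nxt] += 1
    return table

def sample_next(table, context):
    """Sample the next word in proportion to observed counts."""
    counts = table[context]
    if not counts:                     # dead end: context never seen with a successor
        return None
    words, weights = zip(*counts.items())
    return random.choices(words, weights=weights, k=1)[0]

corpus = "the cat sat on the mat and the dog sat on the rug".split()
table = train_bigram(corpus)
word, out = "the", ["the"]
for _ in range(8):
    word = sample_next(table, word)
    if word is None:
        break
    out.append(word)
print(" ".join(out))
```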
Long-Range Dependencies and State Size
- Core technical gap: finite-order Markov / n‑gram models have exponentially decaying ability to model long-range correlations; language needs very long-range structure.
- Attention in transformers can dynamically focus on any earlier token in the context window, approximating an “infinite-order” model (up to the window length) without enumerating all contexts.
- High‑order Markov models or HMM/Markov random fields could, in principle, match this, but state and transition tables explode combinatorially and are intractable to train at modern scales.
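Rough arithmetic behind the table-explosion point (the vocabulary size and orders below are illustrative, not from the thread):

```python
# Contexts an order-n lookup table must index in principle: |V| ** n.
vocab_size = 50_000  # illustrative, roughly a modern subword vocabulary
for order in (1, 2, 3, 5, 10):
    print(f"order {order:2d}: {vocab_size ** order:.3e} possible contexts")
# order  1: 5.000e+04, order  2: 2.500e+09, order  3: 1.250e+14,
# order  5: 3.125e+23, order 10: 9.766e+46 -- vastly more than can be stored,
# while a transformer's parameter count is fixed no matter how long the context is.
```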
Discrete vs Continuous Representations
- Markov models operate on discrete symbols; unseen n‑grams typically have zero probability unless smoothed or heuristically filled.
- LLMs embed tokens into high‑dimensional vectors; similar meanings cluster, enabling generalization to sequences never seen exactly in training.
- This allows generation of fluent new sequences (e.g., style transfer, recombined concepts) rather than strict regurgitation, though sophisticated Markov systems with smoothing/skip-grams can generate some unseen combinations too.
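A small sketch of the zero-probability problem and one classic fix, add-k smoothing (the thread also mentions skip-grams); the two-sentence corpus is made up for illustration:

```python
from collections import Counter

corpus = "the cat sat . the dog ran .".split()
vocab = sorted(set(corpus))
bigrams = Counter(zip(corpus, corpus[1:]))
unigrams = Counter(corpus)

def prob(prev, nxt, k=0.0):
    """P(nxt | prev) with add-k smoothing; k=0 gives the raw count-based estimate."""
    return (bigrams[(prev, nxt)] + k) / (unigrams[prev] + k * len(vocab))

# "dog sat" never occurs in the corpus:
print(prob("dog", "sat"))         # 0.0 -> impossible under the raw table
print(prob("dog", "sat", k=1.0))  # ~0.14 -> small but nonzero with add-one smoothing
```

An LLM sidesteps the same problem differently: because "dog" and "cat" land near each other in embedding space, the learned function can assign sensible probability to contexts it has never literally seen.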
Creativity, Novelty, and Training Data
- Ongoing disagreement:
- One side: LLMs only sample from the training distribution, so they never create “truly novel” ideas, only recombinations of what they have seen.
- Others counter that humans also remix prior experience; unless brains exceed Turing computability, there is no principled barrier to machines matching human-level creativity.
- Philosophical detours cover intuition, evolution-encoded knowledge, and consciousness; no consensus emerges, but multiple people stress that claims of inherent human–machine gaps lack concrete evidence.
Generalization, Hallucination, and Usefulness
- Commenters note LLMs can both fail on simple in‑distribution tasks (e.g., basic equations) and still solve novel puzzles or riddles, suggesting imperfect but real generalization.
- Markov chains are sometimes incorrectly described as only outputting substrings of the corpus; others correct this, pointing out that smoothing and longer generation lengths already produce output that is “novel” yet incoherent.
- Both Markov models and LLMs can “hallucinate” (produce wrong but fluent text); LLMs do it in a more convincing and thus more problematic way.
- Some highlight niche advantages of engineered Markov models (e.g., tiny corpora, deterministic assistants) and experiments where carefully tuned Markov/PPM models rival or beat tiny LLMs on very small datasets.
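Not the PPM setups commenters actually ran, but a rough sketch of the general idea behind tiny-corpus, variable-order Markov generation: prefer the longest context seen in training and back off to shorter ones when it is unseen (the corpus, function names, and backoff scheme here are illustrative):

```python
import random
from collections import defaultdict, Counter

def train(tokens, max_order=3):
    """Count successors for every context of length 1..max_order."""
    tables = {n: defaultdict(Counter) for n in range(1, max_order + 1)}
    for n in range(1, max_order + 1):
        for i in range(len(tokens) - n):
            ctx = tuple(tokens[i:i + n])
            tables[n][ctx][tokens[i + n]] += 1
    return tables

def generate(tables, seed, length=20, max_order=3):
    """Prefer the longest matching context; back off to shorter ones if unseen."""
    out = list(seed)
    for _ in range(length):
        nxt = None
        for n in range(max_order, 0, -1):
            counts = tables[n].get(tuple(out[-n:]))
            if counts:
                words, weights = zip(*counts.items())
                nxt = random.choices(words, weights=weights, k=1)[0]
                break
        if nxt is None:          # no context of any order matched
            break
        out.append(nxt)
    return " ".join(out)

corpus = ("the quick brown fox jumps over the lazy dog "
          "the quick red fox runs past the sleepy dog").split()
tables = train(corpus)
print(generate(tables, seed=["the", "quick"]))
```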
Hybrids and Alternative Models
- The thread references:
- HMMs, cloned HMMs, Markov random fields, and graph-based variants that can, in theory, match transformer expressivity but are hard to scale.
- Hybrids: Markov chains reranked by BERT, beam search on n‑grams (sketched at the end of this section), huge n‑gram backends (e.g., Google Books n‑grams / infini-gram) combined with neural models.
- Several see transformers as a particularly efficient and scalable implementation of a Markovian next-token process, not something fundamentally outside probabilistic sequence modeling.
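A minimal sketch of the “beam search on n‑grams” hybrid idea referenced above, on a toy corpus; the spot where a neural reranker (e.g., BERT scoring the n-best list) would plug in is marked with a comment rather than implemented.

```python
import math
from collections import defaultdict, Counter

def train_bigram(tokens):
    """Explicit bigram table: previous word -> Counter of next words."""
    table = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        table[prev][nxt] += 1
    return table

def beam_search(table, start, steps=5, beam_width=3):
    """Keep the beam_width highest log-probability continuations at each step."""
    beams = [([start], 0.0)]
    for _ in range(steps):
        candidates = []
        for seq, score in beams:
            counts = table[seq[-1]]
            total = sum(counts.values())
            for word, c in counts.items():
                candidates.append((seq + [word], score + math.log(c / total)))
        if not candidates:
            break
        candidates.sort(key=lambda x: x[1], reverse=True)
        beams = candidates[:beam_width]
    return beams

corpus = ("the cat sat on the mat the cat lay on the rug "
          "the dog sat on the rug").split()
table = train_bigram(corpus)
for seq, score in beam_search(table, "the", steps=4):
    # A hybrid system would rerank this n-best list with a neural scorer
    # (e.g., a masked-LM score) instead of stopping here.
    print(f"{score:7.3f}  {' '.join(seq)}")
```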