Ask HN: How are Markov chains so different from tiny LLMs?

Conceptual Relationship Between Markov Chains and LLMs

  • Several commenters note that autoregressive LLMs can be viewed as Markov processes if you treat the full internal state (context window + KV cache) as the “state.”
  • Others argue this definition is technically correct but useless: it lumps n‑gram tables, transformers, and even humans into one category, obscuring important structural differences.
  • A practical distinction: classic Markov text models = explicit n‑gram lookup tables; LLMs = learned continuous functions that implicitly encode those next‑token distributions.
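
A minimal sketch of that practical distinction on a toy corpus (everything below is illustrative, not from the thread): a classic Markov text model literally stores a lookup table of observed continuations, while an LLM replaces the table with a learned function from context to a next-token distribution.

```python
from collections import Counter, defaultdict
import random

# Toy corpus; a real Markov text bot would be trained on far more data.
tokens = "the cat sat on the mat and the dog sat on the rug".split()

# Classic Markov / bigram approach: an explicit lookup table of continuations.
table = defaultdict(Counter)
for prev, nxt in zip(tokens, tokens[1:]):
    table[prev][nxt] += 1

def next_token(prev):
    """Sample the next token from the stored counts for this exact context."""
    words, counts = zip(*table[prev].items())
    return random.choices(words, weights=counts)[0]

print(next_token("the"))  # e.g. "cat", "mat", "dog", or "rug"

# An LLM plays the same role, but the "table" is implicit: a learned function
# f(context) -> distribution over the whole vocabulary, shared across contexts.
```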

Long-Range Dependencies and State Size

  • Core technical gap: a finite-order Markov / n‑gram model can only capture correlations that decay exponentially with distance, while language needs very long-range structure.
  • Attention in transformers can dynamically focus on arbitrary past tokens, approximating an “infinite-order” model without enumerating all contexts.
  • High‑order Markov models or HMM/Markov random fields could, in principle, match this, but state and transition tables explode combinatorially and are intractable to train at modern scales.
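
To make that combinatorial explosion concrete, a back-of-the-envelope calculation (the vocabulary size and orders below are illustrative assumptions, not figures from the thread): the number of contexts an explicit table would have to enumerate grows as vocab_size^(order−1).

```python
# Rough size of an explicit n-gram transition table: one row per context,
# one entry per possible next token. Figures are illustrative.
vocab_size = 50_000  # typical subword vocabulary (assumption)

for order in (2, 3, 5):
    contexts = vocab_size ** (order - 1)
    transitions = contexts * vocab_size
    print(f"order-{order}: ~{contexts:.1e} contexts, ~{transitions:.1e} transitions")
```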

Discrete vs Continuous Representations

  • Markov models operate on discrete symbols; unseen n‑grams typically have zero probability unless smoothed or heuristically filled.
  • LLMs embed tokens into high‑dimensional vectors; similar meanings cluster, enabling generalization to sequences never seen exactly in training.
  • This allows generation of fluent new sequences (e.g., style transfer, recombined concepts) rather than strict regurgitation, though sophisticated Markov systems with smoothing/skip-grams can generate some unseen combinations too.
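
A minimal add-k (Laplace-style) smoothing sketch on a toy corpus, showing why an unseen bigram only escapes zero probability through an explicit smoothing rule; the corpus and constant are illustrative.

```python
from collections import Counter, defaultdict

corpus = "the cat sat on the mat . the dog sat on the rug .".split()
vocab = sorted(set(corpus))

counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    counts[prev][nxt] += 1

def prob(prev, nxt, k=1.0):
    """Add-k smoothed bigram probability: unseen pairs get small but nonzero mass."""
    return (counts[prev][nxt] + k) / (sum(counts[prev].values()) + k * len(vocab))

print(prob("cat", "sat"))  # seen bigram: relatively high probability
print(prob("dog", "mat"))  # unseen bigram: nonzero only because of the +k term
```

An LLM sidesteps the zero-count problem differently: because "cat" and "dog" land near each other in embedding space, the learned function transfers probability mass to contexts it never saw verbatim.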

Creativity, Novelty, and Training Data

  • Ongoing disagreement:
    • One side: LLMs only sample from their training distribution, so they never create “truly novel” ideas, only recombinations of what they have seen.
    • Others counter that humans likewise remix prior experience; unless brains exceed Turing computability, there is no principled bar to machines matching human-level creativity.
  • Philosophical detours cover intuition, evolution-encoded knowledge, and consciousness; no consensus emerges, but multiple people stress that claims of inherent human–machine gaps lack concrete evidence.

Generalization, Hallucination, and Usefulness

  • Commenters note LLMs can both fail on simple in‑distribution tasks (e.g., basic equations) and still solve novel puzzles or riddles, suggesting imperfect but real generalization.
  • Markov chains are sometimes incorrectly described as only outputting substrings of the corpus; others correct this, pointing out that smoothing, and generations longer than the model’s order, already produce “novel” yet often incoherent text.
  • Both Markov models and LLMs can “hallucinate” (produce wrong but fluent text); LLMs do it in a more convincing and thus more problematic way.
  • Some highlight niche advantages of engineered Markov models (e.g., tiny corpora, deterministic assistants) and experiments where carefully tuned Markov/PPM models rival or beat tiny LLMs on very small datasets.
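
As a rough illustration of the tiny-corpus niche, here is a toy character-level backoff generator; it is only loosely PPM-flavored (real PPM uses explicit escape probabilities), and the corpus, order, and seed are all illustrative assumptions.

```python
from collections import Counter, defaultdict
import random

def train(text, max_order=3):
    """Count character continuations for every context length from 0 to max_order."""
    tables = [defaultdict(Counter) for _ in range(max_order + 1)]
    for i in range(len(text)):
        for k in range(max_order + 1):
            if i - k >= 0:
                tables[k][text[i - k:i]][text[i]] += 1
    return tables

def generate(tables, seed, length=80, max_order=3):
    """Back off to shorter contexts when the longest one was never seen
    (a crude stand-in for PPM's escape mechanism)."""
    out = seed
    for _ in range(length):
        for k in range(max_order, -1, -1):
            ctx = out[-k:] if k else ""
            dist = tables[k].get(ctx)
            if dist:
                chars, weights = zip(*dist.items())
                out += random.choices(chars, weights=weights)[0]
                break
    return out

corpus = "the quick brown fox jumps over the lazy dog. the lazy dog sleeps."
print(generate(train(corpus), seed="the "))
```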

Hybrids and Alternative Models

  • The thread references:
    • HMMs, cloned HMMs, Markov random fields, and graph-based variants that can, in theory, match transformer expressivity but are hard to scale.
    • Hybrids: Markov chains reranked by BERT, beam search on n‑grams, huge n‑gram backends (e.g., Google Books n‑grams / infini-gram) combined with neural models.
  • Several see transformers as a particularly efficient and scalable implementation of a Markovian next-token process, not something fundamentally outside probabilistic sequence modeling.
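
A sketch of the generate-then-rerank hybrid pattern mentioned above: a bigram table proposes candidate continuations cheaply, and a stronger model picks the most fluent one. The corpus is a toy example, and `neural_score` is a hypothetical placeholder for a real reranker (e.g., average log-likelihood under BERT or a small GPT).

```python
from collections import Counter, defaultdict
import random

def bigram_table(tokens):
    table = defaultdict(Counter)
    for a, b in zip(tokens, tokens[1:]):
        table[a][b] += 1
    return table

def sample_candidates(table, start, length=8, n=20):
    """Propose n continuations via cheap random walks over the bigram table."""
    candidates = []
    for _ in range(n):
        seq = [start]
        for _ in range(length):
            dist = table.get(seq[-1])
            if not dist:
                break
            words, weights = zip(*dist.items())
            seq.append(random.choices(words, weights=weights)[0])
        candidates.append(" ".join(seq))
    return candidates

def neural_score(text):
    """Hypothetical placeholder for a neural reranker; in a real hybrid this would
    be, e.g., the average log-likelihood of `text` under a pretrained LM."""
    return len(set(text.split())) + random.random()  # toy heuristic, not a real model

tokens = "the cat sat on the mat and the dog slept on the rug".split()
best = max(sample_candidates(bigram_table(tokens), "the"), key=neural_score)
print(best)
```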