The bitter lesson is coming for tokenization

Expressivity and Theoretical Bottlenecks

  • OP claims: with a vocabulary of ~15k tokens and 1k-dimensional embeddings, every next-token logit vector is a linear function of a 1k-dimensional hidden state, so the achievable next-token distributions form (at most) a 1k-dimensional family, constraining which probability distributions are representable (a numerical sketch follows this list).
  • Replies note high-dimensional geometry: exponentially many almost-orthogonal vectors can exist, so practical expressivity is much larger than intuition suggests, though not enough to represent arbitrary distributions.
  • Some argue nonlinearity and deep networks break the simple linear “1k degrees of freedom” story; others point to work on “unargmaxable” outputs in bottlenecked networks as real but rare edge cases.
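
A minimal numerical sketch of the rank claim, using toy sizes (the argument is unchanged at the quoted ~15k vocabulary and 1k dimensions); the random matrices here stand in for a trained unembedding and for hidden states from arbitrary contexts:

```python
import numpy as np

# Softmax-bottleneck sketch: with vocab size V and hidden size d < V,
# every achievable logit vector is W @ h for some hidden state h, i.e. it
# lies in the d-dimensional column space of the output projection W,
# no matter what the earlier (nonlinear, deep) layers compute.
rng = np.random.default_rng(0)
V, d = 1_500, 100                      # toy stand-ins for ~15k vocab, 1k dims

W = rng.standard_normal((V, d))        # output (unembedding) matrix
H = rng.standard_normal((d, 500))      # hidden states from 500 arbitrary contexts

logits = (W @ H).T                     # shape (500, V): one logit row per context
print(np.linalg.matrix_rank(logits))   # 100 (= d), never anywhere near V
```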

Tokenization, Characters, and the “Strawberry r’s” Meme

  • Several comments explain that subword tokenization hides character structure: “strawberry” arrives as a few opaque tokens, so a model must effectively memorize the letter composition of each token in order to count letters (see the sketch after this list).
  • Evidence from in-review work: counting accuracy declines as the target character is buried inside multi-character tokens.
  • Others are skeptical, arguing:
    • We lack clear demonstrations that character-level models can reliably “count Rs”.
    • RLHF and training on many counting prompts suggest the limitation is not purely tokenization.
  • There’s recognition that models don’t “see” characters; they see embeddings, and any character-level reasoning is an extra learned indirection.
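
One way to see that indirection is to inspect a tokenizer directly; the sketch below uses the tiktoken library with the cl100k_base encoding as an example, and the exact split is a property of the chosen tokenizer, not of any particular model:

```python
import tiktoken  # OpenAI's open-source BPE tokenizer library

# The model receives opaque integer IDs, not characters; counting letters
# requires knowing (or having memorized) the spelling behind each ID.
enc = tiktoken.get_encoding("cl100k_base")

word = "strawberry"
ids = enc.encode(word)
pieces = [enc.decode_single_token_bytes(t).decode("utf-8") for t in ids]

print(ids)     # a handful of integer IDs
print(pieces)  # the subword pieces those IDs stand for
print(sum(p.count("r") for p in pieces))  # 3 -- but only because *we* see the text
```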

Math, Logic, and Number Tokenization

  • Several posts claim logical/mathematical failures are strongly tied to tokenization, especially how numbers are split.
  • Cited work shows large gains when numbers are tokenized right-to-left in fixed 3-digit groups (e.g., 1234567 → 1 | 234 | 567; see the sketch after this list) and when all the short digit groups are themselves in the vocabulary.
  • Other research: treating numbers as special tokens with attached numeric values so arithmetic is done on real numbers rather than digit strings.
  • Some argue LLMs are the wrong tool for exact arithmetic; a better division of labor is for the LLM to pick the right formula and delegate the computation to a calculator engine.
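
A small sketch of the right-to-left, fixed-width grouping described above; the chunking function is illustrative, and the cited work's actual tokenizer details may differ:

```python
def chunk_digits_r2l(number_str: str, group: int = 3) -> list[str]:
    """Split a digit string into fixed-size groups from the right, so the
    same chunk always covers the same place values (ones, thousands, ...)."""
    chunks = []
    while number_str:
        chunks.append(number_str[-group:])
        number_str = number_str[:-group]
    return list(reversed(chunks))

print(chunk_digits_r2l("1234567"))  # ['1', '234', '567']
print(chunk_digits_r2l("987654"))   # ['987', '654']
print(chunk_digits_r2l("42"))       # ['42']
# Left-to-right grouping would give ['123', '456', '7'] for the first case,
# so identical chunks would mean different magnitudes in different numbers.
```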

Bytes, UTF-8, and Raw Representations

  • “Bytes is tokenization”: using raw bytes (often via UTF-8) is seen by some as the ultimate generic scheme, since a 256-symbol alphabet has no out-of-vocabulary issues (see the sketch after this list).
  • Counterpoint: UTF-8 is itself a biased, human-designed encoding of Unicode; models are not guaranteed to emit valid UTF-8, and rare codepoints can be badly undertrained.
  • New encoding schemes are being explored to better match modeling needs and reduce “glitch tokens”.
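
A quick illustration of both points: a 256-value byte alphabet has no out-of-vocabulary problem, but how many bytes a character takes is a design decision, and an arbitrary byte sequence need not decode at all:

```python
# With raw UTF-8 bytes, every string maps to values in 0..255, so nothing is
# out-of-vocabulary -- but how many bytes a character takes is a design choice.
text = "naïve 数学"
data = text.encode("utf-8")
print(len(text), len(data))   # 8 characters -> 13 bytes (ï is 2 bytes, CJK are 3)
print(data.decode("utf-8"))   # round-trips fine

# A freely generated byte stream is not guaranteed to be valid UTF-8:
try:
    bytes([0xC3, 0x28]).decode("utf-8")  # 0xC3 opens a 2-byte sequence; 0x28 can't continue it
except UnicodeDecodeError as err:
    print("invalid UTF-8:", err)
```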

Bitter Lesson, Compute vs Clever Tricks

  • Debate centers on whether tokenization is the next domain where the Bitter Lesson (general methods + compute beat handcrafted structure) will apply.
  • Some say it already did: simple statistically learned subword tokenizers (e.g., BPE) outperformed linguistically sophisticated, morphology-based approaches (see the sketch after this list).
  • Others highlight counterexamples where architectural tweaks to tokenization (e.g., special indentation tokens in Python, better numeric chunking) give large, practical improvements—evidence that cleverness still matters.
  • There’s concern that over-relying on “just scale compute” can obscure simpler, more principled solutions and slow genuine understanding.
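
For concreteness, a minimal sketch of the kind of statistically learned merge procedure (BPE-style) that comment refers to; production tokenizers add many details omitted here (byte fallback, pre-tokenization rules, special tokens):

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across the corpus, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get) if pairs else None

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = merged.get(tuple(out), 0) + freq
    return merged

# Toy corpus: word -> frequency, each word starting as a tuple of characters.
corpus = {tuple("lower"): 5, tuple("lowest"): 2, tuple("newer"): 6, tuple("wider"): 3}
for step in range(6):
    pair = most_frequent_pair(corpus)
    if pair is None:
        break
    corpus = merge_pair(corpus, pair)
    print(step, pair)  # merges like ('e', 'r') emerge purely from corpus statistics
```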

Costs, Scaling, and Energy

  • A claim that training frontier models costs “around median country GDP” is challenged with data: estimated training-compute costs for GPT‑4 or Gemini Ultra are in the tens to low hundreds of millions of dollars, far below a median national GDP of roughly $40–50B.
  • People discuss GDP measures (PPP vs nominal) and note training cost estimates are rough and incomplete (hardware, engineering, data, etc.).
  • Another angle compares the human brain’s energy budget (on the order of a few fast-food meals per day) with the enormous energy use of current AI systems, suggesting large headroom for efficiency improvements.

Determinism, Capability, and AGI Limits

  • Clarification: for fixed weights and input, the model’s forward pass is a deterministic function; observed nondeterminism comes from sampling, floating-point/numerical effects, and changing deployments (see the sketch after this list).
  • Some argue DAG-like, immutable-at-runtime transformers can never reach AGI; others counter that with sufficiently long context and high throughput, such models could be effectively general, and that “immutability” is a modeling convenience, not a hard theoretical limit.
  • Theory papers showing transformers can simulate universal algorithms are cited; critics note these are existence proofs, not guarantees that gradient-based training will find such solutions.
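
A toy illustration of that clarification: fixed weights and a fixed input yield fixed logits, and variation enters only at the sampling step (real deployments add floating-point and infrastructure effects on top):

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

# The "model" part: the same input always produces the same logits,
# so greedy (argmax) decoding is fully deterministic.
logits = np.array([2.0, 1.5, 0.3, -1.0])
print(int(np.argmax(logits)))  # always token 0

# Nondeterminism enters at sampling time.
probs = softmax(logits / 0.8)                  # temperature 0.8
rng = np.random.default_rng()                  # unseeded: varies run to run
print([int(rng.choice(len(probs), p=probs)) for _ in range(5)])

rng = np.random.default_rng(42)                # seeded: reproducible again
print([int(rng.choice(len(probs), p=probs)) for _ in range(5)])
```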

Future Directions: Learned or Mixed Tokenizations

  • Multiple commenters imagine mixtures of tokenizations:
    • A learned module that dynamically chooses token boundaries (e.g., a small transformer predicting token endpoints) so models can “skim” unimportant text and compress context; a rough sketch follows this list.
    • Mixture-of-experts where each expert has its own domain-specific tokenization.
  • Character-level and byte-level models (e.g., the Byte Latent Transformer) are seen as moves toward end-to-end learned representations, but questions remain about efficiency and about performance on math and reasoning.
  • Overall sentiment: tokenization is likely suboptimal today; compute scaling will help, but domain-aware or learned tokenization will probably deliver important gains before “just bytes + huge models” fully wins.
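
A rough sketch of the learned-boundary idea from the first sub-bullet above: score each byte position for “end a token here?”, then pool the bytes of each patch into one vector. The byte embeddings, sigmoid scorer, threshold, and mean-pooling are all illustrative placeholders with random weights, not any published architecture:

```python
import numpy as np

rng = np.random.default_rng(0)
BYTE_EMB = rng.standard_normal((256, 32))   # per-byte embeddings (placeholder weights)
W_BOUNDARY = rng.standard_normal(32)        # boundary scorer (placeholder weights)

def segment_and_pool(text: str, threshold: float = 0.5) -> np.ndarray:
    """Predict a boundary score per byte, close a patch wherever the score
    exceeds the threshold, and mean-pool each patch into a single vector.
    In a trained system the scorer would be learned end to end."""
    byte_ids = list(text.encode("utf-8"))
    embs = BYTE_EMB[byte_ids]                            # (n_bytes, 32)
    scores = 1.0 / (1.0 + np.exp(-(embs @ W_BOUNDARY)))  # sigmoid boundary scores
    patches, current = [], []
    for vec, score in zip(embs, scores):
        current.append(vec)
        if score > threshold:                # boundary predicted: close the patch
            patches.append(np.mean(current, axis=0))
            current = []
    if current:                              # flush any trailing partial patch
        patches.append(np.mean(current, axis=0))
    return np.stack(patches)

patches = segment_and_pool("tokenization is probably suboptimal today")
print(patches.shape)  # (n_patches, 32): fewer vectors than input bytes
```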