The bitter lesson is coming for tokenization

Expressivity and Theoretical Bottlenecks

  • OP claims: with a vocabulary of ~15k tokens and 1k-dimensional embeddings, every next-token logit vector is a linear function of a 1k-dimensional hidden state, so the achievable next-token distributions form (at most) a 1k-dimensional family, constraining which probability distributions are representable (a numerical sketch follows this list).
  • Replies note high-dimensional geometry: exponentially many almost-orthogonal vectors can exist, so practical expressivity is much larger than intuition suggests, though not enough to represent arbitrary distributions.
  • Some argue nonlinearity and deep networks break the simple linear “1k degrees of freedom” story; others point to work on “unargmaxable” outputs in bottlenecked networks as real but rare edge cases.
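
A minimal numerical sketch of the rank claim, using toy sizes (the argument is unchanged at the quoted ~15k vocabulary and 1k dimensions); the random matrices here stand in for a trained unembedding and for hidden states from arbitrary contexts:

```python
import numpy as np

# Softmax-bottleneck sketch: with vocab size V and hidden size d < V,
# every achievable logit vector is W @ h for some hidden state h, i.e. it
# lies in the d-dimensional column space of the output projection W,
# no matter what the earlier (nonlinear, deep) layers compute.
rng = np.random.default_rng(0)
V, d = 1_500, 100                      # toy stand-ins for ~15k vocab, 1k dims

W = rng.standard_normal((V, d))        # output (unembedding) matrix
H = rng.standard_normal((d, 500))      # hidden states from 500 arbitrary contexts

logits = (W @ H).T                     # shape (500, V): one logit row per context
print(np.linalg.matrix_rank(logits))   # 100 (= d), never anywhere near V
```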

Tokenization, Characters, and the “Strawberry r’s” Meme

  • Several comments explain that subword tokenization hides character structure: “strawberry” arrives as a few opaque tokens, so a model must effectively memorize the letter composition of each token in order to count letters (see the sketch after this list).
  • Evidence from in-review work: counting accuracy declines as the target character is buried inside multi-character tokens.
  • Others are skeptical, arguing:
    • We lack clear demonstrations that character-level models can reliably “count Rs”.
    • RLHF and training on many counting prompts suggest the limitation is not purely tokenization.
  • There’s recognition that models don’t “see” characters; they see embeddings, and any character-level reasoning is an extra learned indirection.
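
One way to see that indirection is to inspect a tokenizer directly; the sketch below uses the tiktoken library with the cl100k_base encoding as an example, and the exact split is a property of the chosen tokenizer, not of any particular model:

```python
import tiktoken  # OpenAI's open-source BPE tokenizer library

# The model receives opaque integer IDs, not characters; counting letters
# requires knowing (or having memorized) the spelling behind each ID.
enc = tiktoken.get_encoding("cl100k_base")

word = "strawberry"
ids = enc.encode(word)
pieces = [enc.decode_single_token_bytes(t).decode("utf-8") for t in ids]

print(ids)     # a handful of integer IDs
print(pieces)  # the subword pieces those IDs stand for
print(sum(p.count("r") for p in pieces))  # 3 -- but only because *we* see the text
```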

Math, Logic, and Number Tokenization

  • Several posts claim logical/mathematical failures are strongly tied to tokenization, especially how numbers are split.
  • Cited work shows large gains when numbers are tokenized right-to-left in fixed 3-digit groups (e.g., 1234567 → 1 | 234 | 567; see the sketch after this list) and when all the short digit groups are themselves in the vocabulary.
  • Other research: treating numbers as special tokens with attached numeric values so arithmetic is done on real numbers rather than digit strings.
  • Some argue LLMs are the wrong tool for exact arithmetic; a better division of labor is for the LLM to pick the right formula and delegate the computation to a calculator engine.
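
A small sketch of the right-to-left, fixed-width grouping described above; the chunking function is illustrative, and the cited work's actual tokenizer details may differ:

```python
def chunk_digits_r2l(number_str: str, group: int = 3) -> list[str]:
    """Split a digit string into fixed-size groups from the right, so the
    same chunk always covers the same place values (ones, thousands, ...)."""
    chunks = []
    while number_str:
        chunks.append(number_str[-group:])
        number_str = number_str[:-group]
    return list(reversed(chunks))

print(chunk_digits_r2l("1234567"))  # ['1', '234', '567']
print(chunk_digits_r2l("987654"))   # ['987', '654']
print(chunk_digits_r2l("42"))       # ['42']
# Left-to-right grouping would give ['123', '456', '7'] for the first case,
# so identical chunks would mean different magnitudes in different numbers.
```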

Bytes, UTF-8, and Raw Representations

  • “Bytes is tokenization”: using raw bytes (often via UTF-8) is seen by some as the ultimate generic scheme, since a 256-symbol alphabet has no out-of-vocabulary issues (see the sketch after this list).
  • Counterpoint: UTF-8 is itself a biased, human-designed encoding of Unicode; models are not guaranteed to emit valid UTF-8, and rare codepoints can be badly undertrained.
  • New encoding schemes are being explored to better match modeling needs and reduce “glitch tokens”.
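
A quick illustration of both points: a 256-value byte alphabet has no out-of-vocabulary problem, but how many bytes a character takes is a design decision, and an arbitrary byte sequence need not decode at all:

```python
# With raw UTF-8 bytes, every string maps to values in 0..255, so nothing is
# out-of-vocabulary -- but how many bytes a character takes is a design choice.
text = "naïve 数学"
data = text.encode("utf-8")
print(len(text), len(data))   # 8 characters -> 13 bytes (ï is 2 bytes, CJK are 3)
print(data.decode("utf-8"))   # round-trips fine

# A freely generated byte stream is not guaranteed to be valid UTF-8:
try:
    bytes([0xC3, 0x28]).decode("utf-8")  # 0xC3 opens a 2-byte sequence; 0x28 can't continue it
except UnicodeDecodeError as err:
    print("invalid UTF-8:", err)
```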

Bitter Lesson, Compute vs Clever Tricks

  • Debate centers on whether tokenization is the next domain where the Bitter Lesson (general methods + compute beat handcrafted structure) will apply.
  • Some say it already did: simple statistically learned subword tokenizers (e.g., BPE) outperformed linguistically sophisticated, morphology-based approaches (see the sketch after this list).
  • Others highlight counterexamples where architectural tweaks to tokenization (e.g., special indentation tokens in Python, better numeric chunking) give large, practical improvements—evidence that cleverness still matters.
  • There’s concern that over-relying on “just scale compute” can obscure simpler, more principled solutions and slow genuine understanding.
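
For concreteness, a minimal sketch of the kind of statistically learned merge procedure (BPE-style) that comment refers to; production tokenizers add many details omitted here (byte fallback, pre-tokenization rules, special tokens):

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across the corpus, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get) if pairs else None

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = merged.get(tuple(out), 0) + freq
    return merged

# Toy corpus: word -> frequency, each word starting as a tuple of characters.
corpus = {tuple("lower"): 5, tuple("lowest"): 2, tuple("newer"): 6, tuple("wider"): 3}
for step in range(6):
    pair = most_frequent_pair(corpus)
    if pair is None:
        break
    corpus = merge_pair(corpus, pair)
    print(step, pair)  # merges like ('e', 'r') emerge purely from corpus statistics
```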

Costs, Scaling, and Energy

  • A claim that training frontier models costs “around median country GDP” is challenged with data: estimated training-compute costs for GPT‑4 or Gemini Ultra are in the tens to low hundreds of millions of dollars, far below a median national GDP of roughly $40–50B.
  • People discuss GDP measures (PPP vs nominal) and note training cost estimates are rough and incomplete (hardware, engineering, data, etc.).
  • Another angle compares the human brain’s energy budget (on the order of a few fast-food meals per day) with the enormous energy use of current AI systems, suggesting large headroom for efficiency improvements.

Determinism, Capability, and AGI Limits

  • Clarification: for fixed weights and input, the model’s forward pass is a deterministic function; observed nondeterminism comes from sampling, floating-point/numerical effects, and changing deployments (see the sketch after this list).
  • Some argue DAG-like, immutable-at-runtime transformers can never reach AGI; others counter that with sufficiently long context and high throughput, such models could be effectively general, and that “immutability” is a modeling convenience, not a hard theoretical limit.
  • Theory papers showing transformers can simulate universal algorithms are cited; critics note these are existence proofs, not guarantees that gradient-based training will find such solutions.
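
A toy illustration of that clarification: fixed weights and a fixed input yield fixed logits, and variation enters only at the sampling step (real deployments add floating-point and infrastructure effects on top):

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

# The "model" part: the same input always produces the same logits,
# so greedy (argmax) decoding is fully deterministic.
logits = np.array([2.0, 1.5, 0.3, -1.0])
print(int(np.argmax(logits)))  # always token 0

# Nondeterminism enters at sampling time.
probs = softmax(logits / 0.8)                  # temperature 0.8
rng = np.random.default_rng()                  # unseeded: varies run to run
print([int(rng.choice(len(probs), p=probs)) for _ in range(5)])

rng = np.random.default_rng(42)                # seeded: reproducible again
print([int(rng.choice(len(probs), p=probs)) for _ in range(5)])
```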

Future Directions: Learned or Mixed Tokenizations

  • Multiple commenters imagine mixtures of tokenizations:
    • A learned module that dynamically chooses token boundaries (e.g., a small transformer predicting token endpoints) so models can “skim” unimportant text and compress context; a rough sketch follows this list.
    • Mixture-of-experts where each expert has its own domain-specific tokenization.
  • Character-level and byte-level models (e.g., the Byte Latent Transformer) are seen as moves toward end-to-end learned representations, but questions remain about efficiency and about performance on math and reasoning.
  • Overall sentiment: tokenization is likely suboptimal today; compute scaling will help, but domain-aware or learned tokenization will probably deliver important gains before “just bytes + huge models” fully wins.
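
A rough sketch of the learned-boundary idea from the first sub-bullet above: score each byte position for “end a token here?”, then pool the bytes of each patch into one vector. The byte embeddings, sigmoid scorer, threshold, and mean-pooling are all illustrative placeholders with random weights, not any published architecture:

```python
import numpy as np

rng = np.random.default_rng(0)
BYTE_EMB = rng.standard_normal((256, 32))   # per-byte embeddings (placeholder weights)
W_BOUNDARY = rng.standard_normal(32)        # boundary scorer (placeholder weights)

def segment_and_pool(text: str, threshold: float = 0.5) -> np.ndarray:
    """Predict a boundary score per byte, close a patch wherever the score
    exceeds the threshold, and mean-pool each patch into a single vector.
    In a trained system the scorer would be learned end to end."""
    byte_ids = list(text.encode("utf-8"))
    embs = BYTE_EMB[byte_ids]                            # (n_bytes, 32)
    scores = 1.0 / (1.0 + np.exp(-(embs @ W_BOUNDARY)))  # sigmoid boundary scores
    patches, current = [], []
    for vec, score in zip(embs, scores):
        current.append(vec)
        if score > threshold:                # boundary predicted: close the patch
            patches.append(np.mean(current, axis=0))
            current = []
    if current:                              # flush any trailing partial patch
        patches.append(np.mean(current, axis=0))
    return np.stack(patches)

patches = segment_and_pool("tokenization is probably suboptimal today")
print(patches.shape)  # (n_patches, 32): fewer vectors than input bytes
```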