2024-07-06

Tokens are a big reason today's generative AI falls short

Tokenization as a limitation (or not)

Some argue chatbots should expose their tokenization (e.g., show how a phrase is split and which token IDs result) to make behavior more debuggable.
Others say tokenization is a red herring: models could operate on bytes or characters and would still struggle with reasoning; tokens are just a compression/efficiency trick.
One view: tokens are “bridge objects” between text and model internals, so user-accessible insight into them would help diagnose odd behavior.
Another view: blaming tokens is like blaming binary for all computer shortcomings.
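To make the "expose the tokenization" idea concrete, here is a minimal sketch of what such a debug view might show: the same phrase as subword pieces with IDs, and as raw UTF-8 bytes. The vocabulary `TOY_VOCAB` and the greedy longest-match rule are assumptions for illustration only; they are not taken from any real model's tokenizer.

```python
# Hypothetical toy vocabulary; real tokenizers have tens of thousands of entries.
TOY_VOCAB = {"token": 0, "ization": 1, "iz": 2, "ation": 3,
             " ": 4, "un": 5, "believ": 6, "able": 7}


def greedy_split(text, vocab):
    """Split text by greedy longest match against vocab.

    Characters not covered by the vocabulary become single-character
    pieces with ID -1 (an assumed fallback, for illustration).
    """
    pieces = []
    i = 0
    while i < len(text):
        match = None
        # Try the longest candidate substring first.
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                match = text[i:j]
                break
        if match is None:
            match = text[i]
        pieces.append((match, vocab.get(match, -1)))
        i += len(match)
    return pieces


def byte_view(text):
    """Return the UTF-8 byte values a byte-level model would see."""
    return list(text.encode("utf-8"))


phrase = "tokenization"
print(greedy_split(phrase, TOY_VOCAB))  # pieces with their toy IDs
print(byte_view(phrase))                # one integer per byte
```

The subword view is shorter (two pieces instead of twelve bytes), which is the compression/efficiency point; the byte view has no vocabulary to get wrong, which is the point made by those who call tokenization a red herring.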

Arithmetic, logic, and reliability

Several commenters report modern models adding numbers and solving simple linear systems correctly, even from images.
Others present counterexamples: failures on sorting tasks, isotope half-life ordering, and a basic linear system that one model incorrectly called inconsistent.
Strong claim by some: “LLMs still can’t do arithmetic reliably,” reflecting broader skepticism about their reasoning.
Counterclaim: transformers can do arithmetic in principle; failures are largely data/training issues, not fundamental limits.
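Claims like "the model called a basic linear system inconsistent" are easy to check mechanically. The sketch below classifies a 2x2 system with exact rational arithmetic. The function name `classify_2x2` and the example systems are hypothetical; the thread's actual system is not given in the summary.

```python
from fractions import Fraction


def classify_2x2(a1, b1, c1, a2, b2, c2):
    """Classify the system a1*x + b1*y = c1, a2*x + b2*y = c2.

    Returns ("unique", (x, y)), ("infinite", None) or ("inconsistent", None).
    """
    a1, b1, c1, a2, b2, c2 = map(Fraction, (a1, b1, c1, a2, b2, c2))
    det = a1 * b2 - a2 * b1
    if det != 0:
        # Cramer's rule gives the single solution.
        x = (c1 * b2 - c2 * b1) / det
        y = (a1 * c2 - a2 * c1) / det
        return ("unique", (x, y))
    if a1 == b1 == a2 == b2 == 0:
        # No variables left: consistent only if both right-hand sides are 0.
        return ("infinite", None) if c1 == c2 == 0 else ("inconsistent", None)
    # Coefficient rank is 1; the system is consistent only if the
    # augmented matrix also has rank 1 (all remaining 2x2 minors vanish).
    if a1 * c2 - a2 * c1 == 0 and b1 * c2 - b2 * c1 == 0:
        return ("infinite", None)
    return ("inconsistent", None)


print(classify_2x2(1, 1, 3, 1, -1, 1))  # x + y = 3, x - y = 1
print(classify_2x2(1, 1, 3, 2, 2, 7))   # x + y = 3, 2x + 2y = 7
```

A checker like this settles whether a particular answer was wrong; it does not settle the broader dispute over whether such failures are fundamental or a training issue.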

Formal math, theorem proving, and “intelligence”

One thread uses algebraic topology and simplicial/cellular homology (e.g., RP²) as a stress test.
Commenters disagree on whether a model's homology computation was correct; at least one claims the triangulation was wrong even though the final homology groups matched known results.
Some propose: a meaningful AGI benchmark would be automatically formalizing serious math (e.g., algebraic topology, Fermat’s Last Theorem) into Coq/Lean/Isabelle.
Others respond that formalization is extremely hard even for experts, so expecting it to be “a walk in the park” is unrealistic at present.
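For reference, the known result for RP² that the disputed computation would be compared against can be stated via the standard cellular chain complex. This is the textbook computation with integer coefficients, using the CW structure with one cell in each of dimensions 0, 1 and 2; it is not a reconstruction of what any model in the thread produced.

```latex
% Cellular chain complex of RP^2 (one cell in each dimension 0, 1, 2)
0 \longrightarrow \mathbb{Z} \xrightarrow{\; d_2 = 2 \;} \mathbb{Z}
  \xrightarrow{\; d_1 = 0 \;} \mathbb{Z} \longrightarrow 0

% Homology groups
H_0(\mathbb{RP}^2;\mathbb{Z}) = \mathbb{Z}, \qquad
H_1(\mathbb{RP}^2;\mathbb{Z}) = \ker d_1 / \operatorname{im} d_2
  = \mathbb{Z}/2\mathbb{Z}, \qquad
H_2(\mathbb{RP}^2;\mathbb{Z}) = \ker d_2 = 0
```

Matching these groups is a weak check: as the thread notes, an answer can land on them while the intermediate triangulation is invalid.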

Data quality, context, and “mental” vs algorithmic calculation

One contributor notes datasets are noisy: “2+2=5” appears often in literature, spam, and generated text, complicating statistics-based learning.
Discussion on context: some equalities are only “right” in specific literary or humorous settings, making “logically valid” answers context-dependent.
Debate over whether LLMs “reason” or just perform vast arithmetic over matrices; some insist everything they do reduces to arithmetic, others distinguish that low-level arithmetic from explicitly carrying out an algorithm.
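The noisy-data point can be illustrated with a purely frequency-based "learner". The corpus below is invented for illustration and says nothing about real training-data proportions; the helper name `completion_counts` is likewise hypothetical.

```python
from collections import Counter

# Made-up toy corpus: the "wrong" equation appears as a quotation or joke.
TOY_CORPUS = ["2+2=4", "2+2=4", "2+2=5", "2+2=4", "2+2=5"]


def completion_counts(corpus, prefix):
    """Count what follows `prefix` across all lines that start with it."""
    return Counter(line[len(prefix):] for line in corpus
                   if line.startswith(prefix))


counts = completion_counts(TOY_CORPUS, "2+2=")
print(counts)                 # how often each completion occurs
print(counts.most_common(1))  # what a pure frequency rule would pick
```

A frequency rule is only as good as the mix of sources it counts over, and it has no notion of the literary or humorous context in which a "wrong" equality is the expected continuation.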

Alternative encodings (Base64, T-FREE, etc.)

Multiple examples show GPT-4 handling Base64-encoded prompts and even scrambled text, suggesting robustness to some nonstandard encodings.
Caveat: performance depends on how often such patterns appeared in training; “unnatural” yet valid token splits can break behavior.
A referenced “token-free” trigram-based approach (T-FREE) interests people, but the intuition behind it and its benefits remain unclear pending code/tests.
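The two encodings discussed above can be sketched with the standard library. The Base64 helpers show what an encoded prompt looks like before it reaches a model. The `char_trigrams` helper only illustrates what "character trigrams of a word" means; the space padding is an assumption, and this does not reproduce T-FREE itself, which the thread describes only as trigram-based.

```python
import base64


def to_base64(prompt):
    """Encode a text prompt as Base64 (via its UTF-8 bytes)."""
    return base64.b64encode(prompt.encode("utf-8")).decode("ascii")


def from_base64(encoded):
    """Decode a Base64 string back to text."""
    return base64.b64decode(encoded).decode("utf-8")


def char_trigrams(word):
    """Return overlapping 3-character windows of a space-padded word.

    The padding convention is an assumption for illustration.
    """
    padded = f" {word} "
    return [padded[i:i + 3] for i in range(len(padded) - 2)]


encoded = to_base64("hello")
print(encoded)               # the Base64 form of the prompt
print(from_base64(encoded))  # round-trips to the original text
print(char_trigrams("cat"))
```

Base64 changes every character the tokenizer sees, so a model that still answers correctly must have seen enough such text in training, which is the caveat raised in the thread.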

Expectations about progress and AGI

Some commenters are impressed by rapid capability gains and see current flaws as temporary on the way to stronger systems.
Others are openly skeptical of AGI timelines and marketing claims (e.g., “PhD-level” models) given persistent basic failures.
