Transformers Can Do Arithmetic with the Right Embeddings
Arithmetic & Representations in Transformers
- Several comments argue LLMs struggle with math because of how numbers and positions are encoded, not because transformers cannot learn arithmetic.
- The lack of “column” structure, combined with left‑to‑right output, conflicts with the right‑to‑left carry propagation of digit‑wise addition; commenters argue this makes addition effectively harder (more than linear) for a vanilla transformer.
- The thread notes that common tokenizers split numbers into irregular chunks (e.g., “12345678” → “123”, “456”, “78”), obscuring digit positions and magnitudes.
- Some suggest reversing numbers (least significant digit first) and adding position-aware embeddings; this is exactly what the paper and prior work explore.
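The tokenization and digit-order points above can be illustrated with a minimal Python sketch. It is not the paper's method. `chunk_left_to_right` is a hypothetical stand-in for a tokenizer that groups digits in threes (real tokenizers differ), and `add_reversed` is an ordinary schoolbook adder that emits the least significant digit first, so each output digit depends only on digits already read plus a single carry.

```python
from itertools import zip_longest


def chunk_left_to_right(number: str, size: int = 3) -> list[str]:
    """Hypothetical tokenizer stand-in: split a digit string into fixed-size chunks."""
    return [number[i:i + size] for i in range(0, len(number), size)]


def add_reversed(a: str, b: str) -> str:
    """Add two non-negative decimal strings; return the sum least significant digit first."""
    out = []
    carry = 0
    # Walk both numbers from the ones place upward, padding the shorter with zeros.
    for da, db in zip_longest(reversed(a), reversed(b), fillvalue="0"):
        total = int(da) + int(db) + carry
        out.append(str(total % 10))  # digit that can be emitted immediately
        carry = total // 10          # the only state carried to the next step
    if carry:
        out.append(str(carry))
    return "".join(out)


print(chunk_left_to_right("12345678"))  # ['123', '456', '78']
print(add_reversed("957", "68"))        # '5201', i.e. 1025 written in reverse
```

In the chunked form, the final token “78” gives no direct indication of which powers of ten it covers; in the reversed form, the i-th emitted digit always corresponds to the i-th power of ten.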
What This Paper Adds (According to the Thread)
- Introduces specialized positional/column embeddings for digits, effectively “number-flipping” and aligning digits by significance.
- Shows sharp improvements on multi‑digit addition and some transfer to related tasks like sorting and limited multiplication.
- Several commenters see it as evidence that better encodings remove key barriers and reveal transformers’ “logical extrapolation” abilities.
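One way to picture “aligning digits by significance” is to give each digit an index that counts from the start of its own number, so that with reversed operands the same index means the same power of ten everywhere. The following is a rough sketch under stated assumptions: tokens are single characters, and `digit_position_ids` is a hypothetical helper, not the paper's code. The actual embedding scheme may differ in its details.

```python
def digit_position_ids(tokens: list[str]) -> list[int]:
    """Give each digit its 1-based offset within its number; non-digits get 0."""
    ids = []
    offset = 0
    for tok in tokens:
        if tok.isdigit():
            offset += 1       # next digit of the current number
            ids.append(offset)
        else:
            offset = 0        # an operator or '=' ends the current number
            ids.append(0)
    return ids


# Operands written least significant digit first: 957 + 68 becomes "759+86="
tokens = list("759+86=")
print(digit_position_ids(tokens))  # [1, 2, 3, 0, 1, 2, 0]
```

These indices would then select a learned embedding added to each token, which is how the ones digits of both operands could end up sharing the same positional signal.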
Skepticism and Limits
- Critics see this as a narrow, hand‑engineered hack: strong inductive bias tailored to addition, with weak or no gains for subtraction/division.
- Many stress that 99% accuracy on 100‑digit arithmetic is useless for calculators or safety‑critical domains; real systems need external, exact tools.
- Some argue it doesn’t demonstrate genuine algorithm learning, just embedding design that partially bakes in the structure of addition.
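The reliability objection above can be made concrete with a back-of-the-envelope calculation. The sketch assumes, purely for illustration, that each operation is independently correct with probability 0.99. That independence assumption is not from the thread, and `chain_success_probability` is a hypothetical helper.

```python
def chain_success_probability(per_op_accuracy: float, num_ops: int) -> float:
    """Probability that every one of num_ops independent operations is correct."""
    return per_op_accuracy ** num_ops


# Print how quickly the chance of an error-free chain decays.
for n in (1, 10, 100, 1000):
    print(n, round(chain_success_probability(0.99, n), 4))
```

Under that assumption, a chain of 100 dependent additions comes out fully correct only about 37% of the time, which is the intuition behind calls for exact external tools.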
Reasoning vs Pattern Matching
- Ongoing debate: are transformers “reasoning” or just high‑dimensional curve fitters?
- Pro‑side: success on controlled arithmetic tasks and broader benchmarks suggests nontrivial reasoning, albeit imperfect and fragile.
- Skeptical side: failure to generalize from small to arbitrary‑length arithmetic and brittleness on reasoning benchmarks indicate poor, highly domain‑bound reasoning.
Broader Architectural & Evaluation Ideas
- Several propose hybrids: LLM cores plus embedded ALU / CAS / search tools, perhaps via special tokens rather than plain text calls.
- Others emphasize richer, task‑aware embeddings as the new “feature engineering” frontier.
- Discussion touches on AGI definitions: calls for percentile‑based, task‑wise benchmarks instead of a binary “has AGI / hasn’t” framing.
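The hybrid proposal earlier in this list (an LLM core that hands arithmetic to an exact tool via special tokens) might look roughly like the sketch below. The `<calc>...</calc>` markers and both function names are hypothetical, chosen only for illustration. The evaluator deliberately supports only `+`, `-`, and `*` between non-negative integer literals, relying on Python's arbitrary-precision integers for exactness.

```python
import ast
import operator
import re

# Binary operators the toy evaluator accepts.
_OPS = {ast.Add: operator.add, ast.Sub: operator.sub, ast.Mult: operator.mul}


def exact_eval(expr: str) -> int:
    """Evaluate +, -, * over integer literals exactly, rejecting anything else."""
    def walk(node: ast.AST) -> int:
        if isinstance(node, ast.Constant) and type(node.value) is int:
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.left), walk(node.right))
        raise ValueError("unsupported expression")

    return walk(ast.parse(expr.strip(), mode="eval").body)


def resolve_calc_spans(text: str) -> str:
    """Replace each hypothetical <calc>...</calc> span with its exact result."""
    return re.sub(r"<calc>(.*?)</calc>", lambda m: str(exact_eval(m.group(1))), text)


print(resolve_calc_spans("Total: <calc>2 + 3 * 4</calc>"))  # Total: 14
```

The design choice here is that the model only has to produce a well-formed expression; correctness of the digits is delegated entirely to the tool.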