2024-05-28

Transformers Can Do Arithmetic with the Right Embeddings

Arithmetic & Representations in Transformers

Several comments argue LLMs struggle with math because of how numbers and positions are encoded, not because transformers cannot learn arithmetic.
The lack of “column” structure and the left‑to‑right output order conflict with right‑to‑left digit operations such as carrying; commenters argue this makes addition effectively harder than a linear‑time task for a vanilla transformer.
The thread notes that common tokenizers split numbers into odd chunks (e.g., “12345678” → “123”, “456”, “78”), obscuring digit positions and magnitudes.
Some suggest reversing numbers (least significant digit first) and adding position-aware embeddings; this is exactly what the paper and prior work explore.
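The chunking and reversal points above can be made concrete with a small sketch. It is only an illustration: `chunk_digits` is a hypothetical stand-in for a subword tokenizer (real tokenizers split numbers in vocabulary-dependent ways), and `add_lsd_first` is ordinary schoolbook addition, not code from the paper. It shows that once digits are written least significant first, each output digit depends only on digits already seen plus one carry.

```python
def chunk_digits(number: str, size: int = 3) -> list[str]:
    """Hypothetical tokenizer stand-in: greedy left-to-right chunks of `size` digits."""
    return [number[i:i + size] for i in range(0, len(number), size)]


def add_lsd_first(a: str, b: str) -> str:
    """Add two non-negative decimal strings.

    The result is emitted least significant digit first, so it can be
    produced strictly left to right while carrying a single digit forward.
    """
    ra, rb = a[::-1], b[::-1]  # reverse: least significant digit first
    out = []
    carry = 0
    for i in range(max(len(ra), len(rb))):
        da = int(ra[i]) if i < len(ra) else 0
        db = int(rb[i]) if i < len(rb) else 0
        carry, digit = divmod(da + db + carry, 10)
        out.append(str(digit))
    if carry:
        out.append(str(carry))
    return "".join(out)


print(chunk_digits("12345678"))          # chunks no longer line up with place value
print(add_lsd_first("999", "1")[::-1])   # reverse again to read it normally
```

In the chunked form, the leading token covers a different set of place values depending on the length of the number, which is the positional ambiguity the commenters describe.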

What This Paper Adds (According to the Thread)

Introduces specialized positional/column embeddings for digits, effectively flipping numbers and aligning digits by significance.
Shows sharp improvements on multi‑digit addition and some transfer to related tasks like sorting and limited multiplication.
Several commenters see it as evidence that better encodings remove key barriers and reveal transformers’ “logical extrapolation” abilities.
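One way to picture the column idea described in this section is as an extra index that counts each digit's offset within its own number. The sketch below is an assumption-laden illustration, not the paper's implementation: the function name `column_ids`, the character-level tokens, and the convention that non-digits get index 0 are all made up for the example. In a model, these indices would select rows of a learned embedding table that is added to the token embeddings.

```python
DIGITS = set("0123456789")


def column_ids(tokens: list[str]) -> list[int]:
    """Assign each digit its 1-based offset within its own number.

    Non-digit tokens get 0 and reset the counter, so with numbers written
    least significant digit first, equal ids mean equal place value.
    """
    ids = []
    run = 0
    for tok in tokens:
        run = run + 1 if tok in DIGITS else 0
        ids.append(run)
    return ids


# "321+54=" is 123 + 45 with both operands written least significant digit first.
tokens = list("321+54=")
print(list(zip(tokens, column_ids(tokens))))
```

Here the ones digits of both operands share id 1 and the tens digits share id 2, regardless of where the numbers sit in the overall sequence.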

Skepticism and Limits

Critics see this as a narrow, hand‑engineered hack: strong inductive bias tailored to addition, with weak or no gains for subtraction/division.
Many stress that 99% accuracy on 100‑digit arithmetic is useless for calculators or safety‑critical domains; real systems need exact external tools.
Some argue it doesn’t demonstrate genuine algorithm learning, just embedding design that partially bakes in the structure of addition.
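The accuracy objection can be illustrated with a back-of-the-envelope calculation. This is a simplification that assumes errors are independent across operations, which the thread does not establish; `chain_success` is a hypothetical helper name.

```python
def chain_success(per_op_accuracy: float, n_ops: int) -> float:
    """Probability that n_ops operations are all correct, assuming independent errors."""
    return per_op_accuracy ** n_ops


# With 99% accuracy per operation, a chain of 100 operations is fully
# correct only about 37% of the time under the independence assumption.
print(round(chain_success(0.99, 100), 3))
```

Under that assumption, per-operation accuracy that looks high still compounds into frequent failures over long computations, which is why commenters point to exact external tools.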

Reasoning vs Pattern Matching

Ongoing debate: are transformers “reasoning” or just high‑dimensional curve fitters?
Pro‑side: success on controlled arithmetic tasks and broader benchmarks suggests nontrivial reasoning, albeit imperfect and fragile.
Skeptical side: failure to generalize from small to arbitrary‑length arithmetic and brittleness on reasoning benchmarks indicate poor, highly domain‑bound reasoning.

Broader Architectural & Evaluation Ideas

Several propose hybrids: LLM cores plus embedded ALU / CAS / search tools, perhaps via special tokens rather than plain text calls.
Others emphasize richer, task‑aware embeddings as the new “feature engineering” frontier.
Discussion touches on AGI definitions: calls for percentile‑based, task‑wise benchmarks instead of a binary “has AGI / hasn’t” framing.
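The hybrid proposal in this section can be sketched as a post-processing step that hands marked spans to an exact calculator. Everything specific here is an assumption for illustration: the `<calc>…</calc>` markers, the function name `resolve_calc_spans`, and the restriction to two-operand integer expressions are hypothetical, and the commenters' suggestion of dedicated special tokens inside the model would look different in practice.

```python
import operator
import re

# Exact integer operations; Python ints have arbitrary precision.
OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}

# Hypothetical marker format: <calc>A op B</calc> with non-negative integers.
CALC = re.compile(r"<calc>\s*(\d+)\s*([+\-*])\s*(\d+)\s*</calc>")


def resolve_calc_spans(text: str) -> str:
    """Replace each marked span with the exact result of its expression."""
    def _evaluate(match: re.Match) -> str:
        a, op, b = int(match.group(1)), match.group(2), int(match.group(3))
        return str(OPS[op](a, b))

    return CALC.sub(_evaluate, text)


print(resolve_calc_spans("Total: <calc>999 + 1</calc> units"))
```

The design point is that the model only has to decide when to call the tool and with which operands; the arithmetic itself is done exactly rather than predicted digit by digit.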
