Why can't transformers learn multiplication?
Chain-of-thought (CoT) and why the toy transformers fail
- The paper’s setup: numbers are tokenized digit-by-digit with least significant digit first to make addition “attention-friendly.”
- Vanilla transformers trained only on A×B=C pairs fail to learn a generalizable multiplication algorithm, even though the architecture is, in principle, expressive enough.
- When the model is first trained to emit explicit intermediate additions (a structured CoT) and those steps are gradually removed, it does learn to multiply (see the sketch after this list).
- Commenters summarize the takeaway as: the optimization process doesn’t discover good intermediate representations/algorithms on its own; CoT supervision nudges it out of bad local minima.
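To make the setup concrete, here is a minimal Python sketch (my illustration, not the paper's code) of what least-significant-digit-first serialization and a CoT target with explicit intermediate steps might look like; the token format and helper names are assumptions.

```python
# Illustrative only: an assumed serialization, not the paper's exact format.

def digits_lsd_first(n: int) -> list[str]:
    """Split a non-negative integer into digit tokens, least significant first."""
    return list(str(n)[::-1])

def plain_example(a: int, b: int) -> list[str]:
    """A×B=C with every number written least-significant-digit first."""
    return digits_lsd_first(a) + ["*"] + digits_lsd_first(b) + ["="] + digits_lsd_first(a * b)

def cot_example(a: int, b: int) -> list[str]:
    """Same task, but with shifted partial products (the structured CoT)
    spelled out before the final answer; these steps are what gets gradually removed."""
    tokens = digits_lsd_first(a) + ["*"] + digits_lsd_first(b) + [":"]
    running = 0
    for position, digit in enumerate(str(b)[::-1]):
        partial = a * int(digit) * 10 ** position   # one shifted partial product
        running += partial
        tokens += digits_lsd_first(partial) + ["+"]
    tokens += ["="] + digits_lsd_first(running)
    return tokens

print(plain_example(12, 34))  # ['2','1','*','4','3','=','8','0','4']
print(cot_example(12, 34))    # ...partial products 48 and 360, then 408
```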
Language vs symbolic manipulation
- Several comments argue multiplication is fundamentally symbolic/schematic, not something a “language model” is naturally good at—mirroring humans, who rely on external algorithms (paper, long multiplication) rather than pure linguistic intuition.
- Others counter that human mathematics itself arose from language-based reasoning and symbolic manipulation; formalisms are just a stricter refinement of our linguistic capabilities.
- There’s debate over whether expecting strong, length-generalizing arithmetic from a pure LM amounts to forcing the wrong tool onto the job.
Representation, locality, and algorithm structure
- One theme: addition with carries is “mostly local” in digit space, while multiplication is much more non-local and compositional, making it harder to learn as a sequence-to-sequence pattern (see the sketch after this list).
- Using least-significant-digit-first encoding makes addition easier; multiplication still requires discovering multi-step subroutines (partial products, carries, etc.).
- Some suggest alternate schemes (log space, explicit numeric primitives, or numeric-first architectures) rather than learning math via token patterns.
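As a concrete illustration of the locality argument (my own sketch, not code from the thread): with least-significant-digit-first operands, each digit of a sum depends only on the two digits at that position plus one carry, while each digit of a product is a convolution over many digit pairs plus accumulated carries.

```python
# Sketch of the locality argument; operands are LSD-first digit lists.

def add_lsd(a: list[int], b: list[int]) -> list[int]:
    """Addition is 'local': output digit i needs only a[i], b[i], and one carry."""
    out, carry = [], 0
    for i in range(max(len(a), len(b))):
        s = (a[i] if i < len(a) else 0) + (b[i] if i < len(b) else 0) + carry
        out.append(s % 10)
        carry = s // 10
    if carry:
        out.append(carry)
    return out

def mul_lsd(a: list[int], b: list[int]) -> list[int]:
    """Multiplication is non-local: output digit k sums a[i]*b[j] over all i+j == k,
    so even early output positions depend on many input positions."""
    acc = [0] * (len(a) + len(b))
    for i, da in enumerate(a):
        for j, db in enumerate(b):
            acc[i + j] += da * db            # digit-pair convolution
    out, carry = [], 0
    for v in acc:
        v += carry
        out.append(v % 10)
        carry = v // 10
    while len(out) > 1 and out[-1] == 0:     # drop zero high-order digits (at the end in LSD order)
        out.pop()
    return out

assert add_lsd([2, 1], [4, 3]) == [6, 4]     # 12 + 34 = 46
assert mul_lsd([2, 1], [4, 3]) == [8, 0, 4]  # 12 * 34 = 408
```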
Training vs learning; curriculum and evolution analogies
- Multiple comments distinguish “training” (offline weight updates) from “learning” (online adaptation during use); current LMs mostly do the former.
- Curriculum learning is raised as a human-like strategy: progressively harder tasks (letters → words → sentences; small numbers → bigger algorithms), as in the toy loop sketched after this list.
- There’s discussion of whether architectures should be designed to continuously learn new paradigms (e.g., a major physics breakthrough) rather than requiring full retraining.
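A framework-agnostic sketch of the curriculum idea (the `train_step` and `eval_accuracy` methods are hypothetical placeholders, not from the paper or the thread): operand length grows only once the current stage is nearly solved.

```python
# Hypothetical curriculum loop; model.train_step / model.eval_accuracy are assumed APIs.
import random

def sample_batch(max_digits: int, batch_size: int = 32):
    """Multiplication problems whose operands have at most `max_digits` digits."""
    hi = 10 ** max_digits - 1
    return [(a, b, a * b)
            for a, b in ((random.randint(0, hi), random.randint(0, hi))
                         for _ in range(batch_size))]

def run_curriculum(model, stages=(1, 2, 3, 4), threshold=0.95, steps_per_check=1000):
    """Advance to longer operands only once the current stage is (nearly) solved."""
    for max_digits in stages:
        while True:
            for _ in range(steps_per_check):
                model.train_step(sample_batch(max_digits))       # assumed API
            if model.eval_accuracy(sample_batch(max_digits)) >= threshold:
                break   # stage mastered; move on to harder problems
```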
Probabilistic models vs deterministic tasks
- A simplistic claim is that “probabilistic output” explains the failure on deterministic multiplication; others rebut this, noting transformers can learn many deterministic functions (including addition) and can be run at zero temperature.
- A more nuanced view: exact arithmetic (like cryptography or banking balances) is “precision computing,” unlike the inherently tolerant, probabilistic nature of most ML tasks.
- Even at temperature 0, floating-point nondeterminism and accumulated small errors make long algorithmic chains brittle (toy demonstration after this list).
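A toy demonstration of that last point (my own, not from the thread): temperature-0 decoding is a plain argmax, yet floating-point summation order can still flip near-tied logits.

```python
# Temperature-0 decoding is deterministic given identical arithmetic, but
# floating-point non-associativity can decide near-ties differently across runs.

def greedy_token(logits):
    """Temperature-0 decoding: pick the argmax; no sampling is involved."""
    return max(range(len(logits)), key=lambda i: logits[i])

# The same three numbers, summed in a different order, give two different floats.
a = (0.1 + 0.2) + 0.3        # 0.6000000000000001
b = 0.1 + (0.2 + 0.3)        # 0.6
print(a == b)                # False

# If a competing logit is essentially tied, that tiny discrepancy decides the argmax,
# and one flipped digit early in a long autoregressive chain ruins the whole answer.
tied = 0.6000000000000001
print(greedy_token([a, tied]), greedy_token([b, tied]))   # 0 1
```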
Tools, loops, and practical systems
- Several commenters note that real systems can “shell out” to tools (calculators, code execution, CPU simulators), so the transformer need only orchestrate, not internally implement, exact multiplication (see the orchestration sketch at the end of this list).
- Iterative use—running models in loops, having them leave notes, or maintain external state—can approximate algorithmic behavior but scales poorly when errors compound.
- Overall sentiment: transformers can simulate arithmetic procedures to a degree (especially with CoT and tools), but using them as standalone exact multipliers exposes fundamental architectural and training limitations.
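A minimal sketch of that orchestration pattern (the `CALC(...)` marker and the model stub are invented for illustration, not any real system's API): the model only decides what to compute, and the exact multiplication happens in ordinary code.

```python
# Hypothetical tool-calling loop; only the orchestration pattern is the point.
import re

def calculator(expression: str) -> str:
    """Exact integer arithmetic done outside the model."""
    a, op, b = re.fullmatch(r"\s*(\d+)\s*([*+-])\s*(\d+)\s*", expression).groups()
    ops = {"*": lambda x, y: x * y, "+": lambda x, y: x + y, "-": lambda x, y: x - y}
    return str(ops[op](int(a), int(b)))

def answer_with_tools(prompt: str, model_generate) -> str:
    """The model drafts text containing CALC(...) markers; exact results are
    substituted in afterwards, so the model orchestrates but never multiplies."""
    draft = model_generate(prompt)
    return re.sub(r"CALC\(([^)]*)\)", lambda m: calculator(m.group(1)), draft)

# Stubbed "model" that chooses to call the tool rather than compute the product itself:
fake_model = lambda prompt: "The product is CALC(1234 * 5678)."
print(answer_with_tools("What is 1234 x 5678?", fake_model))
# -> The product is 7006652.
```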