Why can't transformers learn multiplication?

Chain-of-thought (CoT) and why the paper’s toy transformers fail

  • The paper’s setup: numbers are tokenized digit-by-digit with least significant digit first to make addition “attention-friendly.”
  • Vanilla transformers trained only on A×B=C pairs fail to learn a generalizable multiplication algorithm, even though the architecture is, in principle, expressive enough.
  • When the model is first trained to emit explicit intermediate additions (a structured CoT) and those steps are then gradually removed, it does learn to multiply (a possible serialization is sketched after this list).
  • Commenters summarize the takeaway as follows: the optimization process doesn’t discover good intermediate representations/algorithms on its own; CoT supervision nudges it out of bad local minima.
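
For concreteness, here is a minimal sketch of how such training examples might be serialized, assuming LSD-first digit tokens and one partial product per digit of the second operand; the exact token format in the paper may differ.

```python
# Hypothetical serialization of a multiplication example, loosely following the
# setup summarized above: digits are emitted least-significant-digit first, and
# an optional chain-of-thought lists the partial products before the answer.

def lsd_digits(n: int) -> list[str]:
    """Return the decimal digits of n, least significant first."""
    return list(str(n)[::-1])

def serialize_example(a: int, b: int, with_cot: bool = True) -> str:
    tokens = lsd_digits(a) + ["*"] + lsd_digits(b) + ["="]
    if with_cot:
        # One partial product per digit of b, already shifted by its place value,
        # mimicking the intermediate additions of long multiplication.
        for place, d in enumerate(lsd_digits(b)):
            partial = a * int(d) * 10**place
            tokens += lsd_digits(partial) + ["+"]
        tokens += [">"]  # hypothetical separator before the final answer
    tokens += lsd_digits(a * b)
    return " ".join(tokens)

print(serialize_example(37, 24))         # with intermediate additions
print(serialize_example(37, 24, False))  # plain A*B=C pair
```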

Language vs symbolic manipulation

  • Several comments argue multiplication is fundamentally symbolic/schematic, not something a “language model” is naturally good at, mirroring humans, who rely on external algorithms (pen and paper, long multiplication) rather than pure linguistic intuition.
  • Others counter that human mathematics itself arose from language-based reasoning and symbolic manipulation; formalisms are just a stricter refinement of our linguistic capabilities.
  • There’s debate over whether expecting strong, length-generalizing arithmetic from a pure LM amounts to using the wrong tool for the job.

Representation, locality, and algorithm structure

  • One theme: addition with carries is “mostly local” in digit space, while multiplication is much more non-local and compositional, making it harder to learn as a sequence-to-sequence pattern (see the sketch after this list).
  • Using least-significant-digit-first encoding makes addition easier; multiplication still requires discovering multi-step subroutines (partial products, carries, etc.).
  • Some suggest alternative schemes (log-space representations, explicit numeric primitives, or numeric-first architectures) rather than learning math via token patterns.
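
The locality contrast can be made concrete with the two schoolbook algorithms on LSD-first digit lists (this framing is mine, not the paper’s): each output digit of a sum needs only earlier positions plus a carry, while each output digit of a product mixes contributions from many digit pairs.

```python
# With least-significant-digit-first operands, output digit i of a SUM depends
# only on input digits 0..i (via the carry), while output digit i of a PRODUCT
# accumulates contributions from every digit pair (j, k) with j + k <= i.

def add_lsd(a_digits, b_digits):
    """Schoolbook addition on LSD-first digit lists; digit i needs only positions <= i."""
    out, carry = [], 0
    for i in range(max(len(a_digits), len(b_digits))):
        s = carry
        s += a_digits[i] if i < len(a_digits) else 0
        s += b_digits[i] if i < len(b_digits) else 0
        out.append(s % 10)
        carry = s // 10
    if carry:
        out.append(carry)
    return out

def mul_lsd(a_digits, b_digits):
    """Schoolbook multiplication; digit i mixes contributions from many (j, k) pairs."""
    out = [0] * (len(a_digits) + len(b_digits))
    for j, da in enumerate(a_digits):
        for k, db in enumerate(b_digits):
            out[j + k] += da * db          # non-local: pair (j, k) feeds position j + k
    carry = 0
    for i in range(len(out)):              # single carry-propagation pass
        total = out[i] + carry
        out[i] = total % 10
        carry = total // 10
    while len(out) > 1 and out[-1] == 0:
        out.pop()
    return out

print(add_lsd([7, 3], [4, 2]))  # 37 + 24 = 61  -> [1, 6]
print(mul_lsd([7, 3], [4, 2]))  # 37 * 24 = 888 -> [8, 8, 8]
```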

Training vs learning; curriculum and evolution analogies

  • Multiple comments distinguish “training” (offline weight updates) from “learning” (online adaptation during use); current LMs mostly do the former.
  • Curriculum learning is raised as a human-like strategy: progressively harder tasks (letters → words → sentences; small numbers → longer operands and more complex algorithms); a minimal schedule is sketched after this list.
  • There’s discussion of whether architectures should be designed to continuously learn new paradigms (e.g., a major physics breakthrough) rather than requiring full retraining.
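
A minimal curriculum schedule along these lines might look like the following; the function name, growth rule, and digit cap are illustrative assumptions, not anything specified in the paper or the thread.

```python
# Illustrative curriculum sampler: operand length grows as training progresses,
# so the model sees small multiplications before large ones.

import random

def sample_multiplication(step: int, total_steps: int, max_digits: int = 8) -> str:
    """Sample an A*B=C training example whose operand length grows with progress."""
    progress = step / total_steps
    digits = min(max_digits, 1 + int(progress * max_digits))  # 1 digit early, max_digits late
    a = random.randrange(10 ** (digits - 1), 10 ** digits)
    b = random.randrange(10 ** (digits - 1), 10 ** digits)
    return f"{a}*{b}={a * b}"

for step in (0, 5_000, 9_999):
    print(sample_multiplication(step, total_steps=10_000))
```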

Probabilistic models vs deterministic tasks

  • One simplistic claim is that “probabilistic output” explains the failure on deterministic multiplication; others rebut this, noting that transformers can learn many deterministic functions (including addition) and can be run at zero temperature (see the sketch after this list).
  • A more nuanced view: exact arithmetic (as in cryptography or bank balances) is “precision computing,” unlike the error-tolerant, probabilistic nature of most ML tasks.
  • Even at temperature 0, floating-point nondeterminism and small accumulated errors make long algorithmic chains brittle.
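
To make the zero-temperature point concrete: greedy decoding is just an argmax over logits, so any brittleness comes from the logits themselves. The toy example below (not tied to any particular model) shows a near-tie between two digit tokens flipping under a tiny perturbation.

```python
# Greedy ("temperature 0") decoding is deterministic given identical logits,
# but a floating-point-sized change in a logit can still flip the chosen token
# when two candidates nearly tie.

def greedy_pick(logits: dict[str, float]) -> str:
    """Temperature-0 decoding: always take the highest-scoring token."""
    return max(logits, key=logits.get)

near_tie = {"7": 3.1415926, "8": 3.1415925}
perturbed = {tok: score + (2e-7 if tok == "8" else 0.0) for tok, score in near_tie.items()}

print(greedy_pick(near_tie))   # "7"
print(greedy_pick(perturbed))  # "8": a ~2e-7 change in one logit flips the emitted digit
```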

Tools, loops, and practical systems

  • Several commenters note that real systems can “shell out” to tools (calculators, code execution, CPU simulators), so the transformer only needs to orchestrate, not internally implement, exact multiplication (a bare-bones version of this pattern is sketched after this list).
  • Iterative use (running models in loops, having them leave notes, or maintaining external state) can approximate algorithmic behavior, but it scales poorly when errors compound.
  • Overall sentiment: transformers can simulate arithmetic procedures to a degree (especially with CoT and tools), but using them as standalone exact multipliers exposes fundamental architectural and training limitations.
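
A bare-bones version of the tool-orchestration pattern might look like this; the CALC(...) marker and the helper function are hypothetical, standing in for whatever structured tool-call format a real system would use.

```python
# The model only has to emit a structured tool request; exact multiplication
# happens in ordinary arbitrary-precision integer arithmetic outside the network.

import re

def run_with_calculator(model_output: str) -> str:
    """Replace CALC(a*b) markers emitted by a model with exact products."""
    def evaluate(match: re.Match) -> str:
        a, b = int(match.group(1)), int(match.group(2))
        return str(a * b)  # exact, arbitrary-precision integers
    return re.sub(r"CALC\((\d+)\*(\d+)\)", evaluate, model_output)

# Pretend the model produced this text instead of computing the product itself.
draft = "The warehouse holds CALC(4729*3861) units in total."
print(run_with_calculator(draft))  # "The warehouse holds 18258669 units in total."
```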