Why can't transformers learn multiplication?

Chain-of-thought (CoT) and why the paper’s toy transformers fail

  • The paper’s setup: numbers are tokenized digit-by-digit with least significant digit first to make addition “attention-friendly.”
  • Vanilla transformers trained only on A×B=C pairs fail to learn a generalizable multiplication algorithm, even though the architecture is, in principle, expressive enough.
  • When the model is first trained to emit explicit intermediate additions (a structured CoT) and those steps are then gradually removed, it does learn to multiply (a possible serialization is sketched after this list).
  • Commenters summarize the takeaway as follows: the optimization process doesn’t discover good intermediate representations/algorithms on its own; CoT supervision nudges it out of bad local minima.
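
For concreteness, here is a minimal sketch of how such training examples might be serialized, assuming LSD-first digit tokens and one partial product per digit of the second operand; the exact token format in the paper may differ.

```python
# Hypothetical serialization of a multiplication example, loosely following the
# setup summarized above: digits are emitted least-significant-digit first, and
# an optional chain-of-thought lists the partial products before the answer.

def lsd_digits(n: int) -> list[str]:
    """Return the decimal digits of n, least significant first."""
    return list(str(n)[::-1])

def serialize_example(a: int, b: int, with_cot: bool = True) -> str:
    tokens = lsd_digits(a) + ["*"] + lsd_digits(b) + ["="]
    if with_cot:
        # One partial product per digit of b, already shifted by its place value,
        # mimicking the intermediate additions of long multiplication.
        for place, d in enumerate(lsd_digits(b)):
            partial = a * int(d) * 10**place
            tokens += lsd_digits(partial) + ["+"]
        tokens += [">"]  # hypothetical separator before the final answer
    tokens += lsd_digits(a * b)
    return " ".join(tokens)

print(serialize_example(37, 24))         # with intermediate additions
print(serialize_example(37, 24, False))  # plain A*B=C pair
```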

Language vs symbolic manipulation

  • Several comments argue multiplication is fundamentally symbolic/schematic, not something a “language model” is naturally good at, mirroring humans, who rely on external algorithms (pen and paper, long multiplication) rather than pure linguistic intuition.
  • Others counter that human mathematics itself arose from language-based reasoning and symbolic manipulation; formalisms are just a stricter refinement of our linguistic capabilities.
  • There’s debate over whether expecting strong, length-generalizing arithmetic from a pure LM amounts to using the wrong tool for the job.

Representation, locality, and algorithm structure

  • One theme: addition with carries is “mostly local” in digit space, while multiplication is much more non-local and compositional, making it harder to learn as a sequence-to-sequence pattern (see the sketch after this list).
  • Using least-significant-digit-first encoding makes addition easier; multiplication still requires discovering multi-step subroutines (partial products, carries, etc.).
  • Some suggest alternative schemes (log-space representations, explicit numeric primitives, or numeric-first architectures) rather than learning math via token patterns.
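
The locality contrast can be made concrete with the two schoolbook algorithms on LSD-first digit lists (this framing is mine, not the paper’s): each output digit of a sum needs only earlier positions plus a carry, while each output digit of a product mixes contributions from many digit pairs.

```python
# With least-significant-digit-first operands, output digit i of a SUM depends
# only on input digits 0..i (via the carry), while output digit i of a PRODUCT
# accumulates contributions from every digit pair (j, k) with j + k <= i.

def add_lsd(a_digits, b_digits):
    """Schoolbook addition on LSD-first digit lists; digit i needs only positions <= i."""
    out, carry = [], 0
    for i in range(max(len(a_digits), len(b_digits))):
        s = carry
        s += a_digits[i] if i < len(a_digits) else 0
        s += b_digits[i] if i < len(b_digits) else 0
        out.append(s % 10)
        carry = s // 10
    if carry:
        out.append(carry)
    return out

def mul_lsd(a_digits, b_digits):
    """Schoolbook multiplication; digit i mixes contributions from many (j, k) pairs."""
    out = [0] * (len(a_digits) + len(b_digits))
    for j, da in enumerate(a_digits):
        for k, db in enumerate(b_digits):
            out[j + k] += da * db          # non-local: pair (j, k) feeds position j + k
    carry = 0
    for i in range(len(out)):              # single carry-propagation pass
        total = out[i] + carry
        out[i] = total % 10
        carry = total // 10
    while len(out) > 1 and out[-1] == 0:
        out.pop()
    return out

print(add_lsd([7, 3], [4, 2]))  # 37 + 24 = 61  -> [1, 6]
print(mul_lsd([7, 3], [4, 2]))  # 37 * 24 = 888 -> [8, 8, 8]
```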

Training vs learning; curriculum and evolution analogies

  • Multiple comments distinguish “training” (offline weight updates) from “learning” (online adaptation during use); current LMs mostly do the former.
  • Curriculum learning is raised as a human-like strategy: progressively harder tasks (letters → words → sentences; small numbers → longer operands and more complex algorithms); a minimal schedule is sketched after this list.
  • There’s discussion of whether architectures should be designed to continuously learn new paradigms (e.g., a major physics breakthrough) rather than requiring full retraining.
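
A minimal curriculum schedule along these lines might look like the following; the function name, growth rule, and digit cap are illustrative assumptions, not anything specified in the paper or the thread.

```python
# Illustrative curriculum sampler: operand length grows as training progresses,
# so the model sees small multiplications before large ones.

import random

def sample_multiplication(step: int, total_steps: int, max_digits: int = 8) -> str:
    """Sample an A*B=C training example whose operand length grows with progress."""
    progress = step / total_steps
    digits = min(max_digits, 1 + int(progress * max_digits))  # 1 digit early, max_digits late
    a = random.randrange(10 ** (digits - 1), 10 ** digits)
    b = random.randrange(10 ** (digits - 1), 10 ** digits)
    return f"{a}*{b}={a * b}"

for step in (0, 5_000, 9_999):
    print(sample_multiplication(step, total_steps=10_000))
```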

Probabilistic models vs deterministic tasks

  • One simplistic claim is that “probabilistic output” explains the failure on deterministic multiplication; others rebut this, noting that transformers can learn many deterministic functions (including addition) and can be run at zero temperature (see the sketch after this list).
  • A more nuanced view: exact arithmetic (as in cryptography or bank balances) is “precision computing,” unlike the error-tolerant, probabilistic nature of most ML tasks.
  • Even at temperature 0, floating-point nondeterminism and small accumulated errors make long algorithmic chains brittle.
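
To make the zero-temperature point concrete: greedy decoding is just an argmax over logits, so any brittleness comes from the logits themselves. The toy example below (not tied to any particular model) shows a near-tie between two digit tokens flipping under a tiny perturbation.

```python
# Greedy ("temperature 0") decoding is deterministic given identical logits,
# but a floating-point-sized change in a logit can still flip the chosen token
# when two candidates nearly tie.

def greedy_pick(logits: dict[str, float]) -> str:
    """Temperature-0 decoding: always take the highest-scoring token."""
    return max(logits, key=logits.get)

near_tie = {"7": 3.1415926, "8": 3.1415925}
perturbed = {tok: score + (2e-7 if tok == "8" else 0.0) for tok, score in near_tie.items()}

print(greedy_pick(near_tie))   # "7"
print(greedy_pick(perturbed))  # "8": a ~2e-7 change in one logit flips the emitted digit
```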

Tools, loops, and practical systems

  • Several commenters note that real systems can “shell out” to tools (calculators, code execution, CPU simulators), so the transformer only needs to orchestrate, not internally implement, exact multiplication (a bare-bones version of this pattern is sketched after this list).
  • Iterative use (running models in loops, having them leave notes, or maintaining external state) can approximate algorithmic behavior, but it scales poorly when errors compound.
  • Overall sentiment: transformers can simulate arithmetic procedures to a degree (especially with CoT and tools), but using them as standalone exact multipliers exposes fundamental architectural and training limitations.
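
A bare-bones version of the tool-orchestration pattern might look like this; the CALC(...) marker and the helper function are hypothetical, standing in for whatever structured tool-call format a real system would use.

```python
# The model only has to emit a structured tool request; exact multiplication
# happens in ordinary arbitrary-precision integer arithmetic outside the network.

import re

def run_with_calculator(model_output: str) -> str:
    """Replace CALC(a*b) markers emitted by a model with exact products."""
    def evaluate(match: re.Match) -> str:
        a, b = int(match.group(1)), int(match.group(2))
        return str(a * b)  # exact, arbitrary-precision integers
    return re.sub(r"CALC\((\d+)\*(\d+)\)", evaluate, model_output)

# Pretend the model produced this text instead of computing the product itself.
draft = "The warehouse holds CALC(4729*3861) units in total."
print(run_with_calculator(draft))  # "The warehouse holds 18258669 units in total."
```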