Who Invented Backpropagation?

Automatic differentiation, gradient descent, and backprop

  • Commenters distinguish:
    • Gradient descent (very old, “obvious” once you have gradients).
    • Automatic differentiation (AD) as an efficient way to compute gradients.
    • Backpropagation as reverse‑mode AD applied to neural networks.
  • Reverse‑mode AD:
    • Applies the chain rule “backwards,” caching intermediate values.
    • Is efficient for many-input / few-output functions (e.g., a scalar training loss over millions of parameters).
    • Conceptually dual to forward mode, which is better for few-input / many-output.
  • Explanations compare reverse vs. forward mode to memoized vs. naive recursion, and to standard vector-calculus derivations; a minimal reverse-mode sketch follows this list.
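
  A minimal pure-Python sketch of reverse-mode AD on scalars (illustrative code, not taken from the thread; names such as Value are hypothetical): the forward pass records each operation and its local derivatives, and the backward pass replays the chain rule in reverse.

    class Value:
        """Scalar that records its computation graph for reverse-mode AD."""
        def __init__(self, data, parents=(), local_grads=()):
            self.data = data
            self.grad = 0.0
            self._parents = parents          # nodes this value was computed from
            self._local_grads = local_grads  # d(self)/d(parent) for each parent

        def __add__(self, other):
            other = other if isinstance(other, Value) else Value(other)
            return Value(self.data + other.data, (self, other), (1.0, 1.0))

        def __mul__(self, other):
            other = other if isinstance(other, Value) else Value(other)
            return Value(self.data * other.data, (self, other),
                         (other.data, self.data))

        def backward(self):
            # Topologically order the graph, then apply the chain rule in
            # reverse, reusing the forward values cached in _local_grads.
            order, seen = [], set()
            def visit(v):
                if id(v) not in seen:
                    seen.add(id(v))
                    for p in v._parents:
                        visit(p)
                    order.append(v)
            visit(self)
            self.grad = 1.0                  # d(output)/d(output)
            for v in reversed(order):
                for p, g in zip(v._parents, v._local_grads):
                    p.grad += g * v.grad     # chain rule, accumulated backwards

    # One backward pass yields the gradient with respect to every input at
    # once, which is why reverse mode suits many-input / one-output losses.
    x, y = Value(2.0), Value(3.0)
    out = x * y + x
    out.backward()
    print(x.grad, y.grad)  # 4.0 2.0  (d out/dx = y + 1, d out/dy = x)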

Control theory, Apollo, and adjoint methods

  • Several commenters link early backprop-like ideas to optimal control and adjoint/gradient methods from the 1960s:
    • Papers on optimal flight paths and lunar-mission thrust programming that used steepest descent with adjoint-computed gradients.
    • Classic optimal control texts that derive, via Lagrange multipliers, a procedure essentially identical to backprop (sketched after this list).
  • There is debate over whether a popular essay’s line about “optimizing Apollo thrusts” referred specifically to backprop or more generally to control theory.
  • Some note that many neural nets can be cast as state-space systems, but add that reframing learning as optimal control is rarely useful in practice.
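
  A hedged sketch of the connection (notation illustrative, not from the thread): minimizing a terminal cost over discrete-time dynamics with Lagrange multipliers yields a backward adjoint recursion that coincides with backprop when the dynamics are network layers.

    \[
      \min_{\theta}\; J = \Phi(x_N)
      \quad\text{s.t.}\quad x_{k+1} = f_k(x_k, \theta_k),\qquad k = 0,\dots,N-1,
    \]
    \[
      \mathcal{L} = \Phi(x_N) + \sum_{k=0}^{N-1} \lambda_{k+1}^{\top}\bigl(f_k(x_k,\theta_k) - x_{k+1}\bigr),
    \]
    \[
      \lambda_N = \frac{\partial \Phi}{\partial x_N},\qquad
      \lambda_k = \Bigl(\frac{\partial f_k}{\partial x_k}\Bigr)^{\!\top}\lambda_{k+1},\qquad
      \frac{\partial J}{\partial \theta_k} = \Bigl(\frac{\partial f_k}{\partial \theta_k}\Bigr)^{\!\top}\lambda_{k+1}.
    \]

  When each \(f_k\) is a layer and \(\Phi\) is the loss, the backward \(\lambda_k\) recursion is exactly the backpropagated error signal.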

“Just the chain rule?” Novelty vs triviality

  • One camp: backprop is “just the chain rule,” so asking who invented it is uninteresting; any 17th‑century calculus inventor could have done it.
  • Counterpoint (echoing the article): the novelty is the efficient application of the chain rule to large computation graphs; many inefficient ways exist.
  • There’s a technical side debate:
    • One view: symbolic differentiation and AD are fundamentally different, and naive symbolic methods suffer exponential expression swell.
    • Opposing view: with DAG representations and common-subexpression elimination, symbolic differentiation and AD are effectively equivalent implementations of the same math (a toy illustration of the swell issue follows this list).
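
  A toy illustration of the expression-swell point (hypothetical code, not from the thread): naive differentiation on expression trees duplicates subexpressions via the product rule, whereas sharing on a DAG, as the opposing view notes, keeps the derivative compact.

    class Var:                       # the single variable x
        pass

    class Mul:                       # product node u * v
        def __init__(self, left, right):
            self.left, self.right = left, right

    class Add:                       # sum node u + v
        def __init__(self, left, right):
            self.left, self.right = left, right

    def diff(e):
        """Naive symbolic d e / d x: builds a fresh tree with no sharing."""
        if isinstance(e, Var):
            return 1.0                                   # dx/dx
        if isinstance(e, Add):
            return Add(diff(e.left), diff(e.right))
        # Product rule d(uv) = u'v + u v': both factors are copied wholesale.
        return Add(Mul(diff(e.left), e.right), Mul(e.left, diff(e.right)))

    def tree_size(e):
        """Node count when the expression is treated as a tree (no sharing)."""
        if isinstance(e, (float, Var)):
            return 1
        return 1 + tree_size(e.left) + tree_size(e.right)

    # x**(2**n) built by repeated squaring is a tiny DAG, but the naive
    # derivative's tree size roughly doubles with every extra level.
    e = Var()
    for n in range(1, 8):
        e = Mul(e, e)
        print(n, tree_size(diff(e)))
    # Reverse-mode AD (or symbolic differentiation with common-subexpression
    # elimination on the DAG) keeps the gradient's size linear instead.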

Attribution fights and awards

  • Multiple commenters say backprop has been “invented and forgotten” many times and question the value of adjudicating priority at all.
  • Others argue that careful historical credit matters, especially for overlooked groups (e.g., Japanese researchers).
  • The article’s author is seen by some as doing serious archival work; others see it as “sour grapes” about major prizes for deep learning pioneers.
  • There’s extended back-and-forth about:
    • Whether certain AI researchers deserved a Nobel in physics or only a computing award.
    • Whether the physics community actually views ML contributions as worthy physics.
    • The broader pattern of a North American establishment over‑crediting its own.

Why backprop mattered late

  • Commenters note that neural networks and backprop were long viewed skeptically because deep nets were hard to train.
  • They emphasize that:
    • Backprop alone wasn’t enough; practical success required architectural innovations (CNNs, recurrent variants, transformers), better optimizers and activation functions, and techniques to mitigate exploding/vanishing gradients.
    • GPU computing and differentiable-programming frameworks (Theano, TensorFlow, PyTorch, JAX) were major enabling factors; see the one-call gradient example after this list.
  • Some share personal anecdotes of early enthusiasm for neural nets and evolutionary training, and of regret at leaving AI before the 2010s deep-learning boom.
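
  As a small illustration of the framework point (illustrative code, assuming JAX; the toy loss and shapes are made up), reverse-mode AD is reduced to a single call:

    import jax
    import jax.numpy as jnp

    def loss(w, x, y):
        """Toy scalar loss: mean squared error of a one-layer tanh model."""
        pred = jnp.tanh(x @ w)
        return jnp.mean((pred - y) ** 2)

    grad_loss = jax.grad(loss)     # reverse-mode AD w.r.t. the first argument
    w = jnp.zeros(3)
    x = jnp.ones((5, 3))
    y = jnp.zeros(5)
    print(grad_loss(w, x, y))      # gradient with the same shape as w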