There Will Be a Scientific Theory of Deep Learning

Overall reactions to the paper

  • Many find the survey engaging, comprehensive, and particularly value the open-problems section.
  • Others see the title as overconfident “flag planting,” but still consider the work useful as a synthesis and standardization of ideas.
  • Several note a gap between active theory research and public perception that “it’s all just a black box.”

History: why deep learning took off when it did

  • Key inflection points cited: AlexNet (2012) for vision, then attention and transformers for language.
  • Crucial enabling factors: GPUs, much larger curated datasets (e.g., ImageNet), and better software frameworks that made complex models practical.
  • Some argue transformers “could have existed earlier,” but most replies stress that at small scale the results would have been underwhelming, or the models infeasible to train.

Architecture vs scale and data

  • Strong debate:
    • One camp emphasizes “bagillions of parameters” and data as the main driver (the “bitter lesson”).
    • Another stresses architectural inductive biases and optimizer interactions; not all scalable architectures work, and many design choices are the difference between success and failure.
  • Neural nets are compared to “learned kernels” with powerful compositionality; nonparametric and kernel methods hit computational limits (time and memory) at modern data scales.
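The kernel-scaling point above can be made concrete with a back-of-envelope calculation (sizes chosen for illustration, not taken from the discussion): an exact kernel method materializes an n × n Gram matrix, so memory grows quadratically in the number of samples, while a neural net of fixed parameter count costs only linear passes over the data.

```python
# Back-of-envelope sketch: why exact kernel methods hit limits at modern
# data scale. A dense n x n Gram matrix needs O(n^2) memory, and naive
# kernel ridge regression costs O(n^3) time to solve.

def gram_matrix_bytes(n, dtype_bytes=8):
    """Memory for a dense n x n kernel matrix in float64."""
    return n * n * dtype_bytes

for n in (10_000, 1_000_000):
    gib = gram_matrix_bytes(n) / 2**30
    print(f"n={n:>9,d}: Gram matrix ~ {gib:,.1f} GiB")
```

At n = 10,000 the Gram matrix is under a gigabyte; at n = 1,000,000 it is several terabytes, which is why exact kernel methods give way to approximations (or to neural nets) at modern dataset sizes.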

Relation to brains, evolution, and biology

  • Some argue deep learning is quite unlike brains; others see evolution as an end‑to‑end optimization process that pretrains brain structure.
  • There is discussion of local vs global learning rules (predictive coding vs backprop), and whether the brain approximates gradient descent.

Can there be a real “theory of deep learning”?

  • Optimists draw analogies to statistical mechanics, information geometry, and information‑theoretic views (implicit regularization, compression, scaling laws).
  • Skeptics doubt we’ll get physics‑like theories because behavior depends massively on messy data and huge models; they question whether concentration‑of‑measure style simplifications apply.
  • Some raise computability concerns (Rice’s theorem, Turing completeness), but others argue typical feed‑forward nets are not Turing complete, so classical impossibility results may not apply.
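The scaling-law view mentioned above can be sketched concretely: empirical scaling-law work typically fits a power law L(N) = c · N^(−α) to loss-versus-model-size data, which becomes ordinary linear regression in log-log space. The constants below are made up for illustration, not from any real model family.

```python
import math

# Synthetic "loss vs model size" points generated from L(N) = c * N**(-alpha).
# c_true and alpha_true are illustrative values only.
c_true, alpha_true = 10.0, 0.3
sizes = [1e6, 1e7, 1e8, 1e9]
losses = [c_true * n ** (-alpha_true) for n in sizes]

# A power law is linear in log-log space: log L = log c - alpha * log N,
# so least squares on the logs recovers the exponent.
xs = [math.log(n) for n in sizes]
ys = [math.log(loss) for loss in losses]
k = len(xs)
mx, my = sum(xs) / k, sum(ys) / k
slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
alpha_hat = -slope
print(f"recovered exponent alpha ≈ {alpha_hat:.3f}")  # ≈ 0.300
```

On real measurements the fit is noisy and the functional form itself is contested, which is part of why skeptics doubt that clean physics-style laws will hold up.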

Interpretability, hallucination, and failure prediction

  • Several emphasize that theory matters most for predicting failure modes, confidence, and hallucination.
  • Out-of-distribution (OOD) detection is seen as conceptually shaky; alternative approaches based on model misspecification are being explored but remain expensive and niche.
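One widely used baseline for the confidence question above (and one commenters would call conceptually shaky) is maximum softmax probability: treat low-confidence predictions as likely out-of-distribution. A minimal sketch, with made-up logits:

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def msp_score(logits):
    """Max softmax probability: higher is treated as more in-distribution."""
    return max(softmax(logits))

# A confidently peaked prediction vs near-uniform logits (illustrative values).
in_dist = [9.0, 0.5, 0.2]
ood = [1.1, 1.0, 0.9]
print(msp_score(in_dist) > msp_score(ood))  # True under this heuristic
```

The known weakness, which motivates the misspecification-based alternatives mentioned above, is that networks can be highly confident on inputs far from the training distribution, so a high score does not certify in-distribution data.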

Models vs other ML approaches

  • Neural nets dominate unstructured data (images, audio, text), where their inductive biases match data structure.
  • Tree-based methods (often with boosting) remain superior on tabular data due to more suitable inductive biases there.
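To make the boosting point concrete, here is a minimal sketch of gradient boosting with decision stumps under squared loss — the core loop behind libraries like XGBoost and LightGBM, which add far more (regularization, histogram splits, column sampling, etc.). The toy data is illustrative.

```python
# Minimal gradient boosting sketch: fit stumps to residuals, shrink by a
# learning rate, and sum the ensemble. Regression with squared loss.

def fit_stump(xs, residuals):
    """Best single-split stump minimizing squared error on the residuals."""
    best = None
    for threshold in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        if not left or not right:
            continue
        lmean, rmean = sum(left) / len(left), sum(right) / len(right)
        err = (sum((r - lmean) ** 2 for r in left)
               + sum((r - rmean) ** 2 for r in right))
        if best is None or err < best[0]:
            best = (err, threshold, lmean, rmean)
    _, t, lm, rm = best
    return lambda x: lm if x <= t else rm

def boost(xs, ys, rounds=50, lr=0.1):
    """Fit an additive ensemble of stumps to the running residuals."""
    pred = [0.0] * len(xs)
    stumps = []
    for _ in range(rounds):
        resid = [y - p for y, p in zip(ys, pred)]
        stump = fit_stump(xs, resid)
        stumps.append(stump)
        pred = [p + lr * stump(x) for p, x in zip(pred, xs)]
    return lambda x: sum(lr * s(x) for s in stumps)

xs = [0, 1, 2, 3, 4, 5]
ys = [0, 0, 0, 1, 1, 1]  # a step function: axis-aligned splits fit this easily
f = boost(xs, ys)
print(round(f(0.5), 2), round(f(4.5), 2))
```

The axis-aligned splits are exactly the inductive bias that suits tabular features with sharp, non-smooth interactions — the commonly cited reason boosted trees still beat neural nets on such data.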