There Will Be a Scientific Theory of Deep Learning

Overall reactions to the paper

  • Many find the survey engaging, comprehensive, and particularly value the open-problems section.
  • Others see the title as overconfident “flag planting,” but still consider the work useful as a synthesis and standardization of ideas.
  • Several note a gap between active theory research and public perception that “it’s all just a black box.”

History: why deep learning took off when it did

  • Key inflection points cited: AlexNet (2012) for vision, then attention and transformers for language.
  • Crucial enabling factors: GPUs, much larger curated datasets (e.g., ImageNet), and better software frameworks that made complex models practical.
  • Some argue transformers “could have existed earlier,” but most replies stress that at small scale the results would have been underwhelming, or the models infeasible to train.

Architecture vs scale and data

  • Strong debate:
    • One camp emphasizes “bagillions of parameters” and data as the main driver (the “bitter lesson”).
    • Another stresses architectural inductive biases and optimizer interactions; not all scalable architectures work, and many design choices are the difference between success and failure.
  • Neural nets are compared to “learned kernels” with powerful compositionality; nonparametric and kernel methods hit computational limits (time and memory) at modern data scales.
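The kernel-scaling point above can be made concrete with a back-of-envelope calculation (sizes chosen for illustration, not taken from the discussion): an exact kernel method materializes an n × n Gram matrix, so memory grows quadratically in the number of samples, while a neural net of fixed parameter count costs only linear passes over the data.

```python
# Back-of-envelope sketch: why exact kernel methods hit limits at modern
# data scale. A dense n x n Gram matrix needs O(n^2) memory, and naive
# kernel ridge regression costs O(n^3) time to solve.

def gram_matrix_bytes(n, dtype_bytes=8):
    """Memory for a dense n x n kernel matrix in float64."""
    return n * n * dtype_bytes

for n in (10_000, 1_000_000):
    gib = gram_matrix_bytes(n) / 2**30
    print(f"n={n:>9,d}: Gram matrix ~ {gib:,.1f} GiB")
```

At n = 10,000 the Gram matrix is under a gigabyte; at n = 1,000,000 it is several terabytes, which is why exact kernel methods give way to approximations (or to neural nets) at modern dataset sizes.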

Relation to brains, evolution, and biology

  • Some argue deep learning is quite unlike brains; others see evolution as an end‑to‑end optimization process that pretrains brain structure.
  • There is discussion of local vs global learning rules (predictive coding vs backprop), and whether the brain approximates gradient descent.

Can there be a real “theory of deep learning”?

  • Optimists draw analogies to statistical mechanics, information geometry, and information‑theoretic views (implicit regularization, compression, scaling laws).
  • Skeptics doubt we’ll get physics‑like theories because behavior depends massively on messy data and huge models; they question whether concentration‑of‑measure style simplifications apply.
  • Some raise computability concerns (Rice’s theorem, Turing completeness), but others argue typical feed‑forward nets are not Turing complete, so classical impossibility results may not apply.
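The scaling-law view mentioned above can be sketched concretely: empirical scaling-law work typically fits a power law L(N) = c · N^(−α) to loss-versus-model-size data, which becomes ordinary linear regression in log-log space. The constants below are made up for illustration, not from any real model family.

```python
import math

# Synthetic "loss vs model size" points generated from L(N) = c * N**(-alpha).
# c_true and alpha_true are illustrative values only.
c_true, alpha_true = 10.0, 0.3
sizes = [1e6, 1e7, 1e8, 1e9]
losses = [c_true * n ** (-alpha_true) for n in sizes]

# A power law is linear in log-log space: log L = log c - alpha * log N,
# so least squares on the logs recovers the exponent.
xs = [math.log(n) for n in sizes]
ys = [math.log(loss) for loss in losses]
k = len(xs)
mx, my = sum(xs) / k, sum(ys) / k
slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
alpha_hat = -slope
print(f"recovered exponent alpha ≈ {alpha_hat:.3f}")  # ≈ 0.300
```

On real measurements the fit is noisy and the functional form itself is contested, which is part of why skeptics doubt that clean physics-style laws will hold up.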

Interpretability, hallucination, and failure prediction

  • Several emphasize that theory matters most for predicting failure modes, confidence, and hallucination.
  • Out-of-distribution (OOD) detection is seen as conceptually shaky; alternative approaches based on model misspecification are being explored but remain expensive and niche.
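One widely used baseline for the confidence question above (and one commenters would call conceptually shaky) is maximum softmax probability: treat low-confidence predictions as likely out-of-distribution. A minimal sketch, with made-up logits:

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def msp_score(logits):
    """Max softmax probability: higher is treated as more in-distribution."""
    return max(softmax(logits))

# A confidently peaked prediction vs near-uniform logits (illustrative values).
in_dist = [9.0, 0.5, 0.2]
ood = [1.1, 1.0, 0.9]
print(msp_score(in_dist) > msp_score(ood))  # True under this heuristic
```

The known weakness, which motivates the misspecification-based alternatives mentioned above, is that networks can be highly confident on inputs far from the training distribution, so a high score does not certify in-distribution data.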

Models vs other ML approaches

  • Neural nets dominate unstructured data (images, audio, text), where their inductive biases match data structure.
  • Tree-based methods (often with boosting) remain superior on tabular data due to more suitable inductive biases there.
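To make the boosting point concrete, here is a minimal sketch of gradient boosting with decision stumps under squared loss — the core loop behind libraries like XGBoost and LightGBM, which add far more (regularization, histogram splits, column sampling, etc.). The toy data is illustrative.

```python
# Minimal gradient boosting sketch: fit stumps to residuals, shrink by a
# learning rate, and sum the ensemble. Regression with squared loss.

def fit_stump(xs, residuals):
    """Best single-split stump minimizing squared error on the residuals."""
    best = None
    for threshold in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        if not left or not right:
            continue
        lmean, rmean = sum(left) / len(left), sum(right) / len(right)
        err = (sum((r - lmean) ** 2 for r in left)
               + sum((r - rmean) ** 2 for r in right))
        if best is None or err < best[0]:
            best = (err, threshold, lmean, rmean)
    _, t, lm, rm = best
    return lambda x: lm if x <= t else rm

def boost(xs, ys, rounds=50, lr=0.1):
    """Fit an additive ensemble of stumps to the running residuals."""
    pred = [0.0] * len(xs)
    stumps = []
    for _ in range(rounds):
        resid = [y - p for y, p in zip(ys, pred)]
        stump = fit_stump(xs, resid)
        stumps.append(stump)
        pred = [p + lr * stump(x) for p, x in zip(pred, xs)]
    return lambda x: sum(lr * s(x) for s in stumps)

xs = [0, 1, 2, 3, 4, 5]
ys = [0, 0, 0, 1, 1, 1]  # a step function: axis-aligned splits fit this easily
f = boost(xs, ys)
print(round(f(0.5), 2), round(f(4.5), 2))
```

The axis-aligned splits are exactly the inductive bias that suits tabular features with sharp, non-smooth interactions — the commonly cited reason boosted trees still beat neural nets on such data.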