There Will Be a Scientific Theory of Deep Learning
Overall reactions to the paper
- Many find the survey engaging, comprehensive, and particularly value the open-problems section.
- Others see the title as overconfident “flag planting,” but still consider the work useful as a synthesis and standardization of ideas.
- Several note a gap between active theory research and public perception that “it’s all just a black box.”
History: why deep learning took off when it did
- Key inflection points cited: AlexNet (2012) for vision, then attention and transformers for language.
- Crucial enabling factors: GPUs, much larger curated datasets (e.g., ImageNet), and better software frameworks that made complex models practical.
- Some argue transformers “could have existed earlier,” but most replies stress that, at the small scales then feasible, results would have been underwhelming or training outright infeasible.
Architecture vs scale and data
- Strong debate:
- One camp emphasizes “bagillions of parameters” and data as the main driver (the “bitter lesson”).
- Another stresses architectural inductive biases and optimizer interactions; not all scalable architectures work, and many design choices are the difference between success and failure.
- Neural nets are compared to “learned kernels” with powerful compositionality; classical nonparametric and kernel methods hit computational limits at modern data scales (sketched below).
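A minimal sketch of that scaling wall, taking kernel ridge regression with an RBF kernel as the nonparametric baseline (synthetic data, our illustration rather than anything from the thread): exact fitting needs an n × n Gram matrix and a dense solve, so memory grows as O(n²) and time as O(n³) in the number of training points, while a fixed-width network costs O(n) per epoch.

```python
import numpy as np

# Synthetic regression data (illustrative only).
rng = np.random.default_rng(0)
n, d = 1_000, 10
X = rng.normal(size=(n, d))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=n)

def krr_fit(X, y, gamma=0.1, lam=1e-3):
    """Exact kernel ridge regression with an RBF kernel."""
    sq_norms = (X ** 2).sum(axis=1)
    # n x n Gram matrix: O(n^2) memory is the first bottleneck.
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    K = np.exp(-gamma * sq_dists)
    # Dense linear solve: O(n^3) time is the second bottleneck.
    return np.linalg.solve(K + lam * np.eye(len(X)), y)

alpha = krr_fit(X, y)
print("Gram entries held in memory:", n * n)

# A fixed-width neural net, by contrast, has a parameter count that does
# not grow with n, and one SGD epoch costs O(n) -- the practical sense in
# which "learned kernels" escape the nonparametric scaling wall.
```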
Relation to brains, evolution, and biology
- Some argue deep learning is quite unlike brains; others see evolution as an end‑to‑end optimization process that pretrains brain structure.
- There is discussion of local vs global learning rules (predictive coding vs backprop), and whether the brain approximates gradient descent.
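To make the local-vs-global distinction concrete, here is a minimal numerical sketch (our construction, not from the thread) of predictive coding on a two-layer linear network: activities first relax to minimize a local energy, then each weight update uses only pre-synaptic activity and post-synaptic error, yet it closely tracks the backprop gradient in the small-error regime, in the spirit of Whittington & Bogacz (2017).

```python
import numpy as np

rng = np.random.default_rng(1)
d_in, d_hid, d_out = 4, 6, 3
W1 = rng.normal(scale=0.1, size=(d_hid, d_in))
W2 = rng.normal(scale=0.1, size=(d_out, d_hid))
x0 = rng.normal(size=d_in)                         # input (clamped)
y = W2 @ W1 @ x0 + 0.01 * rng.normal(size=d_out)   # target near the prediction

# --- Backprop gradients for the loss 0.5 * ||y - W2 W1 x0||^2 ---
h = W1 @ x0
e_out = y - W2 @ h
gW2_bp = -np.outer(e_out, h)
gW1_bp = -np.outer(W2.T @ e_out, x0)

# --- Predictive coding: relax the hidden activity x1 to minimize
#     F = 0.5*||x1 - W1 x0||^2 + 0.5*||y - W2 x1||^2 ---
x1 = W1 @ x0                                # start at the feedforward value
for _ in range(200):
    eps1 = x1 - W1 @ x0                     # hidden-layer prediction error
    eps2 = y - W2 @ x1                      # output-layer prediction error
    x1 -= 0.1 * (eps1 - W2.T @ eps2)        # gradient descent on F
eps1, eps2 = x1 - W1 @ x0, y - W2 @ x1      # errors at equilibrium

# LOCAL weight updates: pre-synaptic activity times post-synaptic error,
# with no explicit backward pass through the network.
gW2_pc = -np.outer(eps2, x1)
gW1_pc = -np.outer(eps1, x0)

print("max |gW2_pc - gW2_bp| =", np.abs(gW2_pc - gW2_bp).max())
print("max |gW1_pc - gW1_bp| =", np.abs(gW1_pc - gW1_bp).max())
# The gaps are small here; in general the correspondence is approximate,
# not exact, which is what makes "does the brain do backprop?" a live debate.
```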
Can there be a real “theory of deep learning”?
- Optimists draw analogies to statistical mechanics, information geometry, and information‑theoretic views (implicit regularization, compression, scaling laws; a worked scaling‑law example follows this list).
- Skeptics doubt we’ll get physics‑like theories because behavior depends massively on messy data and huge models; they question whether concentration‑of‑measure style simplifications apply.
- Some raise computability concerns (Rice’s theorem, Turing completeness), but others argue typical feed‑forward nets are not Turing complete, so classical impossibility results may not apply.
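As one concrete instance of the regularities the optimists point to, the empirical neural scaling laws of Kaplan et al. (2020) take a strikingly simple power-law form; the sketch below states only the functional form, since the fitted constants depend on the setup.

```latex
% Empirical scaling laws (Kaplan et al., 2020): when the other resources
% are not the bottleneck, test loss falls as a power law in parameter
% count N, dataset size D, and training compute C:
L(N) \approx \left(\frac{N_c}{N}\right)^{\alpha_N}, \qquad
L(D) \approx \left(\frac{D_c}{D}\right)^{\alpha_D}, \qquad
L(C) \approx \left(\frac{C_c}{C}\right)^{\alpha_C}
% For language models the fitted exponents are small (roughly 0.05-0.1),
% so loss improves steadily but slowly across many orders of magnitude.
```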
Interpretability, hallucination, and failure prediction
- Several emphasize that theory matters most for predicting failure modes, confidence, and hallucination.
- Out‑of‑distribution (OOD) detection is seen as conceptually shaky (see the baseline sketch below); alternative approaches based on model misspecification are being explored but remain expensive and niche.
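One reason OOD detection is called conceptually shaky: the standard baseline simply thresholds the model's own confidence (max softmax probability, Hendrycks & Gimpel, 2017), which measures distance to the decision boundary, not distance to the training distribution. A minimal sketch with hypothetical logits:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)   # shift for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def msp_score(logits):
    """Max softmax probability baseline: higher is read as 'in-distribution'."""
    return softmax(logits).max(axis=-1)

# Hypothetical logits, purely illustrative (not from any real model).
familiar = np.array([[4.0, 0.5, 0.2]])   # in-distribution input
far_ood  = np.array([[6.0, 0.1, 0.0]])   # far-OOD input the net extrapolates on

print("in-dist MSP:", msp_score(familiar))  # ~0.95
print("OOD MSP:    ", msp_score(far_ood))   # ~0.99 -- even higher!

# Networks are routinely MORE confident far from the data, so the score
# conflates confidence with density -- the conceptual gap the thread notes.
```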
Models vs other ML approaches
- Neural nets dominate unstructured data (images, audio, text), where their inductive biases match data structure.
- Tree-based methods (often with boosting) remain superior on tabular data, where their inductive biases are the better fit (a reproduction sketch follows).
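For readers who want to try the tabular comparison themselves, a minimal scikit-learn sketch on synthetic data (an illustration of the typical experimental setup, not evidence; which model wins on any given table varies):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split

# Synthetic "tabular" data: heterogeneous, unordered features with no
# spatial or sequential structure for a neural net's biases to exploit.
X, y = make_classification(n_samples=2_000, n_features=20, n_informative=8,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for model in (GradientBoostingClassifier(random_state=0),
              MLPClassifier(max_iter=1_000, random_state=0)):
    model.fit(X_tr, y_tr)
    print(type(model).__name__, round(model.score(X_te, y_te), 3))
```

Swapping in real tabular benchmarks (and tuning both models) is what the studies behind this claim actually do; the sketch only shows the shape of the comparison.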