Deep Learning Is Not So Mysterious or Different

Competing Explanations for Generalization

  • One thread argues that PAC-Bayes and VC-style hypothesis-space bounds (as in the linked paper) can already explain deep learning’s “benign overfitting”: flexible hypothesis spaces combined with a soft simplicity bias (a bound of this general form is sketched after this list).
  • A dissenting view holds this is insufficient in light of results like Zhang et al., which showed the same network can fit both real and random labels; focusing only on the hypothesis space is therefore too coarse.
  • That camp pushes algorithmic stability and optimization dynamics (especially SGD) as the key: one must explain why training lands in a well-generalizing region among the many zero-training-loss solutions that generalize poorly.
  • Others mention statistical mechanics and loss landscapes as useful lenses; there is disagreement on whether optimizer details are central or historically overstated.
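For reference, one standard relaxed form of the PAC-Bayes bound this line of argument builds on (a McAllester-style statement; the paper’s exact bound may differ): with probability at least 1 − δ over an i.i.d. sample of size n, simultaneously for every posterior Q over hypotheses, with the prior P fixed before seeing the data,

```latex
\mathbb{E}_{h \sim Q}\big[L(h)\big]
\;\le\;
\mathbb{E}_{h \sim Q}\big[\hat{L}(h)\big]
\;+\;
\sqrt{\frac{\mathrm{KL}(Q \,\|\, P) \;+\; \ln\!\frac{2\sqrt{n}}{\delta}}{2n}}
```

Here L is the true risk and L̂ the empirical risk. The KL(Q‖P) term is where the “soft simplicity bias” enters: a very flexible hypothesis space costs nothing so long as the solutions actually reached lie in regions the prior already favors.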

Simplicity Bias and Regularization in Deep Learning

  • Several comments map the paper’s “soft preference for simpler solutions” onto standard regularization (a minimal code sketch follows this list):
    • L1/L2 penalties, dropout (roughly like layerwise L2), AdamW weight decay.
    • Architectural and initialization choices as “soft inductive bias” (e.g., special ViT initialization).
  • Some note rough equivalences: L2 ↔ dimensionality reduction/smoothness; dropout ↔ L2; L1 ↔ soft-thresholding/ReLU-like behavior.
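
To make the mapping concrete, a minimal PyTorch-style sketch (the model and hyperparameters are illustrative, not taken from the paper) showing an explicit L2 penalty, dropout, and AdamW’s decoupled weight decay side by side:

```python
# Minimal sketch: three common ways a "soft preference for simpler solutions"
# shows up in practice. Model and hyperparameters are illustrative only.
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(20, 64),
    nn.ReLU(),
    nn.Dropout(p=0.1),   # dropout: noise-based regularizer, roughly L2-like in effect
    nn.Linear(64, 1),
)

# (1) Explicit L2 penalty added to the task loss (here over all parameters, biases included).
def loss_with_l2(pred, target, lam=1e-4):
    mse = nn.functional.mse_loss(pred, target)
    l2 = sum(p.pow(2).sum() for p in model.parameters())
    return mse + lam * l2

# (2) Decoupled weight decay inside the optimizer (AdamW).
opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)

# One illustrative step on random data.
x, y = torch.randn(32, 20), torch.randn(32, 1)
opt.zero_grad()
loss = loss_with_l2(model(x), y)
loss.backward()
opt.step()
print(float(loss))
```

In practice one would typically pick either the explicit penalty or the optimizer’s weight decay rather than both; they are combined here only to show where each mechanism lives.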

Depth, Architectures, and Inductive Bias

  • Example cited from recent RNN work: shallow minimal RNNs cannot capture long-range ordered dependencies, but deeper (≥3-layer) versions can, highlighting cases where “deep” structure is genuinely necessary.

N‑gram Models vs Modern LLMs

  • A proposed word-distance counting scheme is likened to classic n‑gram/Markov models.
  • Multiple replies: such models scale poorly (combinatorial explosion, sparsity; see the counting sketch after this list) and produce far weaker, often incoherent output than transformers.
  • Attention and learned embeddings are emphasized as key differences enabling generalization beyond seen n‑grams; some point to scaled-up n‑gram research as a partial bridge.
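
A minimal pure-Python sketch of the counting approach under discussion (toy corpus, illustrative only), which also makes the sparsity problem visible: the table of possible n-grams grows as V^n while observed counts cover only a tiny corner of it.

```python
# Toy n-gram counter: build next-word counts from a tiny corpus and sample from them.
# Illustrative only; real corpora leave the V**n count table astronomically sparse.
from collections import Counter, defaultdict
import random

corpus = "the cat sat on the mat the dog sat on the rug".split()
n = 3  # trigram model: predict a word from the previous two

counts = defaultdict(Counter)
for i in range(len(corpus) - n + 1):
    context, nxt = tuple(corpus[i:i + n - 1]), corpus[i + n - 1]
    counts[context][nxt] += 1

vocab = set(corpus)
print(f"V = {len(vocab)}, possible contexts V**(n-1) = {len(vocab) ** (n - 1)}, "
      f"observed contexts = {len(counts)}")

# Generate by sampling from the counts; any unseen context dead-ends immediately,
# which is the sparsity/generalization failure the replies point to.
state = ("the", "cat")
out = list(state)
for _ in range(6):
    options = counts.get(state)
    if not options:
        break
    nxt = random.choices(list(options), weights=list(options.values()))[0]
    out.append(nxt)
    state = (state[1], nxt)
print(" ".join(out))
```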

Is Deep Learning Really a “Black Box”?

  • One thread insists nothing is truly mysterious: every transistor state is determinate.
  • Others counter that “black box” here means “too complex for any human to fully understand in detail,” not “fundamentally unknowable.”
  • Weights are highlighted as opaque artifacts of training: not hand-designed and hard to interpret neuron-by-neuron.
  • Several comments say the real mystery is how information is encoded in the parameters and why performance scales so smoothly (e.g., via near-orthogonal representations and superposition; a small numerical illustration follows).
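
A small numerical illustration of the near-orthogonality point (NumPy; the dimensions are chosen arbitrarily): in a d-dimensional space you can pack many more than d random directions whose pairwise cosine similarities are all close to zero, which is the geometric intuition behind superposition.

```python
# Many more random directions than dimensions are still nearly orthogonal:
# the typical |cosine| between random unit vectors in R^d scales like 1/sqrt(d).
import numpy as np

rng = np.random.default_rng(0)
d, k = 512, 2000                                 # 2000 directions in a 512-dim space
V = rng.standard_normal((k, d))
V /= np.linalg.norm(V, axis=1, keepdims=True)    # normalize to unit vectors

cos = V @ V.T
np.fill_diagonal(cos, 0.0)                       # ignore self-similarity
print("max  |cos|:", np.abs(cos).max())
print("mean |cos|:", np.abs(cos).mean(), " (compare 1/sqrt(d) =", 1 / np.sqrt(d), ")")
```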

Learning Resources and Intuitions

  • Strong enthusiasm for approachable resources: StatQuest (book and videos), Stanford CS109, Caltech “Learning from Data,” and 3Blue1Brown.
  • Some note that understanding the universal approximation theorem and viewing neurons as (generalized) linear models plus nonlinearities helps demystify networks, though emergent behavior is far richer than that slogan suggests (a toy approximation example follows).
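
A toy illustration of the “neuron = (generalized) linear model + nonlinearity” view (NumPy; the target function and widths are arbitrary): a one-hidden-layer network with random hidden weights and a least-squares output layer already approximates a smooth 1-D function fairly well. This captures the spirit of universal approximation, not its proof.

```python
# A single hidden layer of "linear model + nonlinearity" units, fit only at the
# output layer by least squares, approximating sin(x) on [-3, 3].
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(-3, 3, 200)[:, None]          # inputs, shape (200, 1)
y = np.sin(x).ravel()                         # target function

width = 50                                    # number of hidden neurons (arbitrary)
W = rng.normal(scale=2.0, size=(1, width))    # random input weights
b = rng.uniform(-3, 3, size=width)            # random biases
H = np.tanh(x @ W + b)                        # each column: one neuron's activation

coef, *_ = np.linalg.lstsq(H, y, rcond=None)  # fit only the output weights
y_hat = H @ coef

print("max abs error:", np.abs(y_hat - y).max())
```

With fixed random features this is just a generalized linear model; what deep learning adds is learning W and b as well, and stacking such layers.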

Generalization, Data Scale, and Benign Overfitting

  • One view: DNNs are not inherently superior at generalization; on small tabular datasets, classical methods (e.g., SVMs) often outperform them, while deep nets overfit.
  • The same commenter attributes LLM “magic” to the enormous effective sample size of next-token prediction, which lets huge models train without classical overfitting, plus the reusability of learned representations across tasks.
  • Others point to work showing that networks capable of memorizing random labels still generalize well on real data, reinforcing that something nontrivial about the training dynamics or inductive bias is at play (see the sketch after this list).
  • Whether the simplicity bias comes mainly from explicit regularization, SGD’s implicit bias, architecture, or loss landscape remains contested and described as not yet fully understood.
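
A minimal sketch of the kind of experiment being referenced (PyTorch; synthetic data, small MLP, hyperparameters arbitrary): train the same architecture once on real labels and once on shuffled labels. With enough capacity and enough steps it can drive training error down in both cases, but only the real-label run generalizes.

```python
# Zhang et al.-style toy: same MLP, real vs. shuffled labels.
# Synthetic data and hyperparameters are arbitrary; a sketch, not a replication.
import torch
import torch.nn as nn

torch.manual_seed(0)
d, n_train, n_test = 20, 256, 1000
w_true = torch.randn(d)

def make_data(n):
    x = torch.randn(n, d)
    y = (x @ w_true > 0).long()        # simple linearly separable labeling rule
    return x, y

x_tr, y_tr = make_data(n_train)
x_te, y_te = make_data(n_test)

def run(train_labels):
    model = nn.Sequential(nn.Linear(d, 256), nn.ReLU(), nn.Linear(256, 2))
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    for _ in range(2000):               # enough full-batch steps to (nearly) memorize
        opt.zero_grad()
        loss = nn.functional.cross_entropy(model(x_tr), train_labels)
        loss.backward()
        opt.step()
    train_acc = (model(x_tr).argmax(1) == train_labels).float().mean().item()
    test_acc = (model(x_te).argmax(1) == y_te).float().mean().item()
    return train_acc, test_acc

print("real labels     (train, test):", run(y_tr))
print("shuffled labels (train, test):", run(y_tr[torch.randperm(n_train)]))
```

The point drawn from this in the thread: capacity alone cannot explain the gap between the two runs, since the architecture is identical; something about the data and the solution actually found has to do the work.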

Open and Unanswered Questions

  • A question about where the regulatory line for “AI” should be drawn receives no substantive answer.
  • One commenter explicitly calls for a clean ablation study that “turns off” benign overfitting in deep nets to isolate the necessary and sufficient conditions; they note this has not yet been convincingly achieved.