How linear regression works intuitively and how it leads to gradient descent

Geometric and Optimization Intuition

  • Commenters extend the 1D derivative story to higher dimensions: at a stationary point you need the Hessian to distinguish minima, maxima, and saddle points (see the sketch after this list).
  • Several people like reframing regression as a geometric problem (fitting in parameter space, loss surfaces) to build intuition, including for gradient descent.
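
As a concrete illustration of the Hessian test in the first bullet, here is a small NumPy sketch; the function f(x, y) = x² − y² and its Hessian are an assumed textbook example, not something from the thread.

```python
import numpy as np

# Hessian of f(x, y) = x**2 - y**2 at its stationary point (0, 0).
H = np.array([[ 2.0,  0.0],
              [ 0.0, -2.0]])

eigvals = np.linalg.eigvalsh(H)              # eigenvalues, ascending
if np.all(eigvals > 0):
    kind = "local minimum"
elif np.all(eigvals < 0):
    kind = "local maximum"
elif np.any(eigvals > 0) and np.any(eigvals < 0):
    kind = "saddle point"
else:
    kind = "inconclusive (a zero eigenvalue)"

print(eigvals, kind)                         # [-2.  2.] saddle point
```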

Squared vs Absolute Loss and Quantiles

  • Multiple comments stress that least squares predicts the conditional mean, while absolute error predicts the conditional median (illustrated in the sketch after this list).
  • Absolute loss (and more generally quantile regression) is defended as robust to outliers and useful when distributions are skewed or heavy‑tailed (e.g., housing with extreme prices).
  • There’s pushback on the article’s negative tone about absolute loss: it’s “not perfect, but a trade-off.”
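
A small sketch of the mean-vs-median point with made-up, heavy-tailed “prices” (all numbers are invented): the constant prediction that minimizes squared loss lands on the mean, while the one that minimizes absolute loss lands on the median.

```python
import numpy as np

# Skewed, made-up "prices": one extreme value drags the mean far above the median.
prices = np.array([210_000, 230_000, 250_000, 260_000, 275_000, 310_000, 2_400_000])

# Best constant prediction under each loss, via a brute-force grid search.
grid = np.linspace(prices.min(), prices.max(), 200_001)
best_sq  = grid[np.argmin([np.mean((prices - c) ** 2) for c in grid])]
best_abs = grid[np.argmin([np.mean(np.abs(prices - c)) for c in grid])]

print(round(best_sq))    # ~562_143, i.e. prices.mean()
print(round(best_abs))   # ~260_000, i.e. np.median(prices)
```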

Why Squared Error? Gaussian Noise vs Convenience

  • One camp argues squared error is mainly popular because it yields an analytic solution (OLS) and has a long historical/statistical tooling legacy.
  • Another camp counters that the real justification is statistical: least squares is the maximum-likelihood estimator under Gaussian noise and is BLUE by the Gauss–Markov theorem, with the central limit theorem explaining why roughly Gaussian errors are so common (see the derivation after this list).
  • Some note that normality isn’t required for OLS to estimate conditional means or be “best linear unbiased”; non-normality is often less of a concern than misspecification (e.g., missing predictors like location).
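
For reference, the standard derivation behind the Gaussian-noise argument (the usual textbook identity, not a quote from the thread): with i.i.d. Gaussian errors, maximizing the likelihood is exactly minimizing the sum of squared residuals.

```latex
% Linear model with i.i.d. Gaussian noise
\[
  y_i = x_i^\top \beta + \varepsilon_i,
  \qquad \varepsilon_i \sim \mathcal{N}(0, \sigma^2)
\]
% Log-likelihood: the only term depending on \beta is the sum of squared residuals
\[
  \log L(\beta)
    = -\tfrac{n}{2}\log\!\bigl(2\pi\sigma^2\bigr)
      - \frac{1}{2\sigma^2}\sum_{i=1}^{n}\bigl(y_i - x_i^\top \beta\bigr)^2
\]
% Hence the maximum-likelihood estimate coincides with ordinary least squares
\[
  \hat\beta_{\mathrm{MLE}}
    = \arg\max_\beta \log L(\beta)
    = \arg\min_\beta \sum_{i=1}^{n}\bigl(y_i - x_i^\top \beta\bigr)^2
    = \hat\beta_{\mathrm{OLS}}
\]
```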

Gradient Descent vs Closed-Form OLS

  • Debate over whether OLS is a good example for introducing gradient descent, given that it has a closed-form solution (both routes are compared in the sketch after this list).
  • Defenders say GD/SGD becomes preferable at large scale, in streaming/distributed settings, or with very high-dimensional data; critics suggest other numerical methods or randomized linear algebra often have nicer convergence.
  • SGD’s “implicit regularization” is mentioned as a reason many practitioners favor stochastic methods, even for convex problems.
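
To ground the debate, a toy comparison on synthetic data (all names and values are invented): the closed-form normal-equation solution and plain batch gradient descent on the mean squared error recover essentially the same coefficients.

```python
import numpy as np

# Synthetic regression problem: intercept plus three features.
rng = np.random.default_rng(0)
n, d = 200, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, d))])
true_beta = np.array([1.0, 2.0, -3.0, 0.5])
y = X @ true_beta + rng.normal(scale=0.1, size=n)

# Closed form: beta = (X^T X)^{-1} X^T y, solved via least squares for stability.
beta_closed, *_ = np.linalg.lstsq(X, y, rcond=None)

# Batch gradient descent on the mean squared error.
beta_gd = np.zeros(d + 1)
lr = 0.1
for _ in range(5_000):
    grad = (2.0 / n) * X.T @ (X @ beta_gd - y)   # gradient of mean((X b - y)^2)
    beta_gd -= lr * grad

print(np.round(beta_closed, 3))
print(np.round(beta_gd, 3))        # agrees with the closed form to ~3 decimals
```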

Statistics vs ML Culture and Practice

  • Several contrasts are drawn: statistics as modeling and inference under uncertainty vs ML as black-box prediction, plus concern that ML intros often “butcher” OLS by ignoring assumptions, residual diagnostics, and interpretation.
  • Multicollinearity matters a lot in explanatory/statistical contexts but is often ignored when the goal is pure prediction (see the sketch after this list).
  • One maxim: applied statistics is about making decisions under uncertainty, not manufacturing certainty from data.
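
The multicollinearity point can be seen in a synthetic example (everything below is invented for illustration): with two nearly identical predictors, the individual coefficients swing from fit to fit, while the predictions barely move.

```python
import numpy as np

# Two almost perfectly collinear predictors; only their sum is well identified.
rng = np.random.default_rng(1)
n = 100
x1 = rng.normal(size=n)
x2 = x1 + 0.01 * rng.normal(size=n)
y = 3.0 * x1 + rng.normal(scale=0.5, size=n)

X_test = np.array([[1.0, -1.0, -1.0],   # fixed query points: [intercept, x1, x2]
                   [1.0,  0.0,  0.0],
                   [1.0,  1.0,  1.0]])

for seed in range(3):                    # refit on three bootstrap resamples
    idx = np.random.default_rng(seed).integers(0, n, n)
    X = np.column_stack([np.ones(n), x1[idx], x2[idx]])
    beta, *_ = np.linalg.lstsq(X, y[idx], rcond=None)
    print("coefficients:", np.round(beta, 2),
          "predictions:", np.round(X_test @ beta, 2))
# The x1/x2 coefficients jump around between fits; the predictions stay ~[-3, 0, 3].
```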

Model Fit, Transformations, and Alternatives

  • The house-price example is criticized as heteroskedastic (the error variance is not constant across observations); suggested fixes include transforming the response (e.g., to log scale) or using weighted/iteratively reweighted least squares rather than assuming constant variance (both are sketched after this list).
  • Discussion distinguishes response transforms from “kernel tricks” (implicit high-dimensional feature maps) and more general feature engineering.
  • Multiple linear regression, regularization (ridge, LASSO, elastic net), GLMs, and Deming regression are brought up as important but underemphasized extensions.
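
A sketch of the two fixes mentioned for the heteroskedastic house-price case; the data-generating process, the 1/size² variance model, and every number below are assumptions made for illustration.

```python
import numpy as np

# Made-up "house" data with multiplicative noise: spread grows with size.
rng = np.random.default_rng(2)
n = 500
size = rng.uniform(50, 300, n)                                   # square meters
price = 2_000 * size * np.exp(rng.normal(scale=0.2, size=n))

# Fix 1: model log(price) on log(size); multiplicative noise becomes additive
# with roughly constant variance.
X_log = np.column_stack([np.ones(n), np.log(size)])
beta_log, *_ = np.linalg.lstsq(X_log, np.log(price), rcond=None)

# Fix 2: weighted least squares on the raw scale, down-weighting the noisier
# large houses (weights 1/size^2 come from the assumed variance model).
X_lin = np.column_stack([np.ones(n), size])
w = 1.0 / size ** 2
Xw = X_lin * w[:, None]
beta_wls = np.linalg.solve(X_lin.T @ Xw, X_lin.T @ (w * price))

print(np.round(beta_log, 2))   # roughly [log(2000) ~= 7.6, 1.0] on the log scale
print(np.round(beta_wls, 1))   # [intercept, slope]; slope lands near 2_000 * exp(0.02)
```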

Interpretability, “Bitter Lesson,” and Intuition

  • The thread connects simple linear methods to modern deep learning: deep nets are described as “piles of linear algebra plus ReLUs” built on the same machinery as regression, and scaling plus data can trump hand-crafted structure.
  • Some find this “bitter,” worrying about powerful but causally opaque systems (e.g., self-driving cars) whose correct and incorrect behavior may both be inexplicable. Others prioritize probabilistic behavior bounds and safety over human-understandable “why.”
  • Intuitive teaching tools are highlighted: interactive visualizations, explorable explanations, spring-based physical analogies, and careful focus on data-generating assumptions rather than just optimization.