How linear regression works intuitively and how it leads to gradient descent
Geometric and Optimization Intuition
- Commenters extend the 1D derivative story to higher dimensions: stationary points require looking at the Hessian to distinguish minima, maxima, and saddle points (a short derivation follows this list).
- Several people like reframing regression as a geometric problem (fitting in parameter space, loss surfaces) to build intuition, including for gradient descent.
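To make the Hessian point concrete for regression itself (standard material, not quoted from any single comment): for the least-squares loss the Hessian is the same everywhere and positive semidefinite, so the loss surface is convex.

```latex
\[
L(\beta) = \lVert y - X\beta \rVert^{2}, \qquad
\nabla L(\beta) = -2\,X^{\top}(y - X\beta), \qquad
\nabla^{2} L(\beta) = 2\,X^{\top}X \succeq 0 .
\]
```

Because the Hessian is positive semidefinite everywhere, any stationary point of this loss is a global minimum; in general one inspects the Hessian's eigenvalues at a stationary point (all positive: minimum; all negative: maximum; mixed signs: saddle).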
Squared vs Absolute Loss and Quantiles
- Multiple comments stress that least squares predicts the conditional mean, while absolute error predicts the conditional median (a numerical check follows this list).
- Absolute loss (and more generally quantile regression) is defended as robust to outliers and useful when distributions are skewed or heavy‑tailed (e.g., housing with extreme prices).
- There’s pushback on the article’s negative tone about absolute loss: it’s “not perfect, but a trade-off.”
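A minimal numerical check of the mean-vs-median point, assuming only numpy; the skewed "prices" sample is invented for illustration and is not from the article:

```python
import numpy as np

# Skewed toy sample: mostly moderate values plus a few extreme ones.
rng = np.random.default_rng(0)
prices = np.concatenate([rng.normal(300, 50, 96), rng.normal(2000, 200, 5)])

# Evaluate a constant prediction c over a grid and see which c minimizes each loss.
grid = np.linspace(prices.min(), prices.max(), 20001)
sq_loss = ((prices[:, None] - grid) ** 2).mean(axis=0)    # mean squared error
abs_loss = np.abs(prices[:, None] - grid).mean(axis=0)    # mean absolute error

print("argmin of squared loss :", grid[sq_loss.argmin()], " sample mean  :", prices.mean())
print("argmin of absolute loss:", grid[abs_loss.argmin()], " sample median:", np.median(prices))
```

The squared-loss minimizer lands (up to grid resolution) on the sample mean, which the extreme values drag upward; the absolute-loss minimizer lands on the median, which barely moves.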
Why Squared Error? Gaussian Noise vs Convenience
- One camp argues squared error is mainly popular because it yields an analytic solution (OLS) and has a long historical/statistical tooling legacy.
- Another camp counters that the real justification is statistical: squared error yields the maximum-likelihood estimator under Gaussian noise and has BLUE properties, with the central limit theorem explaining why roughly Gaussian errors are common (a compact derivation follows this list).
- Some note that normality isn’t required for OLS to estimate conditional means or be “best linear unbiased”; non-normality is often less of a concern than misspecification (e.g., missing predictors like location).
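The Gaussian-MLE argument in compact form (a standard derivation, added here for reference): assume $y_i = x_i^{\top}\beta + \varepsilon_i$ with i.i.d. $\varepsilon_i \sim \mathcal{N}(0,\sigma^{2})$. Then

```latex
\[
\log p(y \mid X, \beta, \sigma^{2})
  = -\frac{n}{2}\log\!\bigl(2\pi\sigma^{2}\bigr)
    - \frac{1}{2\sigma^{2}} \sum_{i=1}^{n} \bigl(y_i - x_i^{\top}\beta\bigr)^{2},
\]
\[
\hat{\beta}_{\mathrm{MLE}}
  = \arg\max_{\beta}\, \log p(y \mid X, \beta, \sigma^{2})
  = \arg\min_{\beta} \sum_{i=1}^{n} \bigl(y_i - x_i^{\top}\beta\bigr)^{2}.
\]
```

Maximizing the Gaussian likelihood over β is exactly minimizing the sum of squared residuals; the BLUE property, by contrast, follows from the Gauss–Markov conditions (uncorrelated, equal-variance errors) and does not require normality, which is the point made in the preceding bullet.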
Gradient Descent vs Closed-Form OLS
- Debate over whether OLS is a good example for introducing gradient descent, given that it has a closed-form solution (a side-by-side sketch follows this list).
- Defenders say GD/SGD becomes preferable at large scale, in streaming/distributed settings, or with very high-dimensional data; critics suggest other numerical methods or randomized linear algebra often have nicer convergence.
- SGD’s “implicit regularization” is mentioned as a reason many practitioners favor stochastic methods, even for convex problems.
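For concreteness, a sketch comparing a closed-form least-squares solve with plain full-batch gradient descent on a small synthetic problem; the data, step size, and iteration count are arbitrary choices for illustration, not anything proposed in the thread:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 500, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, d))])  # intercept + 3 features
beta_true = np.array([2.0, -1.0, 0.5, 3.0])
y = X @ beta_true + rng.normal(scale=0.1, size=n)

# Closed form: a direct least-squares solve (equivalent to the normal equations, but more stable).
beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)

# Plain full-batch gradient descent on the mean squared error.
beta_gd = np.zeros(X.shape[1])
lr = 0.1
for _ in range(5000):
    grad = (2.0 / n) * X.T @ (X @ beta_gd - y)  # gradient of (1/n) * ||y - X beta||^2
    beta_gd -= lr * grad

print("closed form     :", np.round(beta_ols, 3))
print("gradient descent:", np.round(beta_gd, 3))
```

On a problem this small the closed form is clearly simpler; the scale/streaming arguments above concern settings where a direct solve over all the data at once is impractical.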
Statistics vs ML Culture and Practice
- Several contrasts: statistics as modeling and inference under uncertainty vs ML as black-box prediction; concern that ML intros often “butcher” OLS by ignoring assumptions, residuals, and interpretation.
- Multicollinearity matters a lot in explanatory/statistical contexts but is often ignored when the goal is pure prediction (a toy illustration follows this list).
- One maxim: applied statistics is about making decisions under uncertainty, not manufacturing certainty from data.
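A toy illustration of the multicollinearity point, using synthetic data invented for this note: with two nearly identical predictors, individual coefficients swing wildly from one noise draw to the next, while predictive error barely changes.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.01, size=n)        # nearly a copy of x1
X = np.column_stack([x1, x2])

# Fit the same model on two datasets that differ only in the noise draw.
for _ in range(2):
    y = 3 * x1 + rng.normal(scale=0.5, size=n)  # only x1 actually drives y
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    rmse = np.sqrt(np.mean((X @ coef - y) ** 2))
    print("coefficients:", np.round(coef, 2), "| in-sample RMSE:", round(rmse, 3))
```

The two fits typically report quite different coefficient pairs (only their sum is stable), while the in-sample RMSE stays near the noise level of 0.5 in both cases, which is why collinearity is a problem for explanation but often tolerable for prediction.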
Model Fit, Transformations, and Alternatives
- The house-price example is criticized as heteroskedastic; suggestions include transforming the response (e.g., to the log scale) or using weighted/iteratively reweighted least squares rather than assuming constant variance (a small demonstration follows this list).
- Discussion distinguishes response transforms from “kernel tricks” (implicit high-dimensional feature maps) and more general feature engineering.
- Multiple linear regression, regularization (ridge, LASSO, elastic net), GLMs, and Deming regression are brought up as important but underemphasized extensions.
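A small demonstration of the log-transform suggestion on invented house-price data with multiplicative noise (none of the numbers come from the article or the thread):

```python
import numpy as np

rng = np.random.default_rng(5)
sqft = rng.uniform(500, 5000, 300)
# Multiplicative noise: larger houses get larger absolute price errors (heteroskedastic).
price = 150 * sqft * np.exp(rng.normal(scale=0.2, size=sqft.size))

def fit_line(x, y):
    """Ordinary least squares for y ~ intercept + slope * x."""
    X = np.column_stack([np.ones_like(x), x])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

# Raw scale: residual spread grows with house size.
b0, b1 = fit_line(sqft, price)
resid_raw = price - (b0 + b1 * sqft)

# Log scale: multiplicative noise becomes roughly constant-variance additive noise.
c0, c1 = fit_line(np.log(sqft), np.log(price))
resid_log = np.log(price) - (c0 + c1 * np.log(sqft))

for name, resid in [("raw scale", resid_raw), ("log scale", resid_log)]:
    small, large = resid[sqft < 1500], resid[sqft > 3500]
    print(f"{name}: residual std for small houses {small.std():.3g}, for large houses {large.std():.3g}")
```

On the raw scale the residual spread for large houses is several times that for small houses; after taking logs of both price and size the spread is roughly constant, which is the constant-variance assumption OLS wants. Weighted least squares is the alternative that keeps the raw scale and instead downweights the noisier observations.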
Interpretability, “Bitter Lesson,” and Intuition
- The thread connects simple linear methods to modern deep learning: networks are described as “piles of linear algebra plus ReLUs” built from the same ingredients as regression, and scaling plus data can trump hand-crafted structure.
- Some find this “bitter,” worrying about powerful but causally opaque systems (e.g., self-driving cars) whose correct and incorrect behavior may both be inexplicable. Others prioritize probabilistic behavior bounds and safety over human-understandable “why.”
- Intuitive teaching tools are highlighted: interactive visualizations, explorable explanations, spring-based physical analogies, and careful focus on data-generating assumptions rather than just optimization.