How linear regression works intuitively and how it leads to gradient descent
Geometric and Optimization Intuition
- Commenters extend the 1D derivative story to higher dimensions: stationary points require looking at the Hessian to distinguish minima, maxima, and saddle points (a short derivation follows this list).
- Several people like reframing regression as a geometric problem (fitting in parameter space, loss surfaces) to build intuition, including for gradient descent.
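To make the Hessian point concrete for regression itself (standard material, not quoted from any single comment): for the least-squares loss the Hessian is the same everywhere and positive semidefinite, so the loss surface is convex.

```latex
\[
L(\beta) = \lVert y - X\beta \rVert^{2}, \qquad
\nabla L(\beta) = -2\,X^{\top}(y - X\beta), \qquad
\nabla^{2} L(\beta) = 2\,X^{\top}X \succeq 0 .
\]
```

Because the Hessian is positive semidefinite everywhere, any stationary point of this loss is a global minimum; in general one inspects the Hessian's eigenvalues at a stationary point (all positive: minimum; all negative: maximum; mixed signs: saddle).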
Squared vs Absolute Loss and Quantiles
- Multiple comments stress that least squares predicts the conditional mean, while absolute error predicts the conditional median (a numerical check follows this list).
- Absolute loss (and more generally quantile regression) is defended as robust to outliers and useful when distributions are skewed or heavy‑tailed (e.g., housing with extreme prices).
- There’s pushback on the article’s negative tone about absolute loss: it’s “not perfect, but a trade-off.”
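A minimal numerical check of the mean-vs-median point, assuming only numpy; the skewed "prices" sample is invented for illustration and is not from the article:

```python
import numpy as np

# Skewed toy sample: mostly moderate values plus a few extreme ones.
rng = np.random.default_rng(0)
prices = np.concatenate([rng.normal(300, 50, 96), rng.normal(2000, 200, 5)])

# Evaluate a constant prediction c over a grid and see which c minimizes each loss.
grid = np.linspace(prices.min(), prices.max(), 20001)
sq_loss = ((prices[:, None] - grid) ** 2).mean(axis=0)    # mean squared error
abs_loss = np.abs(prices[:, None] - grid).mean(axis=0)    # mean absolute error

print("argmin of squared loss :", grid[sq_loss.argmin()], " sample mean  :", prices.mean())
print("argmin of absolute loss:", grid[abs_loss.argmin()], " sample median:", np.median(prices))
```

The squared-loss minimizer lands (up to grid resolution) on the sample mean, which the extreme values drag upward; the absolute-loss minimizer lands on the median, which barely moves.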
Why Squared Error? Gaussian Noise vs Convenience
- One camp argues squared error is mainly popular because it yields an analytic solution (OLS) and has a long historical/statistical tooling legacy.
- Another camp counters that the real justification is statistical: squared error yields the maximum-likelihood estimator under Gaussian noise and has BLUE properties, with the central limit theorem explaining why roughly Gaussian errors are common (a compact derivation follows this list).
- Some note that normality isn’t required for OLS to estimate conditional means or be “best linear unbiased”; non-normality is often less of a concern than misspecification (e.g., missing predictors like location).
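The Gaussian-MLE argument in compact form (a standard derivation, added here for reference): assume $y_i = x_i^{\top}\beta + \varepsilon_i$ with i.i.d. $\varepsilon_i \sim \mathcal{N}(0,\sigma^{2})$. Then

```latex
\[
\log p(y \mid X, \beta, \sigma^{2})
  = -\frac{n}{2}\log\!\bigl(2\pi\sigma^{2}\bigr)
    - \frac{1}{2\sigma^{2}} \sum_{i=1}^{n} \bigl(y_i - x_i^{\top}\beta\bigr)^{2},
\]
\[
\hat{\beta}_{\mathrm{MLE}}
  = \arg\max_{\beta}\, \log p(y \mid X, \beta, \sigma^{2})
  = \arg\min_{\beta} \sum_{i=1}^{n} \bigl(y_i - x_i^{\top}\beta\bigr)^{2}.
\]
```

Maximizing the Gaussian likelihood over β is exactly minimizing the sum of squared residuals; the BLUE property, by contrast, follows from the Gauss–Markov conditions (uncorrelated, equal-variance errors) and does not require normality, which is the point made in the preceding bullet.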
Gradient Descent vs Closed-Form OLS
- Debate over whether OLS is a good example for introducing gradient descent, given that it has a closed-form solution (a side-by-side sketch follows this list).
- Defenders say GD/SGD becomes preferable at large scale, in streaming/distributed settings, or with very high-dimensional data; critics suggest other numerical methods or randomized linear algebra often have nicer convergence.
- SGD’s “implicit regularization” is mentioned as a reason many practitioners favor stochastic methods, even for convex problems.
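For concreteness, a sketch comparing a closed-form least-squares solve with plain full-batch gradient descent on a small synthetic problem; the data, step size, and iteration count are arbitrary choices for illustration, not anything proposed in the thread:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 500, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, d))])  # intercept + 3 features
beta_true = np.array([2.0, -1.0, 0.5, 3.0])
y = X @ beta_true + rng.normal(scale=0.1, size=n)

# Closed form: a direct least-squares solve (equivalent to the normal equations, but more stable).
beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)

# Plain full-batch gradient descent on the mean squared error.
beta_gd = np.zeros(X.shape[1])
lr = 0.1
for _ in range(5000):
    grad = (2.0 / n) * X.T @ (X @ beta_gd - y)  # gradient of (1/n) * ||y - X beta||^2
    beta_gd -= lr * grad

print("closed form     :", np.round(beta_ols, 3))
print("gradient descent:", np.round(beta_gd, 3))
```

On a problem this small the closed form is clearly simpler; the scale/streaming arguments above concern settings where a direct solve over all the data at once is impractical.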
Statistics vs ML Culture and Practice
- Several contrasts: statistics as modeling and inference under uncertainty vs ML as black-box prediction; concern that ML intros often “butcher” OLS by ignoring assumptions, residuals, and interpretation.
- Multicollinearity matters a lot in explanatory/statistical contexts but is often ignored when the goal is pure prediction (a toy illustration follows this list).
- One maxim: applied statistics is about making decisions under uncertainty, not manufacturing certainty from data.
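A toy illustration of the multicollinearity point, using synthetic data invented for this note: with two nearly identical predictors, individual coefficients swing wildly from one noise draw to the next, while predictive error barely changes.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.01, size=n)        # nearly a copy of x1
X = np.column_stack([x1, x2])

# Fit the same model on two datasets that differ only in the noise draw.
for _ in range(2):
    y = 3 * x1 + rng.normal(scale=0.5, size=n)  # only x1 actually drives y
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    rmse = np.sqrt(np.mean((X @ coef - y) ** 2))
    print("coefficients:", np.round(coef, 2), "| in-sample RMSE:", round(rmse, 3))
```

The two fits typically report quite different coefficient pairs (only their sum is stable), while the in-sample RMSE stays near the noise level of 0.5 in both cases, which is why collinearity is a problem for explanation but often tolerable for prediction.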
Model Fit, Transformations, and Alternatives
- The house-price example is criticized as heteroskedastic; suggestions include transforming the response (e.g., to the log scale) or using weighted/iteratively reweighted least squares rather than assuming constant variance (a small demonstration follows this list).
- Discussion distinguishes response transforms from “kernel tricks” (implicit high-dimensional feature maps) and more general feature engineering.
- Multiple linear regression, regularization (ridge, LASSO, elastic net), GLMs, and Deming regression are brought up as important but underemphasized extensions.
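A small demonstration of the log-transform suggestion on invented house-price data with multiplicative noise (none of the numbers come from the article or the thread):

```python
import numpy as np

rng = np.random.default_rng(5)
sqft = rng.uniform(500, 5000, 300)
# Multiplicative noise: larger houses get larger absolute price errors (heteroskedastic).
price = 150 * sqft * np.exp(rng.normal(scale=0.2, size=sqft.size))

def fit_line(x, y):
    """Ordinary least squares for y ~ intercept + slope * x."""
    X = np.column_stack([np.ones_like(x), x])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

# Raw scale: residual spread grows with house size.
b0, b1 = fit_line(sqft, price)
resid_raw = price - (b0 + b1 * sqft)

# Log scale: multiplicative noise becomes roughly constant-variance additive noise.
c0, c1 = fit_line(np.log(sqft), np.log(price))
resid_log = np.log(price) - (c0 + c1 * np.log(sqft))

for name, resid in [("raw scale", resid_raw), ("log scale", resid_log)]:
    small, large = resid[sqft < 1500], resid[sqft > 3500]
    print(f"{name}: residual std for small houses {small.std():.3g}, for large houses {large.std():.3g}")
```

On the raw scale the residual spread for large houses is several times that for small houses; after taking logs of both price and size the spread is roughly constant, which is the constant-variance assumption OLS wants. Weighted least squares is the alternative that keeps the raw scale and instead downweights the noisier observations.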
Interpretability, “Bitter Lesson,” and Intuition
- The thread connects simple linear methods to modern deep learning: networks are described as “piles of linear algebra plus ReLUs” built from the same ingredients as regression, and scaling plus data can trump hand-crafted structure.
- Some find this “bitter,” worrying about powerful but causally opaque systems (e.g., self-driving cars) whose correct and incorrect behavior may both be inexplicable. Others prioritize probabilistic behavior bounds and safety over human-understandable “why.”
- Intuitive teaching tools are highlighted: interactive visualizations, explorable explanations, spring-based physical analogies, and careful focus on data-generating assumptions rather than just optimization.