Softmax forever, or why I like softmax
Critique of the post’s treatment of the “Distance Logits” paper
- Some commenters argue that dismissing a paper after spotting a single questionable hyperparameter setting is understandable as a decision not to engage further, but it is not a license for a sloppy critique.
- A key technical objection: the post's argument assumes the logits satisfy ( |a_k| \approx 0 ) at initialization, but in the referenced paper the ( a_k ) are distances between vectors, which are unlikely to be near zero; the gradient issues near zero may therefore be overstated.
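A minimal sketch supporting the objection (the dimension, vector count, and Gaussian initialization here are illustrative assumptions, not taken from the paper): pairwise Euclidean distances between randomly initialized vectors concentrate well away from zero, so an analysis of gradients at ( |a_k| \approx 0 ) may not describe what happens at initialization.

```python
# Illustrative check: distances between random Gaussian vectors
# (hypothetical init: dim=128, scale 1/sqrt(dim)) are far from zero.
import math
import random

random.seed(0)
dim = 128

def rand_vec(d):
    # Scale so each vector has norm roughly 1.
    return [random.gauss(0.0, 1.0 / math.sqrt(d)) for _ in range(d)]

def dist(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

xs = [rand_vec(dim) for _ in range(32)]
ds = [dist(xs[i], xs[j])
      for i in range(len(xs)) for j in range(i + 1, len(xs))]

# Distances cluster around sqrt(2) for independent near-unit vectors.
print(min(ds), sum(ds) / len(ds))
```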
Naming and mathematical framing of softmax
- Several argue that log-sum-exp is the true “soft maximum” and should have been called softmax; the current “softmax” is really the gradient of log-sum-exp and might better be called “softargmax” or “grad softmax.”
Statistical mechanics, maximum entropy, and softmax
- One line of discussion defends softmax via its Boltzmann-distribution roots: exponentials arise from counting microstates and maximizing entropy under constraints.
- Others note that in ML, the interpretation of “energy,” fixed average energy, and temperature is often loosely applied or ignored, so the physical analogy is more motivational than fundamental.
- There’s skepticism about the maximum entropy principle itself and whether it is uniquely justified or “natural.”
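The Boltzmann framing in the bullets above can be sketched in a few lines (an illustration of ( p_k \propto e^{-E_k/T} ), not a derivation; the energy values are made up): low temperature concentrates mass on the minimum-energy state, high temperature approaches the uniform, maximum-entropy limit.

```python
# Boltzmann distribution over discrete "energies" at temperature T.
import math

def boltzmann(energies, T):
    logits = [-e / T for e in energies]
    m = max(logits)  # shift for numerical stability
    w = [math.exp(l - m) for l in logits]
    s = sum(w)
    return [v / s for v in w]

E = [0.0, 1.0, 3.0]  # hypothetical energy levels
for T in (0.1, 1.0, 100.0):
    # cold -> near one-hot on the lowest energy; hot -> near uniform
    print(T, [round(p, 3) for p in boltzmann(E, T)])
```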
Alternative parameterizations of categorical distributions
- Commenters stress that softmax is just one parametrization; other mappings from reals to categorical distributions can work.
- A Bayesian/Dirichlet example is given: “add-one” (or more generally add-α) updating yields normalized probabilities with all outcomes nonzero, differing qualitatively from softmax’s tendency to push probabilities close to 0 or 1.
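The qualitative contrast in the add-α bullet can be made concrete (the counts and α=1 are illustrative; "add-alpha" here is standard Laplace/Dirichlet smoothing): the Dirichlet-style mapping keeps every outcome bounded away from zero, while softmax applied to the same numbers piles nearly all mass onto the maximum.

```python
# Two mappings from nonnegative counts to a categorical distribution.
import math

def add_alpha(counts, alpha=1.0):
    # Dirichlet / Laplace smoothing: (c_k + alpha) / (sum + K*alpha)
    total = sum(counts) + alpha * len(counts)
    return [(c + alpha) / total for c in counts]

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

counts = [10, 0, 2]
print([round(p, 3) for p in add_alpha(counts)])  # every outcome stays nonzero
print([round(p, 3) for p in softmax(counts)])    # mass piles onto the max
```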
Reception of the explanation and usefulness of softmax
- Some find the post’s explanation intuitive and helpful for understanding why naive mappings from logits to probabilities are hard for networks to learn.
- Others feel the author overcomplicates things or is trying to show off.
- Practitioners note softmax’s practical utility for turning arbitrary real-valued scores (including negatives) into a clean probability distribution, and point to related work on classifiers as energy-based models.
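The practical point about arbitrary real-valued scores can be checked directly (a general property of softmax, not specific to any comment): negative or mixed inputs still yield a valid distribution, and adding a constant to every score leaves the output unchanged, which is why unnormalized logits suffice.

```python
# Softmax handles negative scores and is invariant to a common shift.
import math

def softmax(z):
    m = max(z)  # subtracting the max also prevents exp overflow
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

p = softmax([-3.2, 0.0, 1.7])
q = softmax([-3.2 + 100.0, 0.0 + 100.0, 1.7 + 100.0])  # shifted scores
print([round(v, 4) for v in p])  # valid probabilities, all positive
print([round(v, 4) for v in q])  # identical to p
```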
Strong reaction to all-lowercase style and writing norms
- A large subthread criticizes the author’s refusal to capitalize as distracting, “cognitively disruptive,” and unprofessional for a technical essay.
- Defenders frame lowercase as a generational/medium-specific norm originating in IM/SMS/Twitter, or as a legitimate stylistic choice on a personal blog.
- Broader debate ensues about language evolution, reader vs writer optimization, autocapitalization on phones, and whether lowercase in serious writing is a passing fad or the future of informal text.