Softmax forever, or why I like softmax
Critique of the post’s treatment of the “Distance Logits” paper
- Some commenters argue that dismissing a paper after spotting a single questionable hyperparameter setting is understandable as a decision not to engage further, but it is not a license for a sloppy critique.
- A key technical objection: the post's argument assumes the logits satisfy ( |a_k| \approx 0 ) at initialization, but in the referenced paper the ( a_k ) are distances between vectors, which are unlikely to be near zero; the gradient issues near zero may therefore be overstated.
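A minimal sketch supporting the objection (the dimension, vector count, and Gaussian initialization here are illustrative assumptions, not taken from the paper): pairwise Euclidean distances between randomly initialized vectors concentrate well away from zero, so an analysis of gradients at ( |a_k| \approx 0 ) may not describe what happens at initialization.

```python
# Illustrative check: distances between random Gaussian vectors
# (hypothetical init: dim=128, scale 1/sqrt(dim)) are far from zero.
import math
import random

random.seed(0)
dim = 128

def rand_vec(d):
    # Scale so each vector has norm roughly 1.
    return [random.gauss(0.0, 1.0 / math.sqrt(d)) for _ in range(d)]

def dist(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

xs = [rand_vec(dim) for _ in range(32)]
ds = [dist(xs[i], xs[j])
      for i in range(len(xs)) for j in range(i + 1, len(xs))]

# Distances cluster around sqrt(2) for independent near-unit vectors.
print(min(ds), sum(ds) / len(ds))
```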
Naming and mathematical framing of softmax
- Several argue that log-sum-exp is the true “soft maximum” and should have been called softmax; the current “softmax” is really the gradient of log-sum-exp and might better be called “softargmax” or “grad softmax.”
Statistical mechanics, maximum entropy, and softmax
- One line of discussion defends softmax via its Boltzmann-distribution roots: exponentials arise from counting microstates and maximizing entropy under constraints.
- Others note that in ML, the interpretation of “energy,” fixed average energy, and temperature is often loosely applied or ignored, so the physical analogy is more motivational than fundamental.
- There’s skepticism about the maximum entropy principle itself and whether it is uniquely justified or “natural.”
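The Boltzmann framing in the bullets above can be sketched in a few lines (an illustration of ( p_k \propto e^{-E_k/T} ), not a derivation; the energy values are made up): low temperature concentrates mass on the minimum-energy state, high temperature approaches the uniform, maximum-entropy limit.

```python
# Boltzmann distribution over discrete "energies" at temperature T.
import math

def boltzmann(energies, T):
    logits = [-e / T for e in energies]
    m = max(logits)  # shift for numerical stability
    w = [math.exp(l - m) for l in logits]
    s = sum(w)
    return [v / s for v in w]

E = [0.0, 1.0, 3.0]  # hypothetical energy levels
for T in (0.1, 1.0, 100.0):
    # cold -> near one-hot on the lowest energy; hot -> near uniform
    print(T, [round(p, 3) for p in boltzmann(E, T)])
```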
Alternative parameterizations of categorical distributions
- Commenters stress that softmax is just one parametrization; other mappings from reals to categorical distributions can work.
- A Bayesian/Dirichlet example is given: “add-one” (or more generally add-α) updating yields normalized probabilities with all outcomes nonzero, differing qualitatively from softmax’s tendency to push probabilities close to 0 or 1.
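The qualitative contrast in the add-α bullet can be made concrete (the counts and α=1 are illustrative; "add-alpha" here is standard Laplace/Dirichlet smoothing): the Dirichlet-style mapping keeps every outcome bounded away from zero, while softmax applied to the same numbers piles nearly all mass onto the maximum.

```python
# Two mappings from nonnegative counts to a categorical distribution.
import math

def add_alpha(counts, alpha=1.0):
    # Dirichlet / Laplace smoothing: (c_k + alpha) / (sum + K*alpha)
    total = sum(counts) + alpha * len(counts)
    return [(c + alpha) / total for c in counts]

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

counts = [10, 0, 2]
print([round(p, 3) for p in add_alpha(counts)])  # every outcome stays nonzero
print([round(p, 3) for p in softmax(counts)])    # mass piles onto the max
```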
Reception of the explanation and usefulness of softmax
- Some find the post’s explanation intuitive and helpful for understanding why naive mappings from logits to probabilities are hard for networks to learn.
- Others feel the author overcomplicates things or is trying to show off.
- Practitioners note softmax’s practical utility for turning arbitrary real-valued scores (including negatives) into a clean probability distribution, and point to related work on classifiers as energy-based models.
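The practical point about arbitrary real-valued scores can be checked directly (a general property of softmax, not specific to any comment): negative or mixed inputs still yield a valid distribution, and adding a constant to every score leaves the output unchanged, which is why unnormalized logits suffice.

```python
# Softmax handles negative scores and is invariant to a common shift.
import math

def softmax(z):
    m = max(z)  # subtracting the max also prevents exp overflow
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

p = softmax([-3.2, 0.0, 1.7])
q = softmax([-3.2 + 100.0, 0.0 + 100.0, 1.7 + 100.0])  # shifted scores
print([round(v, 4) for v in p])  # valid probabilities, all positive
print([round(v, 4) for v in q])  # identical to p
```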
Strong reaction to all-lowercase style and writing norms
- A large subthread criticizes the author’s refusal to capitalize as distracting, “cognitively disruptive,” and unprofessional for a technical essay.
- Defenders frame lowercase as a generational/medium-specific norm originating in IM/SMS/Twitter, or as a legitimate stylistic choice on a personal blog.
- Broader debate ensues about language evolution, reader vs writer optimization, autocapitalization on phones, and whether lowercase in serious writing is a passing fad or the future of informal text.