Softmax forever, or why I like softmax

Critique of the post’s treatment of the “Distance Logits” paper

  • Several commenters argue that dismissing a paper after spotting a single questionable hyperparameter choice is understandable as a decision not to engage, but it does not excuse a sloppy critique.
  • A key technical objection: the post assumes the distances |a_k| ≈ 0 at initialization, but in the referenced paper the a_k are distances between vectors and are unlikely to be near zero, so the gradient issues near zero may be overstated.
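The objection can be checked numerically. The sketch below uses a hypothetical setup (random inputs and prototype vectors, not the paper's actual architecture) to show that Euclidean distances between random high-dimensional vectors concentrate well away from zero:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: a_k = ||x - w_k|| for a random input x and
# random class prototypes w_k, as in distance-based logits.
dim = 128
x = rng.normal(size=dim)
prototypes = rng.normal(size=(10, dim))
distances = np.linalg.norm(prototypes - x, axis=1)

# For standard-normal vectors the typical distance is about sqrt(2 * dim),
# i.e. roughly 16 here -- nowhere near zero.
print(distances.min(), distances.max())
```

Under this (assumed) initialization, none of the a_k is close to zero, which is the commenters' point: gradient pathologies of |a_k| near zero may rarely be exercised in practice.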

Naming and mathematical framing of softmax

  • Several argue that log-sum-exp is the true “soft maximum” and should have been called softmax; the current “softmax” is really the gradient of log-sum-exp and might better be called “softargmax” or “grad softmax.”
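The relationship the commenters describe is easy to verify numerically: the componentwise gradient of log-sum-exp is exactly the softmax. A minimal sketch (function names are my own):

```python
import numpy as np

def logsumexp(z):
    # The "soft maximum": a smooth upper bound on max(z).
    m = z.max()
    return m + np.log(np.exp(z - m).sum())

def softmax(z):
    # What the commenters suggest calling "softargmax" / "grad softmax".
    e = np.exp(z - z.max())
    return e / e.sum()

z = np.array([1.0, 2.0, 0.5])

# Central finite differences of logsumexp recover softmax componentwise.
eps = 1e-6
grad = np.array([
    (logsumexp(z + eps * np.eye(3)[i]) - logsumexp(z - eps * np.eye(3)[i])) / (2 * eps)
    for i in range(3)
])
print(np.allclose(grad, softmax(z), atol=1e-6))
```

This also explains the naming complaint: logsumexp smoothly approximates max, while softmax smoothly approximates argmax (as a one-hot vector).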

Statistical mechanics, maximum entropy, and softmax

  • One line of discussion defends softmax via its Boltzmann-distribution roots: exponentials arise from counting microstates and maximizing entropy under constraints.
  • Others note that in ML, the interpretation of “energy,” fixed average energy, and temperature is often loosely applied or ignored, so the physical analogy is more motivational than fundamental.
  • There’s skepticism about the maximum entropy principle itself and whether it is uniquely justified or “natural.”
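The Boltzmann framing discussed above can be made concrete: p_k ∝ exp(-E_k / T) is the maximum-entropy distribution with a fixed expected energy, and the temperature T controls how peaked it is. A small sketch (the energies and temperatures are illustrative, not from the post):

```python
import numpy as np

def boltzmann(energies, T=1.0):
    # p_k proportional to exp(-E_k / T); max is subtracted for stability.
    z = -np.asarray(energies, dtype=float) / T
    e = np.exp(z - z.max())
    return e / e.sum()

E = [0.0, 1.0, 2.0]
print(boltzmann(E, T=0.1))   # low temperature: nearly one-hot on the lowest energy
print(boltzmann(E, T=10.0))  # high temperature: nearly uniform
```

With logits z_k = -E_k and T = 1 this is exactly softmax, which is why the physical reading is available even when, as the commenters note, "energy" and "temperature" have no operational meaning in the ML setting.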

Alternative parameterizations of categorical distributions

  • Commenters stress that softmax is just one parametrization; other mappings from reals to categorical distributions can work.
  • A Bayesian/Dirichlet example is given: “add-one” (or more generally add-α) updating yields normalized probabilities with all outcomes nonzero, differing qualitatively from softmax’s tendency to push probabilities close to 0 or 1.
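The qualitative difference the commenters describe can be seen side by side. A minimal sketch, assuming the standard Dirichlet posterior mean for add-α smoothing (the counts and scores below are illustrative):

```python
import numpy as np

def dirichlet_mean(counts, alpha=1.0):
    # Posterior mean of a categorical under a symmetric Dirichlet(alpha) prior:
    # "add-alpha" keeps every outcome strictly positive, even with zero counts.
    c = np.asarray(counts, dtype=float) + alpha
    return c / c.sum()

def softmax(z):
    e = np.exp(np.asarray(z) - np.max(z))
    return e / e.sum()

print(dirichlet_mean([10, 0, 0]))   # ~[0.846, 0.077, 0.077]: unseen outcomes keep mass
print(softmax([10.0, 0.0, 0.0]))    # ~[0.9999, 4.5e-5, 4.5e-5]: near one-hot
```

Both are valid maps into the simplex; they differ in how aggressively a gap in the inputs is translated into a gap in the probabilities.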

Reception of the explanation and usefulness of softmax

  • Some find the post’s explanation intuitive and helpful for understanding why naive mappings from logits to probabilities are hard for networks to learn.
  • Others feel the author overcomplicates things or is trying to show off.
  • Practitioners note softmax’s practical utility for turning arbitrary real-valued scores (including negatives) into a clean probability distribution, and point to related work on classifiers as energy-based models.
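The practical point in the last bullet is that softmax accepts any real-valued scores, however negative or large, and returns a valid distribution, provided the usual max-subtraction trick is used to avoid overflow. A minimal sketch:

```python
import numpy as np

def softmax(scores):
    # Subtracting the max changes nothing mathematically but prevents
    # exp() from overflowing on large scores.
    e = np.exp(scores - np.max(scores))
    return e / e.sum()

scores = np.array([-3.2, 0.0, 5.1, -100.0])  # arbitrary reals, including negatives
p = softmax(scores)
print(p)              # sums to 1 (up to float error); all entries strictly positive
```

Without the max subtraction, a score of a few hundred would overflow float64; with it, the same computation is safe for any input range.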

Strong reaction to all-lowercase style and writing norms

  • A large subthread criticizes the author’s refusal to capitalize as distracting, “cognitively disruptive,” and unprofessional for a technical essay.
  • Defenders frame lowercase as a generational/medium-specific norm originating in IM/SMS/Twitter, or as a legitimate stylistic choice on a personal blog.
  • Broader debate ensues about language evolution, reader vs writer optimization, autocapitalization on phones, and whether lowercase in serious writing is a passing fad or the future of informal text.