How does cosine similarity work?

Practical implementation notes

  • Cosine distance is available in common libraries, but commenters report that SciPy’s distance module is slow, prone to overflow in mixed precision, and built on suboptimal math routines; it’s fine for small datasets but not for “big data.”
  • For JavaScript, multiple passes over standard arrays are seen as costly; TypedArray plus simple loops are recommended for speed and compatibility with native extensions.
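The fast path the first bullet alludes to can be sketched in plain NumPy; the helper name `cosine_similarity` and the float64 accumulation are illustrative choices, not from the thread:

```python
import numpy as np

def cosine_similarity(a, b) -> float:
    """Cosine similarity as a single dot product plus two norms,
    avoiding per-call overhead of a generic distance module."""
    # Accumulate in float64 to sidestep the mixed-precision overflow
    # issue mentioned above (an assumption about the failure mode).
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```

For many queries against a fixed corpus, the same idea vectorizes: pre-normalize the corpus rows once, then each query reduces to one matrix–vector product.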

Why cosine similarity is popular (especially in NLP)

  • In classic information retrieval, cosine on bag‑of‑words vectors naturally implements “length‑normalized word counting,” avoiding bias toward longer documents.
  • With word or sentence embeddings, cosine is interpreted as “how much the same features are active in both,” while ignoring overall scale.
  • Some see it as essentially “normalized dot product” and prefer that framing; others emphasize its geometric meaning as the cosine of the angle between vectors.
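The “length-normalized word counting” point can be made concrete with a toy bag-of-words example (the vectors below are made up for illustration):

```python
import numpy as np

def cosine(a, b) -> float:
    a, b = np.asarray(a, float), np.asarray(b, float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

short_doc = np.array([2, 1, 0, 1])   # toy word counts
long_doc = 3 * short_doc             # same content, repeated three times

# Cosine ignores overall scale, so the longer document is not favored:
# cosine(short_doc, long_doc) is ~1.0.
```

By contrast, a raw dot product between the two would triple, which is exactly the length bias cosine avoids.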

Normalization vs magnitude

  • Many practitioners normalize embeddings to unit length so cosine becomes a dot product; they ask whether magnitude ever matters.
  • Replies note that models (e.g., language models’ logits) use unnormalized dot products, so magnitude can encode information like token frequency.
  • Some applications use L2 distance where magnitude strongly affects similarity.
  • Debate arises over whether normalization “loses a dimension” or just discards magnitude information; the geometry of unit spheres and manifolds is discussed.
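A small sketch of the trade-off discussed above: after unit-normalization, cosine is literally a dot product, and any magnitude information is gone (random vectors here are just for demonstration):

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(size=8)
b = rng.normal(size=8)

# Full cosine similarity:
cos = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# After normalizing to unit length, cosine reduces to a dot product:
a_hat = a / np.linalg.norm(a)
b_hat = b / np.linalg.norm(b)
assert np.isclose(cos, a_hat @ b_hat)

# Magnitude is discarded: scaling b leaves cosine unchanged, but it
# changes an unnormalized dot product (as with language-model logits).
cos_scaled = a @ (5 * b) / (np.linalg.norm(a) * np.linalg.norm(5 * b))
assert np.isclose(cos, cos_scaled)
assert not np.isclose(a @ b, a @ (5 * b))
```

Whether that lost magnitude mattered is exactly the question the thread debates; it depends on whether the model encoded anything meaningful in vector length.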

High-dimensional behavior

  • Multiple comments note that in high dimensions, random vectors tend to have cosine near zero and distances cluster, but cosine still works well for relative ranking.
  • There’s mention of work suggesting not normalizing in high‑dimensional ML spaces and of the general “curse of dimensionality,” though details are left vague.
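The near-orthogonality claim is easy to check empirically; this sketch (sample sizes and seed are arbitrary) measures the mean absolute cosine between random Gaussian vectors as dimension grows:

```python
import numpy as np

rng = np.random.default_rng(42)

def mean_abs_cosine(dim: int, pairs: int = 1000) -> float:
    """Average |cosine| over random pairs of Gaussian vectors in R^dim."""
    a = rng.normal(size=(pairs, dim))
    b = rng.normal(size=(pairs, dim))
    cos = np.sum(a * b, axis=1) / (
        np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1)
    )
    return float(np.abs(cos).mean())

# As dimension grows, random vectors concentrate near cosine 0:
for dim in (2, 32, 512):
    print(dim, mean_abs_cosine(dim))
```

The values shrink roughly like 1/√dim, which is why absolute cosine scores become uninformative in high dimensions even though relative ranking can still work.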

Geometry vs abstraction

  • One camp insists we are doing geometry/trigonometry (angles, projections, spheres) and finds that intuitive, even in high dimensions.
  • Another camp prefers viewing vectors as feature lists or functions where cosine is just an inner product–based correlation, arguing that geometric imagery can mislead beyond 3D.

Alternatives and criticisms

  • Cosine similarity is criticized as status‑quo and sometimes misapplied, especially when vector magnitude carries important meaning or data include negatives, noise, or temporal/geospatial structure.
  • Suggested alternatives include Euclidean, L1/Manhattan, Chebyshev, Jaccard, Pearson correlation, Hamming, and problem‑specific metrics.
  • It’s noted that if all vectors have equal norm, cosine similarity and Euclidean distance induce the same nearest‑neighbor ordering.
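The last bullet follows from the identity ‖a − b‖² = 2 − 2·cos(a, b) for unit vectors, so Euclidean distance is a monotone function of cosine and the rankings coincide. A quick sketch with made-up random data:

```python
import numpy as np

rng = np.random.default_rng(7)
q = rng.normal(size=16)
candidates = rng.normal(size=(50, 16))

# Normalize everything to unit length so all norms are equal.
q /= np.linalg.norm(q)
candidates /= np.linalg.norm(candidates, axis=1, keepdims=True)

cos = candidates @ q
eucl = np.linalg.norm(candidates - q, axis=1)

# For unit vectors, ||a - b||^2 = 2 - 2*cos(a, b):
assert np.allclose(eucl**2, 2 - 2 * cos)

# Hence descending cosine and ascending Euclidean give the same order:
assert np.array_equal(np.argsort(-cos), np.argsort(eucl))
```

This is also why vector databases that store pre-normalized embeddings can treat the two metrics interchangeably.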