How does cosine similarity work?
Practical implementation notes
- Cosine distance is available in common libraries, but users report SciPy’s `scipy.spatial.distance` module as slow, prone to overflow in mixed precision, and relying on suboptimal math functions; it’s fine for small datasets but not “big data.” (A one-pass vectorized alternative is sketched after this list.)
- For JavaScript, multiple passes over standard arrays are seen as costly; TypedArrays plus simple loops are recommended for speed and compatibility with native extensions.
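As a rough illustration of the performance point, here is a minimal NumPy sketch (the function name is my own, not from the thread) that computes all pairwise cosine similarities in a single vectorized pass instead of calling `scipy.spatial.distance.cosine` once per pair:

```python
import numpy as np

def cosine_sim_matrix(A: np.ndarray, B: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity between rows of A and rows of B.

    Upcasts to float64 first, sidestepping the mixed-precision
    overflow issues mentioned above.
    """
    A = np.asarray(A, dtype=np.float64)
    B = np.asarray(B, dtype=np.float64)
    # Normalize rows to unit length; similarity is then one matmul.
    A_unit = A / np.linalg.norm(A, axis=1, keepdims=True)
    B_unit = B / np.linalg.norm(B, axis=1, keepdims=True)
    return A_unit @ B_unit.T

# Example: 3 query vectors against 1000 stored vectors.
rng = np.random.default_rng(0)
queries = rng.normal(size=(3, 128))
corpus = rng.normal(size=(1000, 128))
sims = cosine_sim_matrix(queries, corpus)  # shape (3, 1000)
```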
Why cosine similarity is popular (especially in NLP)
- In classic information retrieval, cosine on bag‑of‑words vectors naturally implements “length‑normalized word counting,” avoiding bias toward longer documents (a worked example follows this list).
- With word or sentence embeddings, cosine is interpreted as “how much the same features are active in both,” while ignoring overall scale.
- Some see it as essentially “normalized dot product” and prefer that framing; others emphasize its geometric meaning as the cosine of the angle between vectors.
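To make the “length‑normalized word counting” point concrete, a toy sketch (the counts are my own illustration): a document and the same document repeated twice get a cosine of exactly 1, while the raw dot product doubles.

```python
import numpy as np

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    # "Normalized dot product": dot product divided by both magnitudes.
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Bag-of-words counts over the vocabulary ["cat", "sat", "mat"].
doc = np.array([2.0, 1.0, 1.0])   # "cat cat sat mat"
doc_twice = 2 * doc               # the same text repeated twice

print(cosine(doc, doc_twice))     # 1.0: doubling the length changes nothing
print(np.dot(doc, doc_twice))     # 12.0: the raw dot product doubles instead
```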
Normalization vs magnitude
- Many practitioners normalize embeddings to unit length so cosine becomes a dot product (see the sketch after this list); they ask whether magnitude ever matters.
- Replies note that models (e.g., language models’ logits) use unnormalized dot products, so magnitude can encode information like token frequency.
- Some applications use L2 distance where magnitude strongly affects similarity.
- Debate arises over whether normalization “loses a dimension” or just discards magnitude information; the geometry of unit spheres and manifolds is discussed.
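A quick check of the equivalence practitioners rely on (my own sketch, not from the thread): after normalizing to unit length, the plain dot product reproduces cosine similarity exactly, and the magnitude information is what gets discarded.

```python
import numpy as np

rng = np.random.default_rng(1)
a, b = rng.normal(size=128), rng.normal(size=128)

# Normalize to unit length; magnitude information is discarded here.
a_unit = a / np.linalg.norm(a)
b_unit = b / np.linalg.norm(b)

cos = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
dot_of_units = np.dot(a_unit, b_unit)

print(np.isclose(cos, dot_of_units))  # True: cosine == dot on unit vectors
```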
High-dimensional behavior
- Multiple comments note that in high dimensions, random vectors tend to have cosine near zero and distances cluster, but cosine still works well for relative ranking (a quick simulation follows this list).
- There’s mention of work suggesting not normalizing in high‑dimensional ML spaces and of the general “curse of dimensionality,” though details are left vague.
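A small simulation of the concentration effect (my own, with Gaussian vectors as a stand-in for “random”): the cosine of independent random pairs centers on zero, and its spread shrinks as dimension grows.

```python
import numpy as np

rng = np.random.default_rng(2)

for dim in (3, 30, 300, 3000):
    # Cosine similarity of 10,000 independent random pairs in this dimension.
    u = rng.normal(size=(10_000, dim))
    v = rng.normal(size=(10_000, dim))
    cos = np.sum(u * v, axis=1) / (
        np.linalg.norm(u, axis=1) * np.linalg.norm(v, axis=1)
    )
    # Spread shrinks roughly like 1/sqrt(dim): near-orthogonality is typical.
    print(f"dim={dim:5d}  mean={cos.mean():+.4f}  std={cos.std():.4f}")
```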
Geometry vs abstraction
- One camp insists we are doing geometry/trigonometry (angles, projections, spheres) and finds that intuitive, even in high dimensions.
- Another camp prefers viewing vectors as feature lists or functions where cosine is just an inner product–based correlation, arguing that geometric imagery can mislead beyond 3D.
Alternatives and criticisms
- Cosine similarity is criticized as a status‑quo default that is sometimes misapplied, especially when vector magnitude carries important meaning or data include negatives, noise, or temporal/geospatial structure.
- Suggested alternatives include Euclidean, L1/Manhattan, Chebyshev, Jaccard, Pearson correlation, Hamming, and problem‑specific metrics.
- It’s noted that if all vectors have equal norm, cosine similarity and Euclidean distance induce the same nearest‑neighbor ordering; the identity behind this is sketched below.
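The equivalence in the last bullet follows from ‖a − b‖² = ‖a‖² + ‖b‖² − 2·(a·b), which for unit vectors reduces to 2 − 2·cos(a, b): distance is a monotone decreasing function of similarity, so the rankings must agree. A quick empirical check (my own sketch):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(500, 64))
X /= np.linalg.norm(X, axis=1, keepdims=True)  # all norms equal (unit length)

q = rng.normal(size=64)
q /= np.linalg.norm(q)

cos_order = np.argsort(-(X @ q))                      # best similarity first
l2_order = np.argsort(np.linalg.norm(X - q, axis=1))  # smallest distance first

# True: same nearest-neighbor ordering (ties are measure-zero with random data)
print(np.array_equal(cos_order, l2_order))
```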