How does cosine similarity work?
Practical implementation notes
- Cosine distance is available in common libraries, but users report SciPy’s `scipy.spatial.distance` module as slow, prone to overflow in mixed precision, and relying on suboptimal math functions; it’s fine for small datasets but not “big data.” (A one-pass vectorized alternative is sketched after this list.)
- For JavaScript, multiple passes over standard arrays are seen as costly; TypedArrays plus simple loops are recommended for speed and compatibility with native extensions.
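As a rough illustration of the performance point, here is a minimal NumPy sketch (the function name is my own, not from the thread) that computes all pairwise cosine similarities in a single vectorized pass instead of calling `scipy.spatial.distance.cosine` once per pair:

```python
import numpy as np

def cosine_sim_matrix(A: np.ndarray, B: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity between rows of A and rows of B.

    Upcasts to float64 first, sidestepping the mixed-precision
    overflow issues mentioned above.
    """
    A = np.asarray(A, dtype=np.float64)
    B = np.asarray(B, dtype=np.float64)
    # Normalize rows to unit length; similarity is then one matmul.
    A_unit = A / np.linalg.norm(A, axis=1, keepdims=True)
    B_unit = B / np.linalg.norm(B, axis=1, keepdims=True)
    return A_unit @ B_unit.T

# Example: 3 query vectors against 1000 stored vectors.
rng = np.random.default_rng(0)
queries = rng.normal(size=(3, 128))
corpus = rng.normal(size=(1000, 128))
sims = cosine_sim_matrix(queries, corpus)  # shape (3, 1000)
```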
Why cosine similarity is popular (especially in NLP)
- In classic information retrieval, cosine on bag‑of‑words vectors naturally implements “length‑normalized word counting,” avoiding bias toward longer documents (a worked example follows this list).
- With word or sentence embeddings, cosine is interpreted as “how much the same features are active in both,” while ignoring overall scale.
- Some see it as essentially “normalized dot product” and prefer that framing; others emphasize its geometric meaning as the cosine of the angle between vectors.
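To make the “length‑normalized word counting” point concrete, a toy sketch (the counts are my own illustration): a document and the same document repeated twice get a cosine of exactly 1, while the raw dot product doubles.

```python
import numpy as np

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    # "Normalized dot product": dot product divided by both magnitudes.
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Bag-of-words counts over the vocabulary ["cat", "sat", "mat"].
doc = np.array([2.0, 1.0, 1.0])   # "cat cat sat mat"
doc_twice = 2 * doc               # the same text repeated twice

print(cosine(doc, doc_twice))     # 1.0: doubling the length changes nothing
print(np.dot(doc, doc_twice))     # 12.0: the raw dot product doubles instead
```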
Normalization vs magnitude
- Many practitioners normalize embeddings to unit length so cosine becomes a dot product (see the sketch after this list); they ask whether magnitude ever matters.
- Replies note that models (e.g., language models’ logits) use unnormalized dot products, so magnitude can encode information like token frequency.
- Some applications use L2 distance where magnitude strongly affects similarity.
- Debate arises over whether normalization “loses a dimension” or just discards magnitude information; the geometry of unit spheres and manifolds is discussed.
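A quick check of the equivalence practitioners rely on (my own sketch, not from the thread): after normalizing to unit length, the plain dot product reproduces cosine similarity exactly, and the magnitude information is what gets discarded.

```python
import numpy as np

rng = np.random.default_rng(1)
a, b = rng.normal(size=128), rng.normal(size=128)

# Normalize to unit length; magnitude information is discarded here.
a_unit = a / np.linalg.norm(a)
b_unit = b / np.linalg.norm(b)

cos = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
dot_of_units = np.dot(a_unit, b_unit)

print(np.isclose(cos, dot_of_units))  # True: cosine == dot on unit vectors
```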
High-dimensional behavior
- Multiple comments note that in high dimensions, random vectors tend to have cosine near zero and distances cluster, but cosine still works well for relative ranking (a quick simulation follows this list).
- There’s mention of work suggesting not normalizing in high‑dimensional ML spaces and of the general “curse of dimensionality,” though details are left vague.
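A small simulation of the concentration effect (my own, with Gaussian vectors as a stand-in for “random”): the cosine of independent random pairs centers on zero, and its spread shrinks as dimension grows.

```python
import numpy as np

rng = np.random.default_rng(2)

for dim in (3, 30, 300, 3000):
    # Cosine similarity of 10,000 independent random pairs in this dimension.
    u = rng.normal(size=(10_000, dim))
    v = rng.normal(size=(10_000, dim))
    cos = np.sum(u * v, axis=1) / (
        np.linalg.norm(u, axis=1) * np.linalg.norm(v, axis=1)
    )
    # Spread shrinks roughly like 1/sqrt(dim): near-orthogonality is typical.
    print(f"dim={dim:5d}  mean={cos.mean():+.4f}  std={cos.std():.4f}")
```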
Geometry vs abstraction
- One camp insists we are doing geometry/trigonometry (angles, projections, spheres) and finds that intuitive, even in high dimensions.
- Another camp prefers viewing vectors as feature lists or functions where cosine is just an inner product–based correlation, arguing that geometric imagery can mislead beyond 3D.
Alternatives and criticisms
- Cosine similarity is criticized as a status‑quo default that is sometimes misapplied, especially when vector magnitude carries important meaning or data include negatives, noise, or temporal/geospatial structure.
- Suggested alternatives include Euclidean, L1/Manhattan, Chebyshev, Jaccard, Pearson correlation, Hamming, and problem‑specific metrics.
- It’s noted that if all vectors have equal norm, cosine similarity and Euclidean distance induce the same nearest‑neighbor ordering; the identity behind this is sketched below.
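The equivalence in the last bullet follows from ‖a − b‖² = ‖a‖² + ‖b‖² − 2·(a·b), which for unit vectors reduces to 2 − 2·cos(a, b): distance is a monotone decreasing function of similarity, so the rankings must agree. A quick empirical check (my own sketch):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(500, 64))
X /= np.linalg.norm(X, axis=1, keepdims=True)  # all norms equal (unit length)

q = rng.normal(size=64)
q /= np.linalg.norm(q)

cos_order = np.argsort(-(X @ q))                      # best similarity first
l2_order = np.argsort(np.linalg.norm(X - q, axis=1))  # smallest distance first

# True: same nearest-neighbor ordering (ties are measure-zero with random data)
print(np.array_equal(cos_order, l2_order))
```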