2024-05-16

Computer scientists invent an efficient new way to count

Relationship to existing algorithms

Many compare CVM to HyperLogLog and related “sketch” structures:
- Same high-level goal: approximate distinct counts over large streams with bounded memory.
- HLL supports unions and (with care) intersections and is widely used in distributed systems; CVM currently lacks good merge/union semantics.
- CVM is praised as conceptually simpler and more textbook-friendly, but generally seen as not strictly state of the art compared to tuned HLL variants.
Others relate it to reservoir sampling, BJKST, Bloom filters, and top‑k frequency algorithms. The key link: assigning each distinct item a random “key” and managing a small representative sample.

How the algorithm works (intuitions)

Intuition offered: each distinct element behaves as if it gets a single random number; the algorithm keeps a set of elements whose “effective” random numbers are below a threshold that is halved over rounds.
The probability an element survives depends only on its last occurrence, so high-frequency items don’t bias the estimate.
Conceptually like a floating‑point counter: an exponent (number of halvings) plus mantissa (current set size).

Implementation details & corrections

Several readers implemented it (Python, Go, Nim, JS, PHP) and found it very short and fast.
Multiple commenters highlight that the Quanta article’s prose description is wrong: it suggests flipping only for duplicates, whereas the algorithm must (effectively) flip for every occurrence or equivalently always delete-then-reinsert-with‑probability.
A rare “no element deleted” case in the shrink step should be handled with a loop rather than a single if; this also yields an unbiased estimator.

Accuracy, parameters, and theory

Error is controlled via ε (relative error) and δ (failure probability), using Chernoff-style bounds.
Theoretical thresh depends on stream length m; this is seen as impractical when m is unknown. Workarounds discussed:
- Choose a pessimistic m and later solve for the actual ε.
- Use follow‑up variants that remove explicit dependence on m.
Empirical tests show good accuracy when the memory threshold is a modest fraction of the true distinct count; theoretical bounds are thought to be conservative.

Practical pros, cons, and open questions

Pros: tiny, simple implementation; very fast (sometimes IO-bound); accessible proofs; works for general “structured sets” where good hash families are unknown.
Cons / limitations:
- Needs to store actual elements (or hashes), so memory is O(table size × element size).
- No native deletions, unions, or intersections; merging is an open problem.
- Heavier CPU than naive counting due to frequent random draws, though the sketch is small.
Debate over the headline: many stress this is estimation, not exact counting; others argue in large-scale practice “counting” is almost always approximate anyway.

Meta: communication and pedagogy

Several prefer the original paper or later technical note over the popular article, citing clearer pseudocode and correct details.
The algorithm is praised as an elegant “for the textbook” example of lateral thinking and probabilistic methods in streaming algorithms.