Computer scientists invent an efficient new way to count
Relationship to existing algorithms
- Many compare CVM to HyperLogLog and related “sketch” structures:
- Same high-level goal: approximate distinct counts over large streams with bounded memory.
- HLL supports unions and (with care) intersections and is widely used in distributed systems; CVM currently lacks good merge/union semantics.
- CVM is praised as conceptually simpler and more textbook-friendly, but generally seen as not strictly state of the art compared to tuned HLL variants.
- Others relate it to reservoir sampling, BJKST, Bloom filters, and top‑k frequency algorithms. The key link: assigning each distinct item a random “key” and managing a small representative sample.
How the algorithm works (intuitions)
- Intuition offered: each distinct element behaves as if it gets a single random number; the algorithm keeps a set of elements whose “effective” random numbers are below a threshold that is halved over rounds.
- The probability an element survives depends only on its last occurrence, so high-frequency items don’t bias the estimate.
- Conceptually like a floating‑point counter: an exponent (number of halvings) plus mantissa (current set size).
Implementation details & corrections
- Several readers implemented it (Python, Go, Nim, JS, PHP) and found it very short and fast.
- Multiple commenters highlight that the Quanta article’s prose description is wrong: it suggests flipping only for duplicates, whereas the algorithm must (effectively) flip for every occurrence or equivalently always delete-then-reinsert-with‑probability.
- A rare “no element deleted” case in the shrink step should be handled with a loop rather than a single if; this also yields an unbiased estimator.
Accuracy, parameters, and theory
- Error is controlled via ε (relative error) and δ (failure probability), using Chernoff-style bounds.
- Theoretical thresh depends on stream length m; this is seen as impractical when m is unknown. Workarounds discussed:
- Choose a pessimistic m and later solve for the actual ε.
- Use follow‑up variants that remove explicit dependence on m.
- Empirical tests show good accuracy when the memory threshold is a modest fraction of the true distinct count; theoretical bounds are thought to be conservative.
Practical pros, cons, and open questions
- Pros: tiny, simple implementation; very fast (sometimes IO-bound); accessible proofs; works for general “structured sets” where good hash families are unknown.
- Cons / limitations:
- Needs to store actual elements (or hashes), so memory is O(table size × element size).
- No native deletions, unions, or intersections; merging is an open problem.
- Heavier CPU than naive counting due to frequent random draws, though the sketch is small.
- Debate over the headline: many stress this is estimation, not exact counting; others argue in large-scale practice “counting” is almost always approximate anyway.
Meta: communication and pedagogy
- Several prefer the original paper or later technical note over the popular article, citing clearer pseudocode and correct details.
- The algorithm is praised as an elegant “for the textbook” example of lateral thinking and probabilistic methods in streaming algorithms.