Computer scientists invent an efficient new way to count

Relationship to existing algorithms

  • Many compare CVM to HyperLogLog and related “sketch” structures:
    • Same high-level goal: approximate distinct counts over large streams with bounded memory.
    • HLL supports unions and (with care) intersections and is widely used in distributed systems; CVM currently lacks good merge/union semantics.
    • CVM is praised as conceptually simpler and more textbook-friendly, but generally seen as not strictly state of the art compared to tuned HLL variants.
  • Others relate it to reservoir sampling, BJKST, Bloom filters, and top‑k frequency algorithms. The key link: assigning each distinct item a random “key” and managing a small representative sample.

How the algorithm works (intuitions)

  • Intuition offered: each distinct element behaves as if it gets a single random number; the algorithm keeps a set of elements whose “effective” random numbers are below a threshold that is halved over rounds.
  • The probability an element survives depends only on its last occurrence, so high-frequency items don’t bias the estimate.
  • Conceptually like a floating‑point counter: an exponent (number of halvings) plus mantissa (current set size).

Implementation details & corrections

  • Several readers implemented it (Python, Go, Nim, JS, PHP) and found it very short and fast.
  • Multiple commenters highlight that the Quanta article’s prose description is wrong: it suggests flipping only for duplicates, whereas the algorithm must (effectively) flip for every occurrence or equivalently always delete-then-reinsert-with‑probability.
  • A rare “no element deleted” case in the shrink step should be handled with a loop rather than a single if; this also yields an unbiased estimator.

Accuracy, parameters, and theory

  • Error is controlled via ε (relative error) and δ (failure probability), using Chernoff-style bounds.
  • Theoretical thresh depends on stream length m; this is seen as impractical when m is unknown. Workarounds discussed:
    • Choose a pessimistic m and later solve for the actual ε.
    • Use follow‑up variants that remove explicit dependence on m.
  • Empirical tests show good accuracy when the memory threshold is a modest fraction of the true distinct count; theoretical bounds are thought to be conservative.

Practical pros, cons, and open questions

  • Pros: tiny, simple implementation; very fast (sometimes IO-bound); accessible proofs; works for general “structured sets” where good hash families are unknown.
  • Cons / limitations:
    • Needs to store actual elements (or hashes), so memory is O(table size × element size).
    • No native deletions, unions, or intersections; merging is an open problem.
    • Heavier CPU than naive counting due to frequent random draws, though the sketch is small.
  • Debate over the headline: many stress this is estimation, not exact counting; others argue in large-scale practice “counting” is almost always approximate anyway.

Meta: communication and pedagogy

  • Several prefer the original paper or later technical note over the popular article, citing clearer pseudocode and correct details.
  • The algorithm is praised as an elegant “for the textbook” example of lateral thinking and probabilistic methods in streaming algorithms.