Reservoir Sampling
Interview Question Experiences
- Many commenters saw reservoir sampling as a classic big-tech interview question.
- Some passed by already knowing the algorithm; others “floundered” trying to derive it under time pressure.
- Debate over the fairness of such questions: several doubt a smart person could derive the algorithm from scratch in 60 minutes; others say interviewers expected solid reasoning rather than full reinvention, though at least one commenter reports that missing the exact acceptance rule meant failing the question.
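For reference, a minimal Python sketch of the classic algorithm the commenters are arguing about (Algorithm R), including the k/i acceptance rule; function and variable names are illustrative, not taken from the article:

```python
import random

def reservoir_sample(stream, k):
    """Algorithm R: keep the first k items, then accept the i-th item
    (1-indexed) with probability k / i, evicting a uniformly random slot.
    Every item in a stream of length n ends up in the sample with
    probability k / n."""
    reservoir = []
    for i, item in enumerate(stream, start=1):
        if i <= k:
            reservoir.append(item)
        else:
            j = random.randrange(i)  # uniform over [0, i)
            if j < k:                # true with probability k / i
                reservoir[j] = item
    return reservoir
```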
Practical Applications & Variants
- Uses mentioned:
- Choosing shard split points for lexicographically sorted keys.
- Sampling tape backups or hospital records.
- Maintaining “recently active” items (e.g., chess boards).
- Coreutils' shuf and metrics libraries.
- Graphics: weighted reservoir sampling in ReSTIR for real‑time ray tracing.
- Alternate formulations:
- Assign each item a random priority and keep the top‑k.
- Weighted versions using transformations like pow(random, 1/weight); stability and distribution-accuracy caveats noted.
- Skip-based algorithms using geometric distributions to jump over items, useful when skipping is cheap or for low-power devices.
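A rough Python sketch of two of these formulations, with illustrative names and assuming positive per-item weights: the priority/top-k view using the pow(random, 1/weight) key, and a skip-based variant in the spirit of Algorithm L:

```python
import heapq
import math
import random

def weighted_priority_sample(stream, k):
    """Priority formulation: give each (item, weight) pair the key
    random() ** (1 / weight) and keep the k items with the largest keys
    (a min-heap holds the current winners). With all weights equal to 1
    this reduces to a plain uniform reservoir sample."""
    heap = []  # entries are (key, tiebreaker, item); smallest key evicted first
    for i, (item, weight) in enumerate(stream):
        key = random.random() ** (1.0 / weight)
        entry = (key, i, item)
        if len(heap) < k:
            heapq.heappush(heap, entry)
        elif key > heap[0][0]:
            heapq.heapreplace(heap, entry)
    return [item for _, _, item in heap]

def skip_sample(items, k):
    """Skip-based variant (Algorithm L style): instead of one random draw
    per item, draw geometric skip lengths and jump straight to the next
    accepted index. Attractive when skipping items is cheap."""
    n = len(items)
    if n <= k:
        return list(items)
    reservoir = list(items[:k])
    w = math.exp(math.log(random.random()) / k)
    i = k - 1  # index of the last item consumed
    while True:
        # number of items to skip over before the next acceptance
        i += int(math.log(random.random()) / math.log(1.0 - w)) + 1
        if i >= n:
            return reservoir
        reservoir[random.randrange(k)] = items[i]
        w *= math.exp(math.log(random.random()) / k)
```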
Composition and Fairness
- Question about “composing” reservoir sampling (service + collector): one view says yes in principle, another clarifies that interval boundaries and details matter.
- The priority-based view makes it easier to reason about composition: if you preserve the global top‑k priorities, composition is valid.
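Under that priority-based view, composition is just a merge of per-source top-k sets; a sketch assuming each source kept its (priority, item) pairs as in the weighted sketch above:

```python
import heapq

def merge_reservoirs(per_source_topk, k):
    """Compose reservoirs from several services/collectors: if every source
    retained the (priority, item) pairs with its k largest priorities, then
    the k globally largest priorities are exactly what a single reservoir
    over the combined stream would have kept."""
    all_pairs = [pair for source in per_source_topk for pair in source]
    return heapq.nlargest(k, all_pairs, key=lambda pair: pair[0])
```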
Logging / Observability Discussion
- Reservoir sampling seen as a way to:
- Protect centralized log/metrics systems under bursty load.
- Avoid blind spots from naive rate limiting.
- Concerns:
- Simple per-interval sampling overrepresents slow periods and underrepresents bursty services, making it bad for some optimization/capacity planning metrics.
- Suggested mitigations: source aggregation, reweighting counts, biasing selection by importance, head/tail sampling, per-level reservoirs (a combined sketch follows this list).
- Some prefer domain-aware dropping/aggregation first, then random culling as last resort.
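One way the "per-level reservoirs" and "reweighting counts" ideas above could fit together, sketched with hypothetical names rather than any particular metrics library:

```python
import random
from collections import defaultdict

class PerLevelReservoir:
    """A fixed-size reservoir per log level plus a per-level counter, so a
    burst of INFO cannot crowd out ERROR and the dropped volume is still
    known for reweighting downstream aggregates."""

    def __init__(self, k):
        self.k = k
        self.seen = defaultdict(int)      # true number of events per level
        self.samples = defaultdict(list)  # retained events per level

    def offer(self, level, event):
        self.seen[level] += 1
        i = self.seen[level]
        reservoir = self.samples[level]
        if len(reservoir) < self.k:
            reservoir.append(event)
        else:
            j = random.randrange(i)
            if j < self.k:
                reservoir[j] = event

    def flush(self):
        """Emit (true_count, sampled_events) per level and reset, so the
        collector can scale counts back up instead of treating the sample
        as the whole population."""
        out = {level: (self.seen[level], list(events))
               for level, events in self.samples.items()}
        self.seen.clear()
        self.samples.clear()
        return out
```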
Statistics & Sampling Nuances
- Emphasis that reservoir sampling gives an unbiased sample (for the right variant), but downstream statistical interpretation can still be tricky.
- Anecdotes about fabricated environmental/tourism stats highlight the difference between sound sampling and bad (or dishonest) data.
- Brief side discussions on German tank estimation, aliasing/sampling theorem, and the need to track sampling rates so aggregates can be reconstructed.
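A minimal illustration of that last point, keeping the sampling rate alongside the sample so aggregates can be scaled back up (names hypothetical):

```python
def estimate_total(sampled_values, kept, seen):
    """Scale a sum computed on a uniform sample back to the full stream:
    each kept event stands in for seen / kept originals (a Horvitz-Thompson
    style reweighting)."""
    if kept == 0:
        return 0.0
    return sum(sampled_values) * (seen / kept)

# e.g. 1,000 requests observed, 50 sampled, sampled latencies summing to 4.2 s
# -> estimated total latency across all requests: 4.2 * (1000 / 50) = 84 s
```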
Article Design & Interactivity
- Strong praise for:
- Clear, playful visualizations and interactive simulations (including the physics “hero” using reservoir sampling).
- Accessibility: colorblind‑friendly palette, testing with browser color filters.
- Thoughtful details (artwork, logs, small jokes).
- Compared favorably to interactive explainers like Distill, Bret Victor, and others; several readers say the site feels like a “treasure trove.”