My Favorite Algorithm: Linear Time Median Finding (2018)

Linear-Time Selection Algorithms

  • Discussion centers on quickselect vs. deterministic median-of-medians (BFPRT) and variants like Floyd–Rivest and introselect.
  • Several note that median-of-medians is elegant but has large constants; often slower than randomized quickselect unless inputs are huge or worst-case guarantees matter.
  • Others point to newer work (e.g., Alexandrescu-style selection, practical benchmarks and libraries) that make deterministic selection competitive in practice.
  • Some mention specialized optimizations when selecting extreme quantiles (e.g., biased pivots, “j-th of k-th”, Floyd–Rivest-style tweaks).

Radix Sort and String / Integer Sorting

  • Debate over whether radix sort is “O(kN)” or “O(N+M)” for strings; one side defines M as total character count, making it linear in input size.
  • Practical comments: radix sort can be extremely fast, especially for fixed-size integers or strings, but comparison sorts often win on CPUs unless N is large or k is small.
  • Links to optimized parallel/string radix sorts and cutting-edge variants; concern over recursion depth and cache behavior for MSD vs LSD approaches.

Streaming and Approximate Quantiles

  • Many references to online median/quantile estimators: remedian, p², FAME, Greenwald–Khanna, t-digest, histogram-based methods (log/exponential, HDR, ddsketch, hg64).
  • Tradeoffs discussed: memory bounds, quantile vs value error, stationary vs rolling quantiles, relative vs absolute error, and applicability to anomaly detection and monitoring.
  • Some note that approximations are often “good enough”, especially in observability systems; others emphasize hard questions about error bounds and assumptions.

Interviews and Real-World Constraints

  • Multiple stories of interview questions about medians of huge streams or files; criticism that expected “optimal” heap or selection algorithms are unrealistic in 30 minutes.
  • Suggestions of simpler multi-pass or radix/bucket-style solutions for 32-bit integers under tight memory or streaming constraints.
  • Strong skepticism about “PhD-level” whiteboard questions, with some arguing they select for leetcode practice rather than job-relevant skills.

Complexity, Proofs, and Big-O Nuances

  • Several alternative proofs that median-of-medians recurrence is O(n), including superadditivity and induction-based bounds.
  • Discussion of average vs worst-case quickselect behavior, the effect of skewed distributions, and whether using the mean as pivot helps (often not).
  • Philosophical notes: theoretical CS cares about problems and bounds; existence of an O(n) algorithm changes research focus even if the first such algorithm is impractical.
  • Side debate on big-O vs real-world limits: in practice all programs on finite machines are O(1), but big-O remains useful for understanding scaling.

Implementation Details and Miscellany

  • Nitpicks about the article’s Python code (float vs integer division, variable naming) and about claiming small-n fallbacks as “constant time”.
  • Explanation of why groups of 5 are used in BFPRT and how groups of 3 or 4 can be adapted while keeping linear complexity.
  • Related “favorite algorithms” mentioned: alias method for O(1) weighted sampling and parallelizable merge techniques.