2024-07-25

My Favorite Algorithm: Linear Time Median Finding (2018)

Linear-Time Selection Algorithms

Discussion centers on quickselect vs. deterministic median-of-medians (BFPRT) and variants like Floyd–Rivest and introselect.
Several note that median-of-medians is elegant but has large constants; often slower than randomized quickselect unless inputs are huge or worst-case guarantees matter.
Others point to newer work (e.g., Alexandrescu-style selection, practical benchmarks and libraries) that make deterministic selection competitive in practice.
Some mention specialized optimizations when selecting extreme quantiles (e.g., biased pivots, “j-th of k-th”, Floyd–Rivest-style tweaks).

Radix Sort and String / Integer Sorting

Debate over whether radix sort is “O(kN)” or “O(N+M)” for strings; one side defines M as total character count, making it linear in input size.
Practical comments: radix sort can be extremely fast, especially for fixed-size integers or strings, but comparison sorts often win on CPUs unless N is large or k is small.
Links to optimized parallel/string radix sorts and cutting-edge variants; concern over recursion depth and cache behavior for MSD vs LSD approaches.

Streaming and Approximate Quantiles

Many references to online median/quantile estimators: remedian, p², FAME, Greenwald–Khanna, t-digest, histogram-based methods (log/exponential, HDR, ddsketch, hg64).
Tradeoffs discussed: memory bounds, quantile vs value error, stationary vs rolling quantiles, relative vs absolute error, and applicability to anomaly detection and monitoring.
Some note that approximations are often “good enough”, especially in observability systems; others emphasize hard questions about error bounds and assumptions.

Interviews and Real-World Constraints

Multiple stories of interview questions about medians of huge streams or files; criticism that expected “optimal” heap or selection algorithms are unrealistic in 30 minutes.
Suggestions of simpler multi-pass or radix/bucket-style solutions for 32-bit integers under tight memory or streaming constraints.
Strong skepticism about “PhD-level” whiteboard questions, with some arguing they select for leetcode practice rather than job-relevant skills.

Complexity, Proofs, and Big-O Nuances

Several alternative proofs that median-of-medians recurrence is O(n), including superadditivity and induction-based bounds.
Discussion of average vs worst-case quickselect behavior, the effect of skewed distributions, and whether using the mean as pivot helps (often not).
Philosophical notes: theoretical CS cares about problems and bounds; existence of an O(n) algorithm changes research focus even if the first such algorithm is impractical.
Side debate on big-O vs real-world limits: in practice all programs on finite machines are O(1), but big-O remains useful for understanding scaling.

Implementation Details and Miscellany

Nitpicks about the article’s Python code (float vs integer division, variable naming) and about claiming small-n fallbacks as “constant time”.
Explanation of why groups of 5 are used in BFPRT and how groups of 3 or 4 can be adapted while keeping linear complexity.
Related “favorite algorithms” mentioned: alias method for O(1) weighted sampling and parallelizable merge techniques.

Related topics