My Favorite Algorithm: Linear Time Median Finding (2018)
Linear-Time Selection Algorithms
- Discussion centers on quickselect vs. deterministic median-of-medians (BFPRT) and variants like Floyd–Rivest and introselect.
- Several note that median-of-medians is elegant but has large constants; often slower than randomized quickselect unless inputs are huge or worst-case guarantees matter.
- Others point to newer work (e.g., Alexandrescu-style selection, practical benchmarks and libraries) that make deterministic selection competitive in practice.
- Some mention specialized optimizations when selecting extreme quantiles (e.g., biased pivots, “j-th of k-th”, Floyd–Rivest-style tweaks).
Radix Sort and String / Integer Sorting
- Debate over whether radix sort is “O(kN)” or “O(N+M)” for strings; one side defines M as total character count, making it linear in input size.
- Practical comments: radix sort can be extremely fast, especially for fixed-size integers or strings, but comparison sorts often win on CPUs unless N is large or k is small.
- Links to optimized parallel/string radix sorts and cutting-edge variants; concern over recursion depth and cache behavior for MSD vs LSD approaches.
Streaming and Approximate Quantiles
- Many references to online median/quantile estimators: remedian, p², FAME, Greenwald–Khanna, t-digest, histogram-based methods (log/exponential, HDR, ddsketch, hg64).
- Tradeoffs discussed: memory bounds, quantile vs value error, stationary vs rolling quantiles, relative vs absolute error, and applicability to anomaly detection and monitoring.
- Some note that approximations are often “good enough”, especially in observability systems; others emphasize hard questions about error bounds and assumptions.
Interviews and Real-World Constraints
- Multiple stories of interview questions about medians of huge streams or files; criticism that expected “optimal” heap or selection algorithms are unrealistic in 30 minutes.
- Suggestions of simpler multi-pass or radix/bucket-style solutions for 32-bit integers under tight memory or streaming constraints.
- Strong skepticism about “PhD-level” whiteboard questions, with some arguing they select for leetcode practice rather than job-relevant skills.
Complexity, Proofs, and Big-O Nuances
- Several alternative proofs that median-of-medians recurrence is O(n), including superadditivity and induction-based bounds.
- Discussion of average vs worst-case quickselect behavior, the effect of skewed distributions, and whether using the mean as pivot helps (often not).
- Philosophical notes: theoretical CS cares about problems and bounds; existence of an O(n) algorithm changes research focus even if the first such algorithm is impractical.
- Side debate on big-O vs real-world limits: in practice all programs on finite machines are O(1), but big-O remains useful for understanding scaling.
Implementation Details and Miscellany
- Nitpicks about the article’s Python code (float vs integer division, variable naming) and about claiming small-n fallbacks as “constant time”.
- Explanation of why groups of 5 are used in BFPRT and how groups of 3 or 4 can be adapted while keeping linear complexity.
- Related “favorite algorithms” mentioned: alias method for O(1) weighted sampling and parallelizable merge techniques.