An overengineered solution to `sort | uniq -c` with 25x throughput (hist)
Use cases and problem variants
- Several commenters say `sort | uniq -c`-style histograms are common in bioinformatics, log analysis, CSV inspection, and ETL tasks, where file sizes can be tens of GB.
- Others want related but different operations:
  - Deduplicate while preserving original order (keep first or last occurrence).
  - Count unique lines without sorting at all.
  - “Fuzzy” grouping of near-duplicate lines (e.g., log lines differing only by timestamp), which is acknowledged as harder.
Algorithms, ordering, and memory trade-offs
- Coreutils `sort | uniq -c | sort -n` is disk-based and low-memory by design, so it scales to huge inputs but is slower.
- The Rust tool uses a hash map: faster, but bounded by RAM, especially if most lines are unique (see the counting sketch after this list).
- Discussion of order-preserving dedupe (see the dedupe sketch after this list):
  - Keeping the first occurrence is straightforward with a hash set.
  - Keeping the last requires two passes or extra data structures (e.g., track the last index of each line and then sort by it, or use self-sorting structures).
  - Simple CLI tricks like reverse → dedupe-first → reverse are suggested.
- Alternative data structures (tries, cuckoo filters, Bloom/sketch-based tools) are mentioned as ways to reduce memory or do approximate dedupe, with trade-offs in counts and false positives.
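A minimal sketch of the hash-map counting approach, assuming input on stdin and enough RAM to hold every distinct line; this illustrates the technique, not the tool’s actual implementation:

```rust
use std::collections::HashMap;
use std::io::{self, BufRead, BufWriter, Write};

fn main() -> io::Result<()> {
    // Count occurrences of each distinct line. Memory grows with the number
    // of distinct lines, unlike the disk-backed coreutils pipeline.
    let stdin = io::stdin();
    let mut counts: HashMap<String, u64> = HashMap::new();
    for line in stdin.lock().lines() {
        *counts.entry(line?).or_insert(0) += 1;
    }

    // Sort by count, descending, mimicking `sort | uniq -c | sort -nr`.
    let mut entries: Vec<(String, u64)> = counts.into_iter().collect();
    entries.sort_by(|a, b| b.1.cmp(&a.1));

    let stdout = io::stdout();
    let mut out = BufWriter::new(stdout.lock());
    for (line, count) in entries {
        writeln!(out, "{count:>7} {line}")?;
    }
    Ok(())
}
```

The memory bound discussed above is visible here: the `HashMap` holds one entry per distinct line, which is exactly what a mostly-unique input stresses.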
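For the order-preserving dedupe variants, a sketch under the same in-memory assumption: keep-first needs only a hash set, while keep-last records each line’s last index and re-sorts by it, one of the approaches mentioned above.

```rust
use std::collections::{HashMap, HashSet};

/// Keep the first occurrence of each line, preserving input order.
fn dedupe_keep_first(lines: &[String]) -> Vec<String> {
    let mut seen: HashSet<&str> = HashSet::new();
    lines
        .iter()
        .filter(|l| seen.insert(l.as_str())) // insert() returns false for repeats
        .cloned()
        .collect()
}

/// Keep the last occurrence of each line, in the order those last occurrences
/// appear: record the last index per line, then sort the survivors by it.
/// (Equivalent to the reverse → dedupe-first → reverse trick.)
fn dedupe_keep_last(lines: &[String]) -> Vec<String> {
    let mut last_index: HashMap<&str, usize> = HashMap::new();
    for (i, l) in lines.iter().enumerate() {
        last_index.insert(l.as_str(), i); // later occurrences overwrite earlier ones
    }
    let mut survivors: Vec<(usize, &str)> =
        last_index.into_iter().map(|(l, i)| (i, l)).collect();
    survivors.sort_unstable_by_key(|&(i, _)| i);
    survivors.into_iter().map(|(_, l)| l.to_owned()).collect()
}

fn main() {
    let input: Vec<String> = ["a", "b", "a", "c", "b"]
        .iter()
        .map(|s| s.to_string())
        .collect();
    println!("{:?}", dedupe_keep_first(&input)); // ["a", "b", "c"]
    println!("{:?}", dedupe_keep_last(&input));  // ["a", "c", "b"]
}
```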
Benchmarking methodology and alternatives
- Multiple people question the benchmark:
  - Random FASTQ with almost no duplicates primarily stresses hash-table growth, not realistic “histogram” workloads.
  - Coreutils `sort` can be sped up with a larger `--buffer-size` and parallelism flags.
  - Removing an unnecessary `cat` already improves the naïve baseline.
- Several other approaches are proposed and partially benchmarked:
  - `awk '{ x[$0]++ } END { for (y in x) print y, x[y] }' | sort -k2,2nr`
  - An awk/Perl one-pass dedupe (`!a[$0]++`), especially when ordering doesn’t matter.
  - Rust uutils ports of `sort`/`uniq`, which can outperform GNU coreutils in some tests.
Tooling comparisons and performance ideas
- clickhouse-local is demonstrated as dramatically faster than both coreutils and the Rust tool for this task, but:
  - Some argue comparisons should consider single-threaded vs. multi-threaded fairness.
  - Others respond that parallelism is a legitimate advantage, not something to “turn off” for fairness.
- Further Rust micro-optimizations are suggested: faster hashing and string-sorting libraries, avoiding slow `println!`, and using a CSV writer for high-throughput output (see the output-buffering sketch after this list).
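On the `println!` point, a hedged sketch of the kind of change meant: lock stdout once and write through a `BufWriter`, rather than calling `println!` per line (which re-locks stdout and, since Rust’s stdout is line-buffered, flushes on every newline). The faster-hasher and CSV-writer suggestions are analogous swaps; no specific crates are assumed here.

```rust
use std::io::{self, BufWriter, Write};

fn main() -> io::Result<()> {
    // Stand-in data; in a real counter these pairs would come from the hash map.
    let results = vec![("GATTACA", 42_u64), ("ACGT", 7)];

    // Slow path (commented out): `println!` re-locks stdout on every call, and
    // Rust's stdout is line-buffered, so each newline forces a flush.
    // for (line, count) in &results {
    //     println!("{count}\t{line}");
    // }

    // Faster path: lock once, buffer writes, flush once at the end.
    let stdout = io::stdout();
    let mut out = BufWriter::new(stdout.lock());
    for (line, count) in &results {
        writeln!(out, "{count}\t{line}")?;
    }
    out.flush()?;
    Ok(())
}
```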
Overengineering, optimization, and developer time
- One thread debates the term “overengineered”:
  - Some argue it’s just “engineered” to different requirements (throughput vs. flexibility).
  - Overengineering is framed as overshooting realistic requirements with excessive effort, not merely optimizing.
- A related sub-discussion contrasts:
  - Reusing standard tools (`sort`, `uniq`) vs. rewriting in Python.
  - Whether rejecting candidates for not knowing shell one-liners is sensible.
  - The value of shared, standard tools versus private utility libraries.
  - The pragmatic reality that LLMs now help both write and understand code and shell pipelines.
Security and installation concerns
- A side debate arises around clickhouse’s suggested `curl ... | sh` installation:
  - Some see it as equivalent in risk to downloading and executing a binary.
  - Others call it an anti-pattern, arguing distro-packaged binaries and signatures offer stronger supply-chain protection.
  - Comparisons are made to other ecosystems’ supply-chain issues (e.g., npm incidents), reinforcing general unease about arbitrary remote code execution.