An overengineered solution to `sort | uniq -c` with 25x throughput (hist)

Use cases and problem variants

  • Several commenters say `sort | uniq -c`-style histograms are common in bioinformatics, log analysis, CSV inspection, and ETL tasks, where inputs can reach tens of gigabytes (the baseline pipeline is sketched after this list).
  • Others want related but different operations:
    • Deduplicate while preserving original order (keep first or last occurrence).
    • Count unique lines without sorting at all.
    • “Fuzzy” grouping of near-duplicate lines (e.g., log lines differing only by timestamp), which is acknowledged as harder.
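
  For reference, a minimal sketch of the baseline pipeline the title refers to (input.txt is a placeholder filename): sort so duplicates become adjacent, count runs with uniq -c, then order by count.

      sort input.txt | uniq -c | sort -n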

Algorithms, ordering, and memory trade-offs

  • Coreutils `sort | uniq -c | sort -n` is disk-based and low-memory by design, so it scales to huge inputs but is slower.
  • The Rust tool counts with an in-memory hash map: faster, but memory grows with the number of distinct lines, so it is effectively bounded by RAM when most lines are unique.
  • Discussion of order-preserving dedupe:
    • Keeping the first occurrence is straightforward with a hash set.
    • Keeping the last occurrence needs two passes or extra bookkeeping (e.g., record each line’s last index and sort by it afterwards, or use an order-maintaining structure).
    • Simple CLI tricks such as reverse → dedupe-first → reverse are also suggested (sketched after this list).
  • Alternative data structures (tries, cuckoo filters, Bloom/sketch-based tools) are mentioned as ways to reduce memory or do approximate dedupe, with trade-offs in counts and false positives.
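
  A minimal sketch of both order-preserving dedupe variants in awk (the associative array plays the role of the hash set above; input.txt is a placeholder, and tac is the GNU line-reversal utility):

      # keep the FIRST occurrence of each line, preserving input order
      awk '!seen[$0]++' input.txt

      # keep the LAST occurrence: reverse, dedupe-first, reverse back
      tac input.txt | awk '!seen[$0]++' | tac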

Benchmarking methodology and alternatives

  • Multiple people question the benchmark:
    • Random FASTQ with almost no duplicates primarily stresses hash-table growth, not realistic “histogram” workloads.
    • Coreutils sort can be sped up with a larger --buffer-size and its --parallel flag (see the sketch after this list).
    • Removing an unnecessary cat already improves the naïve baseline.
  • Several other approaches are proposed and partially benchmarked:
    • awk '{ x[$0]++ } END { for (y in x) print y, x[y] }' | sort -k2,2nr
    • awk/Perl one-pass dedupe (!a[$0]++), especially when sorted output isn’t required.
    • Rust uutils ports of sort/uniq, which can outperform GNU coreutils in some tests.
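
  A hedged sketch of the tuned baseline (the buffer size is illustrative; --buffer-size and --parallel are GNU sort options, and input.txt is a placeholder):

      # feed the file directly instead of piping through cat, and give sort more memory and threads
      sort --parallel="$(nproc)" --buffer-size=1G input.txt | uniq -c | sort -n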

Tooling comparisons and performance ideas

  • clickhouse-local is demonstrated as dramatically faster than both coreutils and the Rust tool for this task (a sketch follows this list), but:
    • Some argue comparisons should consider single-threaded vs. multi-threaded fairness.
    • Others respond that parallelism is a legitimate advantage, not something to “turn off” for fairness.
  • Further Rust micro-optimizations are suggested (faster hashing and string-sorting libraries; avoiding per-line println!, which locks stdout on every call; using a CSV writer for high-throughput output).
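
  One possible shape of the clickhouse-local invocation (a hedged sketch, not the commenter’s exact command: the table name "table" and the stdin/--structure conventions follow clickhouse-local defaults, and flags may vary by version):

      # count identical lines and order by frequency
      clickhouse-local \
        --structure "line String" --input-format LineAsString \
        --query "SELECT line, count() AS c FROM table GROUP BY line ORDER BY c DESC" \
        < input.txt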

Overengineering, optimization, and developer time

  • One thread debates the term “overengineered”:
    • Some argue it’s just “engineered” to different requirements (throughput vs. flexibility).
    • Overengineering is framed as overshooting realistic requirements with excessive effort, not merely optimizing.
  • A related sub-discussion contrasts:
    • Reusing standard tools (sort, uniq) vs. rewriting in Python.
    • Whether rejecting candidates for not knowing shell one-liners is sensible.
    • The value of shared, standard tools versus private utility libraries.
    • The pragmatic reality that LLMs now help both write and understand code or shell pipelines.

Security and installation concerns

  • A side debate arises around clickhouse’s suggested curl ... | sh installation:
    • Some see it as equivalent in risk to downloading and executing a binary.
    • Others call it an anti-pattern, arguing distro-packaged binaries and signatures offer stronger supply-chain protection.
    • Comparisons are made to other ecosystems’ supply-chain issues (e.g., npm incidents), reinforcing general unease about arbitrary remote code execution.