An overengineered solution to `sort | uniq -c` with 25x throughput (hist)

Use cases and problem variants

  • Several commenters say `sort | uniq -c`-style histograms are common in bioinformatics, log analysis, CSV inspection, and ETL tasks, where inputs can reach tens of gigabytes (the baseline pipeline is sketched after this list).
  • Others want related but different operations:
    • Deduplicate while preserving original order (keep first or last occurrence).
    • Count unique lines without sorting at all.
    • “Fuzzy” grouping of near-duplicate lines (e.g., log lines differing only by timestamp), which is acknowledged as harder.
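
  For reference, a minimal sketch of the baseline pipeline the title refers to (input.txt is a placeholder filename): sort so duplicates become adjacent, count runs with uniq -c, then order by count.

      sort input.txt | uniq -c | sort -n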

Algorithms, ordering, and memory trade-offs

  • Coreutils `sort | uniq -c | sort -n` is disk-based and low-memory by design, so it scales to huge inputs but is slower.
  • The Rust tool counts with an in-memory hash map: faster, but memory grows with the number of distinct lines, so it is effectively bounded by RAM when most lines are unique.
  • Discussion of order-preserving dedupe:
    • Keeping the first occurrence is straightforward with a hash set.
    • Keeping the last occurrence needs two passes or extra bookkeeping (e.g., record each line’s last index and sort by it afterwards, or use an order-maintaining structure).
    • Simple CLI tricks such as reverse → dedupe-first → reverse are also suggested (sketched after this list).
  • Alternative data structures (tries, cuckoo filters, Bloom/sketch-based tools) are mentioned as ways to reduce memory or do approximate dedupe, with trade-offs in counts and false positives.
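
  A minimal sketch of both order-preserving dedupe variants in awk (the associative array plays the role of the hash set above; input.txt is a placeholder, and tac is the GNU line-reversal utility):

      # keep the FIRST occurrence of each line, preserving input order
      awk '!seen[$0]++' input.txt

      # keep the LAST occurrence: reverse, dedupe-first, reverse back
      tac input.txt | awk '!seen[$0]++' | tac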

Benchmarking methodology and alternatives

  • Multiple people question the benchmark:
    • Random FASTQ with almost no duplicates primarily stresses hash-table growth, not realistic “histogram” workloads.
    • Coreutils sort can be sped up with a larger --buffer-size and its --parallel flag (see the sketch after this list).
    • Removing an unnecessary cat already improves the naïve baseline.
  • Several other approaches are proposed and partially benchmarked:
    • awk '{ x[$0]++ } END { for (y in x) print y, x[y] }' | sort -k2,2nr
    • awk/Perl one-pass dedupe (!a[$0]++), especially when sorted output isn’t required.
    • Rust uutils ports of sort/uniq, which can outperform GNU coreutils in some tests.
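
  A hedged sketch of the tuned baseline (the buffer size is illustrative; --buffer-size and --parallel are GNU sort options, and input.txt is a placeholder):

      # feed the file directly instead of piping through cat, and give sort more memory and threads
      sort --parallel="$(nproc)" --buffer-size=1G input.txt | uniq -c | sort -n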

Tooling comparisons and performance ideas

  • clickhouse-local is demonstrated as dramatically faster than both coreutils and the Rust tool for this task (a sketch follows this list), but:
    • Some argue comparisons should consider single-threaded vs. multi-threaded fairness.
    • Others respond that parallelism is a legitimate advantage, not something to “turn off” for fairness.
  • Further Rust micro-optimizations are suggested (faster hashing and string-sorting libraries; avoiding per-line println!, which locks stdout on every call; using a CSV writer for high-throughput output).
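
  One possible shape of the clickhouse-local invocation (a hedged sketch, not the commenter’s exact command: the table name "table" and the stdin/--structure conventions follow clickhouse-local defaults, and flags may vary by version):

      # count identical lines and order by frequency
      clickhouse-local \
        --structure "line String" --input-format LineAsString \
        --query "SELECT line, count() AS c FROM table GROUP BY line ORDER BY c DESC" \
        < input.txt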

Overengineering, optimization, and developer time

  • One thread debates the term “overengineered”:
    • Some argue it’s just “engineered” to different requirements (throughput vs. flexibility).
    • Overengineering is framed as overshooting realistic requirements with excessive effort, not merely optimizing.
  • A related sub-discussion contrasts:
    • Reusing standard tools (sort, uniq) vs. rewriting in Python.
    • Whether rejecting candidates for not knowing shell one-liners is sensible.
    • The value of shared, standard tools versus private utility libraries.
    • The pragmatic reality that LLMs now help both write and understand code or shell pipelines.

Security and installation concerns

  • A side debate arises around clickhouse’s suggested curl ... | sh installation:
    • Some see it as equivalent in risk to downloading and executing a binary.
    • Others call it an anti-pattern, arguing distro-packaged binaries and signatures offer stronger supply-chain protection.
    • Comparisons are made to other ecosystems’ supply-chain issues (e.g., npm incidents), reinforcing general unease about arbitrary remote code execution.