Bzip3: A spiritual successor to BZip2

Benchmarks & Performance

  • Multiple independent benchmarks were shared:
    • On a large text file and a Linux disk image, bzip3 achieved slightly better compression ratios than zstd/xz but was often much slower, especially on decompression, and used far more RAM (e.g., ~18 GB vs single‑digit MB for bzip2).
    • One user’s SQL benchmark found bzip3 compressing better than zstd for similar compression time, but decompression was ~20× slower, contradicting the README’s claims and raising suspicion about cherry-picked inputs and HDD-skewed results.
  • Enabling zstd’s --long and higher levels (up to -22) often made zstd competitive or superior on the same datasets.
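
The comparisons above turn on three separate numbers: ratio, compression time, and decompression time. A minimal harness for measuring them might look like the sketch below. It is an illustration only: neither bzip3 nor zstd is assumed to be importable, so the standard library's zlib, bz2, and lzma stand in, and the `benchmark` function and the sample input are hypothetical. Peak memory is not measured.

```python
import bz2
import lzma
import time
import zlib


def benchmark(data: bytes) -> list[dict]:
    """Time compression and decompression of `data` with stdlib codecs."""
    # Stand-in codecs: swap in real bzip3/zstd bindings if you have them.
    codecs = {
        "zlib-9": (lambda d: zlib.compress(d, 9), zlib.decompress),
        "bz2-9": (lambda d: bz2.compress(d, 9), bz2.decompress),
        "xz-6": (lambda d: lzma.compress(d, preset=6), lzma.decompress),
    }
    rows = []
    for name, (compress, decompress) in codecs.items():
        t0 = time.perf_counter()
        packed = compress(data)
        t1 = time.perf_counter()
        restored = decompress(packed)
        t2 = time.perf_counter()
        if restored != data:
            raise ValueError(f"{name}: round trip mismatch")
        rows.append({
            "codec": name,
            "ratio": len(data) / len(packed),
            "compress_s": t1 - t0,       # reported separately, since the
            "decompress_s": t2 - t1,     # thread disputes decompression speed
        })
    return rows


if __name__ == "__main__":
    # Made-up, highly repetitive input; real benchmarks need real corpora.
    sample = b"SELECT id, name FROM users WHERE id = 42;\n" * 20000
    for row in benchmark(sample):
        print(f"{row['codec']:8s} ratio={row['ratio']:.1f} "
              f"c={row['compress_s']:.4f}s d={row['decompress_s']:.4f}s")
```

Running the same harness on several unrelated inputs, and from warm cache, is one way to guard against the cherry-picking and disk-speed effects raised in the thread.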

Benchmark Design & Long‑Range Redundancy

  • The headline Perl-source benchmark (many similar versions) is seen as a “lowlight”:
    • It heavily favors algorithms that exploit long-range redundancy across near-duplicate files.
    • Others show zstd and even rar+lzip-style tools doing extremely well once long-window parameters are tuned.
  • Several argue that such a corpus is unrepresentative for typical use; corpus benchmarks later in the README are viewed as more realistic.
  • Discussion notes that BWT-based schemes shine on codebases with many similar files; suggestions include sorting files by extension/name before archiving to help any compressor.
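
The sorting suggestion can be sketched as follows: order the member files by extension, then by path, so that similar files sit next to each other in the archive before any compressor sees them. The helper names `sort_key` and `tar_bytes` and the sample files are hypothetical, and lzma is used only because it ships with Python.

```python
import io
import lzma
import tarfile
from pathlib import PurePosixPath


def sort_key(name: str) -> tuple[str, str]:
    """Group by extension first, then by full path."""
    return (PurePosixPath(name).suffix, name)


def tar_bytes(files: dict[str, bytes], order: list[str]) -> bytes:
    """Build an uncompressed tar in memory with members in `order`."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name in order:
            info = tarfile.TarInfo(name)
            info.size = len(files[name])
            tar.addfile(info, io.BytesIO(files[name]))
    return buf.getvalue()


if __name__ == "__main__":
    files = {
        "b/x.c": b"int main(void) { return 0; }\n",
        "a/y.h": b"#pragma once\n",
        "a/z.c": b"int main(void) { return 1; }\n",
    }
    order = sorted(files, key=sort_key)
    archive = tar_bytes(files, order)
    print(order, len(lzma.compress(archive)))
```

How much this helps depends on the compressor's window or block size relative to the distance between similar files; with a large enough window the ordering matters less.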

Algorithm Focus & Design Choices

  • The author states bzip3 is intended as a modern replacement for bzip2:
    • BWT-based, text-leaning, with much larger block sizes and built‑in parallelism.
    • Uses arithmetic coding and context mixing; designed for modern CPUs with more RAM/cache.
  • Commenters clarify that LZ-based and BWT-based compressors tend to excel on different data: LZ on binary data, BWT on “textual” data.
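
To illustrate why independent blocks make parallelism straightforward, here is a minimal sketch that splits the input into blocks, compresses each one on a thread pool, and concatenates the results. This is not bzip3's format or implementation: it uses Python's bz2 module as a stand-in, and `compress_blocks`, the block size, and the worker count are hypothetical choices.

```python
import bz2
from concurrent.futures import ThreadPoolExecutor


def compress_blocks(data: bytes, block_size: int = 1 << 20,
                    workers: int = 4) -> bytes:
    """Compress fixed-size blocks independently and concatenate the streams."""
    blocks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order, so the streams stay in sequence.
        streams = list(pool.map(bz2.compress, blocks))
    return b"".join(streams)


if __name__ == "__main__":
    payload = b"the quick brown fox jumps over the lazy dog\n" * 100000
    packed = compress_blocks(payload)
    # bz2.decompress accepts several concatenated streams.
    assert bz2.decompress(packed) == payload
    print(len(payload), "->", len(packed))
```

The trade-off is the one the thread circles around: larger blocks give the transform more context and usually a better ratio, at the cost of more memory per worker.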

Burrows–Wheeler Transform (BWT) Discussion

  • Many express near-awe at BWT’s “magic,” especially its reversibility.
  • Several commenters offer detailed explanations:
    • BWT clusters symbols sharing the same following context, turning n‑gram structure into runs that RLE + entropy coding can exploit.
    • It’s closely related to suffix trees/arrays and conceptually similar to high‑order PPM models but with an implicit model.
    • Huffman/ANS need a model; BWT provides an efficient high‑order model, making low-order predictors behave like high-order ones.
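
The clustering and reversibility described above can be seen in a deliberately naive sketch of the transform. It sorts all rotations explicitly, which is quadratic in memory and far from how production compressors work (they typically build a suffix array); the function names and the sentinel convention are assumptions made for illustration.

```python
def bwt(text: str, sentinel: str = "\0") -> str:
    """Return the last column of the sorted rotations of text + sentinel."""
    if sentinel in text:
        raise ValueError("sentinel must not occur in the input")
    s = text + sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)


def inverse_bwt(last: str, sentinel: str = "\0") -> str:
    """Rebuild the original text by repeatedly prepending and sorting."""
    n = len(last)
    table = [""] * n
    for _ in range(n):
        # Prepending the last column to the sorted table and re-sorting
        # reconstructs one more leading column of the rotation matrix.
        table = sorted(last[i] + table[i] for i in range(n))
    original = next(row for row in table if row.endswith(sentinel))
    return original[:-1]


if __name__ == "__main__":
    transformed = bwt("banana", sentinel="$")
    print(transformed)                      # annb$aa
    print(inverse_bwt(transformed, "$"))    # banana
```

Even on this tiny input the output groups the two "n" characters and two of the "a" characters together, which is the run structure that later stages exploit.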

Naming, Compatibility, and Ecosystem

  • Some dislike the “bzip3” name as easily confused with bzip2 and not wire-compatible, preferring a more distinct name.
  • Others argue an incompatible format merits a new major version number; confusion is the trade‑off.

Reliability, Backups, and Warnings

  • The README’s explicit warning about possible unrecoverable data makes some hesitant to use bzip3 for backups.
  • Others note virtually all OSS licenses disclaim warranty; reliability must be established via testing (e.g., compress–decompress–verify loops).
  • Past reports of bzip2 data loss and lzip’s focus on recoverability are mentioned; some have switched xz → lzip for that reason.
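
A compress–decompress–verify loop of the kind mentioned above can be sketched in a few lines. The example uses Python's bz2 module as a stand-in, since no bzip3 binding is assumed to be available, and `compress_verified` is a hypothetical helper name.

```python
import bz2
import hashlib


def compress_verified(data: bytes, level: int = 9) -> bytes:
    """Compress, immediately decompress, and compare digests before returning."""
    expected = hashlib.sha256(data).digest()
    packed = bz2.compress(data, level)
    restored = bz2.decompress(packed)
    if hashlib.sha256(restored).digest() != expected:
        # Fail loudly so the caller never deletes the only good copy.
        raise RuntimeError("round trip mismatch; keep the original data")
    return packed


if __name__ == "__main__":
    original = b"backup payload\n" * 10000
    archive = compress_verified(original)
    print(len(original), "->", len(archive))
```

This only shows that one input survived one round trip on one machine; it does not address recoverability after the archive itself is damaged, which is the concern behind the lzip comments.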

gzip vs zstd and Practical Adoption

  • Several contend zstd dominates gzip on speed and ratio at all points, recommending zstd (or lz4 for ultra-fast) except where backward compatibility is paramount.
  • Others stick with gzip for near-universal availability, tooling (zcat, zless, zgrep), and long-term “Lindy” stability.
  • Some raise concerns about how widely zstd and related tools are installed by default across OSes; others will wait for zstd integration into ecosystems (e.g., Python stdlib) before switching.

Other Tools, Features, and Omissions

  • Some ask why lzip wasn’t benchmarked; they see it as a natural comparison point.
  • A feature request: store the uncompressed size in the file format, as gzip does (gzip records it, modulo 2^32, in its trailer); debate follows about zip bombs vs easier integrity checking.
  • Interest in better “long‑range” compression algorithms beyond large-window LZ/BWT and deduplication; this is seen as a promising but under‑researched area.
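
On the uncompressed-size request: gzip's recorded size lives in the last four bytes of each member (the ISIZE field, little-endian, length modulo 2^32). A minimal sketch of reading it, assuming a single-member file and using the hypothetical helper name `gzip_declared_size`:

```python
import gzip
import struct


def gzip_declared_size(blob: bytes) -> int:
    """Return ISIZE: the uncompressed length modulo 2**32, as declared."""
    # 10-byte header plus 8-byte trailer is the bare minimum for a member.
    if len(blob) < 18:
        raise ValueError("too short to be a gzip member")
    (isize,) = struct.unpack("<I", blob[-4:])
    return isize


if __name__ == "__main__":
    blob = gzip.compress(b"x" * 12345)
    print(gzip_declared_size(blob))   # 12345
```

The value is only a claim made by whoever wrote the file, so it can serve as a cheap sanity check after decompression but not as a guard against zip bombs; a decompressor still has to cap its own output.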