Bzip3: A spiritual successor to BZip2
Benchmarks & Performance
- Multiple independent benchmarks were shared:
- On a large text file and a Linux disk image, bzip3 achieved slightly better compression ratios than zstd/xz but was often much slower, especially on decompression, and used far more RAM (e.g., ~18 GB vs single‑digit MB for bzip2).
- One user’s SQL benchmark found bzip3 compressing better than zstd for similar compression time, but decompression was ~20× slower, contradicting the README’s claims and raising suspicion about cherry-picked inputs and HDD-skewed results.
- Enabling zstd’s `--long` mode and higher levels (up to `-22`) often made zstd competitive or superior on the same datasets.
Benchmark Design & Long‑Range Redundancy
- The headline Perl-source benchmark (many similar versions) is seen as a “lowlight”:
- It heavily favors algorithms that exploit long-range redundancy across near-duplicate files.
- Others show zstd and even rar+lzip-style tools doing extremely well once long-window parameters are tuned.
- Several argue that such a corpus is unrepresentative for typical use; corpus benchmarks later in the README are viewed as more realistic.
- Discussion notes that BWT-based schemes shine on codebases with many similar files; suggestions include sorting files by extension/name before archiving to help any compressor.
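The file-ordering suggestion above can be sketched with the standard library. This is a minimal illustration, not a tool from the thread: it groups files by extension and then name before writing a tar archive, so near-duplicate content lands close together for whatever compressor runs afterward.

```python
import io
import tarfile
from pathlib import Path

def build_sorted_tar(root: str) -> bytes:
    """Archive files grouped by (extension, name) so that similar
    content sits close together for the downstream compressor."""
    paths = sorted(
        (p for p in Path(root).rglob("*") if p.is_file()),
        key=lambda p: (p.suffix, p.name),
    )
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for p in paths:
            tar.add(p, arcname=str(p.relative_to(root)))
    return buf.getvalue()
```

Whether this helps depends on the compressor's window size; for very large windows (zstd `--long`, BWT with big blocks) the ordering matters less.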
Algorithm Focus & Design Choices
- The author states bzip3 is intended as a modern replacement for bzip2:
- BWT-based, text-leaning, with much larger block sizes and built‑in parallelism.
- Uses arithmetic coding and context mixing; designed for modern CPUs with more RAM/cache.
- Clarifications that LZ and BWT tend to excel on different data (binary vs “textual”).
Burrows–Wheeler Transform (BWT) Discussion
- Many express near-awe at BWT’s “magic,” especially its reversibility.
- Several detailed explanations:
- BWT clusters symbols sharing the same following context, turning n‑gram structure into runs that RLE + entropy coding can exploit.
- It’s closely related to suffix trees/arrays and conceptually similar to high‑order PPM models but with an implicit model.
- Huffman/ANS need a model; BWT provides an efficient high‑order model, making low-order predictors behave like high-order ones.
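The reversibility that commenters find “magic” fits in a few lines. Below is a toy forward/inverse BWT using naive rotation sorting with a sentinel character — real codecs like bzip3 build a suffix array instead of materializing rotations, so this is illustrative only.

```python
def bwt(s: str, sentinel: str = "\0") -> str:
    """Naive BWT: sort all rotations of s + sentinel, take the last
    column.  Production codecs use a suffix array instead."""
    s += sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(r[-1] for r in rotations)

def ibwt(t: str, sentinel: str = "\0") -> str:
    """Invert by repeatedly prepending the transform to the table and
    re-sorting; the row ending in the sentinel is the original."""
    table = [""] * len(t)
    for _ in range(len(t)):
        table = sorted(c + row for c, row in zip(t, table))
    return next(row for row in table if row.endswith(sentinel))[:-1]
```

For example, `bwt("banana")` yields `"annb\0aa"`: the `a`s cluster because they all precede `n` in the sorted rotations, which is exactly the run structure RLE + entropy coding exploits.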
Naming, Compatibility, and Ecosystem
- Some dislike the “bzip3” name as easily confused with bzip2 and not wire-compatible, preferring a more distinct name.
- Others argue an incompatible format merits a new major version number; confusion is the trade‑off.
Reliability, Backups, and Warnings
- The README’s explicit warning about possible unrecoverable data makes some hesitant to use bzip3 for backups.
- Others note virtually all OSS licenses disclaim warranty; reliability must be established via testing (e.g., compress–decompress–verify loops).
- Past reports of bzip2 data loss and lzip’s focus on recoverability are mentioned; some have switched xz → lzip for that reason.
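The compress–decompress–verify loop mentioned above is easy to automate. The sketch below uses the stdlib `bz2` module (bzip2) as a stand-in, since bzip3 has no standard Python bindings — for bzip3 you would shell out to its CLI and compare digests the same way.

```python
import bz2
import hashlib

def roundtrip_ok(path: str) -> bool:
    """Verify a compressor by compressing, decompressing, and
    comparing SHA-256 digests.  bz2 is a stand-in here; swap in a
    subprocess call to the bzip3 CLI to test that codec instead."""
    with open(path, "rb") as f:
        original = f.read()
    restored = bz2.decompress(bz2.compress(original))
    return hashlib.sha256(restored).digest() == hashlib.sha256(original).digest()
```

Running such a loop over a representative corpus (and across versions of the tool) is the kind of testing commenters say must substitute for the warranty no OSS license provides.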
gzip vs zstd and Practical Adoption
- Several contend zstd dominates gzip on both speed and ratio across the entire trade-off curve, recommending zstd (or lz4 for ultra-fast needs) except where backward compatibility is paramount.
- Others stick with gzip for near-universal availability, tooling (`zcat`, `zless`, `zgrep`), and long-term “Lindy” stability.
- Concerns about how widely zstd and related tools are installed by default across OSes; some will wait for zstd integration into ecosystems (e.g., the Python stdlib) before switching.
Other Tools, Features, and Omissions
- Some ask why lzip wasn’t benchmarked; they see it as a natural comparison point.
- A feature request: store uncompressed size in headers (as gzip does); debate follows about zip bombs vs easier integrity checking.
- Interest in better “long‑range” compression algorithms beyond large-window LZ/BWT and deduplication; this is seen as a promising but under‑researched area.
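On the uncompressed-size feature request: gzip already records the original length modulo 2³² as the little-endian ISIZE field in its last four bytes (RFC 1952). A minimal reader, assuming a single-member gzip stream:

```python
import struct

def gzip_isize(blob: bytes) -> int:
    """Read gzip's trailing ISIZE field: the uncompressed length
    modulo 2**32, little-endian in the final four bytes (RFC 1952).
    Only valid for single-member streams, and a careful decoder
    must not trust it blindly (zip bombs, truncation)."""
    return struct.unpack("<I", blob[-4:])[0]
```

The modulo-2³² wraparound and the multi-member caveat are exactly why the thread's zip-bomb objection has teeth: the field helps honest integrity checks but cannot bound decompressed output on its own.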