2025-02-01

Bzip3: A spiritual successor to BZip2

Benchmarks & Performance

Multiple independent benchmarks were shared:
- On a large text file and a Linux disk image, bzip3 achieved slightly better compression ratios than zstd/xz but was often much slower, especially on decompression, and used far more RAM (e.g., ~18 GB vs single‑digit MB for bzip2).
- One user’s SQL benchmark found bzip3 compressing better than zstd for similar compression time, but decompression was ~20× slower, contradicting the README’s claims and raising suspicion about cherry-picked inputs and HDD-skewed results.
Enabling zstd’s --long and higher levels (up to -22) often made zstd competitive or superior on the same datasets.

Benchmark Design & Long‑Range Redundancy

The headline Perl-source benchmark (many similar versions) is seen as a “lowlight”:
- It heavily favors algorithms that exploit long-range redundancy across near-duplicate files.
- Others show zstd and even rar+lzip-style tools doing extremely well once long-window parameters are tuned.
Several argue that such a corpus is unrepresentative for typical use; corpus benchmarks later in the README are viewed as more realistic.
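The window-size effect behind this criticism can be reproduced with standard-library codecs alone. The sketch below is an illustration, not one of the benchmarks from the thread: it uses zlib (DEFLATE, 32 KiB window) and lzma (multi-megabyte dictionary at the default preset) as stand-ins for small-window and large-window compressors. The function name `compressed_sizes` is hypothetical.

```python
import lzma
import os
import zlib


def compressed_sizes(block_size=64 * 1024):
    """Compress two identical random blocks placed back to back.

    The second copy starts block_size bytes after the first, which is
    beyond DEFLATE's 32 KiB window but well inside LZMA's dictionary.
    """
    block = os.urandom(block_size)  # incompressible on its own
    data = block + block            # the only redundancy is long-range
    deflate_size = len(zlib.compress(data, 9))
    lzma_size = len(lzma.compress(data))
    return len(data), deflate_size, lzma_size


total, deflate_size, lzma_size = compressed_sizes()
print(f"input={total}  zlib={deflate_size}  lzma={lzma_size}")
```

The small-window codec cannot see the duplicate and leaves the data at roughly its original size, while the large-window codec stores the second copy almost for free. A corpus made of many near-duplicate files rewards the same property, which is why it says little about behavior on unrelated data.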
Discussion notes that BWT-based schemes shine on codebases with many similar files; suggestions include sorting files by extension/name before archiving to help any compressor.
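The sorting suggestion can be sketched as follows. This is a minimal illustration assuming a tar archive is built first and compressed afterwards; the helper names `archive_order` and `write_sorted_tar` are hypothetical and not part of any tool mentioned in the discussion.

```python
import tarfile
from pathlib import PurePosixPath


def archive_order(paths):
    """Order paths by extension, then file name, then full path.

    Files of the same type end up adjacent in the archive, so similar
    content is more likely to fall inside one compression window or block.
    """
    def key(path):
        p = PurePosixPath(path)
        return (p.suffix, p.name, path)

    return sorted(paths, key=key)


def write_sorted_tar(paths, out_path):
    """Write an uncompressed tar whose members follow archive_order()."""
    with tarfile.open(out_path, "w") as tar:
        for path in archive_order(paths):
            tar.add(path, recursive=False)


print(archive_order(["src/b.c", "doc/a.md", "src/a.h", "lib/a.c"]))
```

The resulting tar can then be fed to any compressor. Whether the reordering helps depends on the data and on the compressor's window or block size.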

Algorithm Focus & Design Choices

The author states bzip3 is intended as a modern replacement for bzip2:
- BWT-based, text-leaning, with much larger block sizes and built‑in parallelism.
- Uses arithmetic coding and context mixing; designed for modern CPUs with more RAM/cache.
Commenters clarify that LZ and BWT tend to excel on different data: LZ on binary data, BWT on “textual” data.

Burrows–Wheeler Transform (BWT) Discussion

Many express near-awe at BWT’s “magic,” especially its reversibility.
Several detailed explanations:
- BWT clusters symbols sharing the same following context, turning n‑gram structure into runs that RLE + entropy coding can exploit.
- It’s closely related to suffix trees/arrays and conceptually similar to high‑order PPM models but with an implicit model.
- Huffman/ANS need a model; BWT provides an efficient high‑order model, making low-order predictors behave like high-order ones.
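The clustering and the reversibility can both be seen in a textbook-style sketch. This is a naive version that sorts all rotations, written for illustration only; it is not bzip3's implementation, and production code typically builds a suffix array instead. The sentinel character and the function names are assumptions made for this example.

```python
def bwt(text, sentinel="\0"):
    """Naive forward BWT: sort all rotations, keep the last column."""
    if sentinel in text:
        raise ValueError("input must not contain the sentinel")
    s = text + sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    # Each output symbol is the one that precedes a sorted context,
    # so symbols followed by similar text end up next to each other.
    return "".join(rotation[-1] for rotation in rotations)


def inverse_bwt(last_column, sentinel="\0"):
    """Naive inverse: repeatedly prepend the last column and re-sort."""
    table = [""] * len(last_column)
    for _ in range(len(last_column)):
        table = sorted(c + row for c, row in zip(last_column, table))
    # The row that ends with the sentinel is the original string.
    original = next(row for row in table if row.endswith(sentinel))
    return original[:-1]


transformed = bwt("banana")
print(repr(transformed))
print(inverse_bwt(transformed))
```

For "banana" the transform groups the two n's and the trailing a's together, which is the kind of run a later RLE or entropy-coding stage can exploit, and the inverse recovers the input from the last column alone.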

Naming, Compatibility, and Ecosystem

Some dislike the “bzip3” name as easily confused with bzip2 and not wire-compatible, preferring a more distinct name.
Others argue an incompatible format merits a new major version number; confusion is the trade‑off.

Reliability, Backups, and Warnings

The README’s explicit warning about possible unrecoverable data makes some hesitant to use bzip3 for backups.
Others note virtually all OSS licenses disclaim warranty; reliability must be established via testing (e.g., compress–decompress–verify loops).
Past reports of bzip2 data loss and lzip’s focus on recoverability are mentioned; some have switched xz → lzip for that reason.
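A compress–decompress–verify loop of the kind mentioned above can be sketched with standard-library codecs. The example assumes bz2 and lzma as stand-ins, since no Python binding for bzip3 is assumed here; `verify_round_trip` is a hypothetical helper name.

```python
import bz2
import hashlib
import lzma
import os


def verify_round_trip(data, compress, decompress):
    """Return True if decompress(compress(data)) reproduces data exactly."""
    restored = decompress(compress(data))
    return hashlib.sha256(restored).digest() == hashlib.sha256(data).digest()


# A few sample inputs: empty, highly repetitive, and random.
samples = [b"", b"abc" * 10_000, os.urandom(4096)]

for name, compress, decompress in [
    ("bz2", bz2.compress, bz2.decompress),
    ("lzma", lzma.compress, lzma.decompress),
]:
    ok = all(verify_round_trip(s, compress, decompress) for s in samples)
    print(name, "ok" if ok else "MISMATCH")
```

For backups, the same check would be run against the real archive before the original is deleted, with any codec pair substituted for the two shown here.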

gzip vs zstd and Practical Adoption

Several contend zstd dominates gzip on speed and ratio at all points, recommending zstd (or lz4 for ultra-fast) except where backward compatibility is paramount.
Others stick with gzip for near-universal availability, tooling (zcat, zless, zgrep), and long-term “Lindy” stability.
Some raise concerns about how widely zstd and related tools are installed by default across OSes; some will wait for zstd integration into ecosystems (e.g., Python stdlib) before switching.

Other Tools, Features, and Omissions

Some ask why lzip wasn’t benchmarked; they see it as a natural comparison point.
A feature request: store the uncompressed size in the file format (as gzip does, in its trailer); debate follows about zip bombs vs easier integrity checking.
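For reference, gzip records the uncompressed length modulo 2^32 in the last four bytes of each member (the ISIZE field of RFC 1952). The sketch below reads that field with the standard library; `stored_size` is a hypothetical helper name, and the code assumes a single-member gzip stream.

```python
import gzip
import struct


def stored_size(gzip_blob):
    """Read ISIZE: the little-endian uint32 at the end of a gzip member."""
    (isize,) = struct.unpack("<I", gzip_blob[-4:])
    return isize


payload = b"hello world\n" * 1000
blob = gzip.compress(payload)
print(stored_size(blob), len(payload))
```

Because the field is only 32 bits wide and is not authenticated, it is a hint rather than a guarantee, which is part of why the zip-bomb side of the debate does not go away.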
Some express interest in better “long‑range” compression algorithms beyond large-window LZ/BWT and deduplication; this is seen as a promising but under‑researched area.
