OpenZL: An open source format-aware compression framework
Overview and Release Artifacts
- Code, documentation, and a white paper were published alongside the blog post.
- OpenZL is BSD-licensed, written in C++, and positioned as a general framework for format-aware compression rather than a single “universal” compressor.
Core Idea: Format-Aware Graphs & SDDL
- Users describe data structure (columns, types, layout) via SDDL or custom C++/Python tokenizers.
- The compressor builds a DAG of transformations per stream, then applies zstd-like entropy coding to the transformed streams.
- Decompression is format-agnostic: only the learned graph/DAG is shipped with the compressed data, not the tokenizer code, so the decoder needs no format-specific logic.
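The split-then-transform idea above can be sketched in a few lines. This is only an illustration, not OpenZL's API or frame format: the record layout (`RECORD`), the function names, and the length-prefixed container are all hypothetical, and the standard-library `zlib` stands in for the zstd-like entropy stage.

```python
import struct
import zlib

# Hypothetical record layout: u64 timestamp, u16 sensor id, u32 reading.
RECORD = struct.Struct("<QHI")
MOD64 = 1 << 64


def delta_encode(values):
    """Replace each u64 with its difference from the previous one (mod 2**64)."""
    prev, out = 0, []
    for v in values:
        out.append((v - prev) % MOD64)
        prev = v
    return out


def delta_decode(deltas):
    """Inverse of delta_encode: running sum mod 2**64."""
    total, out = 0, []
    for d in deltas:
        total = (total + d) % MOD64
        out.append(total)
    return out


def compress_format_aware(blob):
    """Split records into one stream per field, delta the timestamps, compress each stream."""
    rows = list(RECORD.iter_unpack(blob))
    n = len(rows)
    streams = [
        struct.pack(f"<{n}Q", *delta_encode([r[0] for r in rows])),
        struct.pack(f"<{n}H", *[r[1] for r in rows]),
        struct.pack(f"<{n}I", *[r[2] for r in rows]),
    ]
    out = struct.pack("<I", n)
    for s in streams:
        c = zlib.compress(s, 9)  # stand-in for the entropy-coding stage
        out += struct.pack("<I", len(c)) + c
    return out


def decompress_format_aware(frame):
    """Reverse the fixed pipeline above; no knowledge of the producer's parser is needed."""
    (n,) = struct.unpack_from("<I", frame, 0)
    pos, raw = 4, []
    for _ in range(3):
        (size,) = struct.unpack_from("<I", frame, pos)
        pos += 4
        raw.append(zlib.decompress(frame[pos:pos + size]))
        pos += size
    ts = delta_decode(struct.unpack(f"<{n}Q", raw[0]))
    ids = struct.unpack(f"<{n}H", raw[1])
    vals = struct.unpack(f"<{n}I", raw[2])
    return b"".join(RECORD.pack(t, i, v) for t, i, v in zip(ts, ids, vals))


# Synthetic sample: steadily increasing timestamps, cycling ids, scattered readings.
sample = b"".join(
    RECORD.pack(1_700_000_000_000 + i * 1000, i % 4, (i * 2654435761) % 2**32)
    for i in range(1000)
)
frame = compress_format_aware(sample)
restored = decompress_format_aware(frame)
```

On data like this, grouping each field into its own stream and delta-coding the timestamps tends to expose far more redundancy than compressing the interleaved bytes directly, which is the intuition behind the per-stream transformation graph.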
Performance, Benchmarks, and Comparisons
- On highly structured / numeric / columnar data (e.g., Parquet, Meta’s Nimble backend) OpenZL reportedly far outperforms zstd and xz.
- It is not expected to shine on generic text or unknown formats; the “serial” profile just falls back to zstd.
- One user saw worse compression on a CSV file than ZIP achieved and also hit an internal error with a custom profile; maintainers requested a bug report.
- For PCM audio, OpenZL beat zstd but not FLAC; maintainers note they lack FLAC-style predictors today and don’t expect to beat top specialized codecs.
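For context on what a "FLAC-style predictor" means in the PCM audio point above, here is a minimal sketch of a fixed second-order linear predictor of the kind FLAC uses: each sample is predicted as 2*x[n-1] - x[n-2] and only the (usually small) residual is stored. The function names are hypothetical and nothing here is part of OpenZL.

```python
def residuals_order2(samples):
    """Residuals of a fixed order-2 predictor; the first two samples are kept verbatim."""
    out = []
    for n, x in enumerate(samples):
        if n < 2:
            out.append(x)  # warm-up samples
        else:
            out.append(x - (2 * samples[n - 1] - samples[n - 2]))
    return out


def reconstruct_order2(residuals):
    """Invert residuals_order2 exactly (integer arithmetic, so lossless)."""
    out = []
    for n, r in enumerate(residuals):
        if n < 2:
            out.append(r)
        else:
            out.append(r + 2 * out[n - 1] - out[n - 2])
    return out


pcm = [0, 10, 20, 30, 41, 50]
res = residuals_order2(pcm)  # [0, 10, 0, 0, 1, -2]
```

A smooth signal yields residuals near zero, which an entropy coder can then store in very few bits; without such a predictor stage, a general pipeline sees the larger raw sample values.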
Use Cases and Domain Interest
- Strong interest around genomics (FASTA/BAM/CRAM, nanopore formats), with discussion moved to a GitHub issue; expectation is it can beat plain zstd but needs extra work to rival CRAM.
- Other suggested domains: GPU texture formats (BCn), HDF5, JSON/BSON, logs, archive/container nested formats, and network captures with interleaved substreams.
- For JSON/log-like data, OpenZL should work well if a tokenizer is written and numeric data is converted from text; floats are called out as hard to transform losslessly.
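One reason text-to-numeric conversion is delicate for floats: the conversion must reproduce the original characters exactly, and many textual forms (such as a trailing zero or exponent notation) do not survive a parse-and-print round trip. The sketch below, with hypothetical function names and a made-up one-byte tag, converts a column of tokens to binary doubles only when every token round-trips, and otherwise keeps the text. It assumes tokens contain no newlines and is not how OpenZL itself handles this.

```python
import struct


def float_roundtrips(token):
    """True if parsing the token as a float and printing it gives back the same text."""
    try:
        return repr(float(token)) == token
    except ValueError:
        return False


def encode_column(tokens):
    """Pack as little-endian doubles when lossless ('F'), else fall back to text ('T')."""
    if all(float_roundtrips(t) for t in tokens):
        return b"F" + struct.pack(f"<{len(tokens)}d", *(float(t) for t in tokens))
    return b"T" + "\n".join(tokens).encode("utf-8")


def decode_column(blob):
    """Recover the original token strings from either encoding."""
    tag, body = blob[:1], blob[1:]
    if tag == b"F":
        count = len(body) // 8
        return [repr(x) for x in struct.unpack(f"<{count}d", body)]
    return body.decode("utf-8").split("\n")


clean = ["0.1", "3.14", "2.5"]
messy = ["0.10", "1e3"]  # neither prints back identically after float()
clean_blob = encode_column(clean)
messy_blob = encode_column(messy)
```

The all-or-nothing fallback is the simplest design; a real tokenizer might instead record the formatting separately so that more columns qualify for the numeric path.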
Tooling, Limitations, and Roadmap
- CLI requires explicit profiles (--profile) such as csv, parquet, or le-u64; training is supported but can’t yet “learn” complex container formats like tar.
- Current limitations: no indexable/seekable format yet (planned), chunking/streaming still in development, and files >2 GiB currently hit a “chunking required” error.
- Python bindings are included; other bindings are anticipated.
Prior Art, Security, and Automation
- Thread cites related ideas: 7‑Zip filters, ZPAQ with embedded decoders, XML EXI, F3+WASM, image codecs (Basis, PNG), and deep-learning weight compression.
- Some argue WASM-based embedded decoders raise determinism and security questions; OpenZL’s non-Turing-complete graphs avoid shipping arbitrary code.
- Multiple commenters propose generating SDDL automatically from samples or existing schema languages (Kaitai, imhex, GNU poke) and possibly via LLMs.
- Some finer details (e.g., DAG encoding) are acknowledged as either intentionally omitted or not yet stable; patent status remains unclear in the thread.