2025-10-06

OpenZL: An open source format-aware compression framework

Overview and Release Artifacts

Code, documentation, and a white paper were published alongside the announcement blog post.
OpenZL is BSD-licensed, written in C++, and positioned as a general framework for format-aware compression rather than a single “universal” compressor.

Core Idea: Format-Aware Graphs & SDDL

Users describe the data's structure (columns, types, layout) via SDDL (Simple Data Description Language) or custom C++/Python tokenizers.
The compressor builds a DAG of transformations for each stream, then applies zstd-like entropy coding to the transformed streams.
Decompression is format-agnostic: only the learned graph (DAG) is shipped, not the tokenizer code.
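
The idea above can be illustrated with a minimal Python sketch. It is not OpenZL's API: the record layout (RECORD), the helper names, and the use of zlib as a stand-in for a zstd-like backend are all assumptions made for illustration. It splits fixed-width records into per-field streams, delta-encodes the timestamps, and compresses the result, so the size can be compared with compressing the raw interleaved bytes.

```python
import struct
import zlib

# Hypothetical record layout: u32 timestamp, u16 sensor id, u16 reading.
RECORD = struct.Struct("<IHH")

def make_records(n):
    """Build n synthetic records with a steadily increasing timestamp."""
    out = bytearray()
    for i in range(n):
        out += RECORD.pack(1_700_000_000 + i * 10, i % 4, 500 + (i * 7) % 13)
    return bytes(out)

def delta(values):
    """Replace each value with its difference from the previous one."""
    return [v - p for v, p in zip(values, [0] + values[:-1])]

def undelta(deltas):
    """Invert delta() with a running sum."""
    out, total = [], 0
    for d in deltas:
        total += d
        out.append(total)
    return out

def compress_aware(raw):
    """Split records into columns, transform, then compress (needs >= 1 record)."""
    n = len(raw) // RECORD.size
    ts, sensor, reading = (list(col) for col in zip(*RECORD.iter_unpack(raw)))
    streams = (
        struct.pack(f"<{n}I", *delta(ts)),   # timestamps become small, repetitive deltas
        struct.pack(f"<{n}H", *sensor),
        struct.pack(f"<{n}H", *reading),
    )
    # zlib stands in for the generic backend; a record count prefix lets us decode.
    return struct.pack("<I", n) + zlib.compress(b"".join(streams), 9)

def decompress_aware(blob):
    """Undo compress_aware(): decompress, undo the delta, re-interleave the fields."""
    (n,) = struct.unpack_from("<I", blob)
    body = zlib.decompress(blob[4:])
    ts = undelta(list(struct.unpack_from(f"<{n}I", body, 0)))
    sensor = struct.unpack_from(f"<{n}H", body, 4 * n)
    reading = struct.unpack_from(f"<{n}H", body, 6 * n)
    return b"".join(RECORD.pack(*row) for row in zip(ts, sensor, reading))

raw = make_records(5000)
blob = compress_aware(raw)
print(len(raw), len(zlib.compress(raw, 9)), len(blob))
```

In this sketch the decoder only needs to know the sequence of transforms (split, delta, pack), not how the original file format was parsed, which is the property the thread attributes to OpenZL's shipped graph.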

Performance, Benchmarks, and Comparisons

On highly structured / numeric / columnar data (e.g., Parquet, Meta’s Nimble backend) OpenZL reportedly far outperforms zstd and xz.
It is not expected to shine on generic text or unknown formats; the “serial” profile just falls back to zstd.
One user saw worse compression on a CSV file than with ZIP and also hit an internal error with a custom profile; maintainers requested a bug report.
For PCM audio, OpenZL beat zstd but not FLAC; maintainers note they lack FLAC-style predictors today and don’t expect to beat top specialized codecs.
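
As a rough illustration of what a FLAC-style predictor does, the following Python sketch applies a fixed second-order linear predictor to PCM samples and stores only the prediction errors. It is an assumption-laden toy (synthetic sine input, hypothetical function names), not code from OpenZL or FLAC.

```python
import math

def sine_pcm(n, freq=440.0, rate=44100, amp=10000):
    """Synthetic integer PCM samples for a pure tone."""
    return [round(amp * math.sin(2 * math.pi * freq * i / rate)) for i in range(n)]

def encode_residuals(samples):
    """Keep the first two samples verbatim, then store prediction errors."""
    out = list(samples[:2])
    for i in range(2, len(samples)):
        predicted = 2 * samples[i - 1] - samples[i - 2]  # linear extrapolation
        out.append(samples[i] - predicted)
    return out

def decode_residuals(residuals):
    """Rebuild the samples by re-running the same predictor."""
    samples = list(residuals[:2])
    for i in range(2, len(residuals)):
        predicted = 2 * samples[i - 1] - samples[i - 2]
        samples.append(residuals[i] + predicted)
    return samples

pcm = sine_pcm(1000)
res = encode_residuals(pcm)
print(max(abs(s) for s in pcm), max(abs(r) for r in res[2:]))
```

The residuals span a far smaller range than the samples, so an entropy coder has much less to encode; a compressor without such predictors works on the larger-valued samples or simple deltas instead.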

Use Cases and Domain Interest

There was strong interest in genomics (FASTA/BAM/CRAM, nanopore formats), and that discussion moved to a GitHub issue. The expectation is that OpenZL can beat plain zstd but needs extra work to rival CRAM.
Other suggested domains: GPU texture formats (BCn), HDF5, JSON/BSON, logs, archive/container nested formats, and network captures with interleaved substreams.
For JSON/log-like data, OpenZL should work well if a tokenizer is written and numeric data is converted from text; floats are called out as hard to transform losslessly.
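
The float caveat can be shown with a short Python sketch (hypothetical helper names, standard library only). Converting numeric text to binary is lossless only when re-printing the parsed value reproduces the original characters exactly, which holds for canonical integers but often fails for floats.

```python
import struct

def ints_to_binary(tokens):
    """Pack decimal integer strings as little-endian int64 if every token round-trips."""
    values = [int(t) for t in tokens]
    if [str(v) for v in values] != list(tokens):
        # e.g. "007" or "+5": the binary form would lose the original spelling
        raise ValueError("integer text is not canonical; keep it as text")
    return struct.pack(f"<{len(values)}q", *values)

def float_text_round_trips(token):
    """True if parsing and re-printing reproduces the exact original text."""
    return repr(float(token)) == token

print(len(ints_to_binary(["1", "20", "-3"])))
for token in ("0.1", "0.10", "1e3"):
    print(token, float_text_round_trips(token))
```

Tokens such as "0.10" or "1e3" parse to a valid float but print back differently, so a lossless transform would have to carry the original formatting alongside the binary value.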

Tooling, Limitations, and Roadmap

The CLI requires an explicit profile (--profile) such as csv, parquet, or le-u64; training is supported but cannot yet “learn” complex container formats like tar.
Current limitations: no indexable/seekable format yet (planned), chunking/streaming still in development, and files >2 GiB currently hit a “chunking required” error.
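
For readers unfamiliar with the term, chunking means compressing the input as a series of independently decodable pieces. The Python sketch below shows the general technique with zlib and a simple length-prefixed framing; the frame layout and function names are assumptions for illustration and say nothing about how OpenZL will implement chunking.

```python
import io
import struct
import zlib

def compress_chunked(src, dst, chunk_size=1 << 20):
    """Write src to dst as a sequence of [u32 length][compressed chunk] frames."""
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            break
        packed = zlib.compress(chunk)
        dst.write(struct.pack("<I", len(packed)))
        dst.write(packed)

def decompress_chunked(src, dst):
    """Read frames until end of input, decompressing each one independently."""
    while True:
        header = src.read(4)
        if not header:
            break
        (size,) = struct.unpack("<I", header)
        dst.write(zlib.decompress(src.read(size)))

data = bytes(range(256)) * 5000
packed = io.BytesIO()
compress_chunked(io.BytesIO(data), packed, chunk_size=1 << 18)
restored = io.BytesIO()
decompress_chunked(io.BytesIO(packed.getvalue()), restored)
print(len(data), len(packed.getvalue()))
```

Because each frame is self-contained, memory use is bounded by the chunk size, and recording frame offsets would be one way to make such a container seekable.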
Python bindings are included; other bindings are anticipated.

Prior Art, Security, and Automation

The thread cites related ideas: 7‑Zip filters, ZPAQ with embedded decoders, XML EXI, F3+WASM, image codecs (Basis, PNG), and deep-learning weight compression.
Some argue that WASM-based embedded decoders raise determinism and security questions; OpenZL’s non-Turing-complete graphs avoid shipping arbitrary code.
Multiple commenters propose generating SDDL automatically from samples or existing schema languages (Kaitai, imhex, GNU poke) and possibly via LLMs.
Some finer details (e.g., DAG encoding) are acknowledged as either intentionally omitted or not yet stable, and patent status remains unclear in the thread.
