OpenZL: An open source format-aware compression framework

Overview and Release Artifacts

  • Code, documentation, and a white paper were published alongside the blog post.
  • OpenZL is BSD-licensed, written in C++, and positioned as a general framework for format-aware compression rather than a single “universal” compressor.

Core Idea: Format-Aware Graphs & SDDL

  • Users describe data structure (columns, types, layout) via SDDL or custom C++/Python tokenizers.
  • The compressor builds a DAG of transformations per stream, then applies zstd-like entropy coding to the transformed streams.
  • Decompression is format-agnostic: only the learned graph/DAG is shipped, not the tokenizer code.
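
The general idea in the bullets above (split records into typed streams, transform each stream, then entropy-code) can be illustrated with a small sketch. This is not OpenZL's API: the record layout, the function names, and the choice of a delta transform are assumptions made for illustration, and Python's standard zlib stands in for the zstd-like backend.

```python
import struct
import zlib

RECORD = "<QII"  # hypothetical layout: u64 timestamp, u32 sensor id, u32 reading


def make_records(n=2000):
    """Build n fixed-width records with regular, highly structured fields."""
    out = bytearray()
    for i in range(n):
        ts = 1_700_000_000_000 + i * 1000   # steadily increasing timestamp
        sensor = i % 4                      # low-cardinality field
        reading = 500 + (i * 7) % 13        # small-range value
        out += struct.pack(RECORD, ts, sensor, reading)
    return bytes(out)


def delta(values):
    """Replace each value with its difference from the previous one."""
    prev, out = 0, []
    for v in values:
        out.append(v - prev)
        prev = v
    return out


def undelta(deltas):
    """Invert delta() with a running sum."""
    total, out = 0, []
    for d in deltas:
        total += d
        out.append(total)
    return out


def format_aware_compress(raw):
    """Split records into one stream per field, transform, then compress each."""
    ts, sensor, reading = zip(*struct.iter_unpack(RECORD, raw))
    n = len(ts)
    streams = [
        struct.pack(f"<{n}q", *delta(ts)),   # timestamps become a near-constant stream
        struct.pack(f"<{n}I", *sensor),
        struct.pack(f"<{n}I", *reading),
    ]
    return [zlib.compress(s, 9) for s in streams]


def format_aware_decompress(blobs):
    """Undo each stream's transform and re-interleave the fields into records."""
    ts_raw, sensor_raw, reading_raw = (zlib.decompress(b) for b in blobs)
    n = len(ts_raw) // 8
    ts = undelta(struct.unpack(f"<{n}q", ts_raw))
    sensor = struct.unpack(f"<{n}I", sensor_raw)
    reading = struct.unpack(f"<{n}I", reading_raw)
    out = bytearray()
    for t, s, r in zip(ts, sensor, reading):
        out += struct.pack(RECORD, t, s, r)
    return bytes(out)


if __name__ == "__main__":
    raw = make_records()
    blobs = format_aware_compress(raw)
    assert format_aware_decompress(blobs) == raw
    print("generic zlib bytes:  ", len(zlib.compress(raw, 9)))
    print("per-field zlib bytes:", sum(len(b) for b in blobs))
```

On this synthetic input the per-field path comes out smaller because each stream is far more regular than the interleaved records. Real data will be less tidy, and in OpenZL the decoder follows the graph recorded at compression time rather than a hard-coded inverse like the one here.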

Performance, Benchmarks, and Comparisons

  • On highly structured / numeric / columnar data (e.g., Parquet, Meta’s Nimble backend) OpenZL reportedly far outperforms zstd and xz.
  • It is not expected to shine on generic text or unknown formats; the “serial” profile just falls back to zstd.
  • One user reported worse compression than ZIP on a CSV file and also hit an internal error when using a custom profile; maintainers requested a bug report.
  • For PCM audio, OpenZL beat zstd but not FLAC; maintainers note they lack FLAC-style predictors today and don’t expect to beat top specialized codecs.
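
To make the "FLAC-style predictors" point concrete, here is a minimal sketch of a fixed second-order linear predictor, the kind of step lossless audio codecs apply before entropy coding. It is an illustration only, not FLAC's or OpenZL's implementation: the function names and the synthetic 220 Hz test tone are assumptions.

```python
import math


def fixed_order2_residuals(samples):
    """Residual e[n] = x[n] - (2*x[n-1] - x[n-2]); the first two samples pass through."""
    res = list(samples[:2])
    for n in range(2, len(samples)):
        prediction = 2 * samples[n - 1] - samples[n - 2]
        res.append(samples[n] - prediction)
    return res


def reconstruct(residuals):
    """Invert fixed_order2_residuals() exactly using integer arithmetic."""
    out = list(residuals[:2])
    for n in range(2, len(residuals)):
        out.append(residuals[n] + 2 * out[n - 1] - out[n - 2])
    return out


# Synthetic 16-bit-range tone: 220 Hz sampled at 44.1 kHz (hypothetical input).
pcm = [round(12000 * math.sin(2 * math.pi * 220 * n / 44100)) for n in range(4410)]
residuals = fixed_order2_residuals(pcm)

peak_raw = max(abs(v) for v in pcm)
peak_residual = max(abs(v) for v in residuals[2:])

if __name__ == "__main__":
    assert reconstruct(residuals) == pcm
    print("peak sample magnitude:  ", peak_raw)
    print("peak residual magnitude:", peak_residual)
```

For a smooth signal the residuals occupy a much narrower range than the samples, so they need fewer bits per value. A generic byte-oriented backend without such a prediction stage has to code the full-range samples directly.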

Use Cases and Domain Interest

  • Strong interest around genomics (FASTA/BAM/CRAM, nanopore formats), with discussion moved to a GitHub issue; expectation is it can beat plain zstd but needs extra work to rival CRAM.
  • Other suggested domains: GPU texture formats (BCn), HDF5, JSON/BSON, logs, archive/container nested formats, and network captures with interleaved substreams.
  • For JSON/log-like data, OpenZL should work well if a tokenizer is written and numeric data is converted from text; floats are called out as hard to transform losslessly.
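
The text-to-numeric point can be sketched as follows. A lossless pipeline may only replace a textual number with a binary value if the original characters can be regenerated exactly. The helper names are hypothetical, the tokens are assumed to have already been recognized as number-like, and the check shown is one simple approach rather than what OpenZL does.

```python
import struct

INT64_MIN, INT64_MAX = -(2 ** 63), 2 ** 63 - 1


def ints_to_binary(tokens):
    """Pack decimal integer strings as little-endian int64.

    Returns None when any token would not be reproduced byte-for-byte
    (for example "007" or "+5") or does not fit in 64 bits.
    """
    values = [int(t) for t in tokens]
    for value, token in zip(values, tokens):
        if str(value) != token or not INT64_MIN <= value <= INT64_MAX:
            return None
    return struct.pack(f"<{len(values)}q", *values)


def binary_to_ints(blob):
    """Recover the original decimal strings from the packed int64 stream."""
    count = len(blob) // 8
    return [str(v) for v in struct.unpack(f"<{count}q", blob)]


def float_text_roundtrips(token):
    """True if parsing and re-printing the float gives back the same text."""
    return repr(float(token)) == token


if __name__ == "__main__":
    packed = ints_to_binary(["12", "-7", "1000"])
    print(binary_to_ints(packed))
    for token in ["0.5", "0.10", "1e3"]:
        print(token, float_text_roundtrips(token))
```

Integers usually pass the check, while floats often fail it: "0.10" and "1e3" both parse to valid values but print back differently, so the original formatting would have to be stored separately or the token left as text.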

Tooling, Limitations, and Roadmap

  • CLI requires explicit profiles (--profile) such as csv, parquet, or le-u64; training is supported but can’t yet “learn” complex container formats like tar.
  • Current limitations: no indexable/seekable format yet (planned), chunking/streaming still in development, and files >2 GiB currently hit a “chunking required” error.
  • Python bindings are included; other bindings are anticipated.

Prior Art, Security, and Automation

  • Thread cites related ideas: 7‑Zip filters, ZPAQ with embedded decoders, XML EXI, F3+WASM, image codecs (Basis, PNG), and deep-learning weight compression.
  • Some argue that WASM-based embedded decoders raise determinism and security questions; OpenZL's graphs are not Turing-complete, so it avoids shipping arbitrary code.
  • Multiple commenters propose generating SDDL automatically from samples or existing schema languages (Kaitai, imhex, GNU poke) and possibly via LLMs.
  • Some finer details (e.g., DAG encoding) are acknowledged as either intentionally omitted or not yet stable; patent status remains unclear in the thread.