2025-09-12

Removing newlines in FASTA file increases ZSTD compression ratio by 10x

Why removing newlines helps so much

FASTA sequence lines are hard‑wrapped (e.g., every 60 bases) with non‑semantic newlines.
Related bacterial genomes share long subsequences, but line breaks occur at different offsets, so identical regions are “out of phase”.
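Zstd's long-distance matcher finds candidates by hashing fixed-length windows (e.g., 64 bytes); with a newline every 60 bases, every such window contains at least one newline, so the same bases wrapped at a different phase produce different windows and go unmatched.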
Stripping the wrapping newlines yields contiguous base strings, restoring long repeated runs and enabling vastly better matches.
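
A minimal sketch of the phase problem, assuming synthetic random sequences rather than real genomes: two made-up "genomes" share a 3,000-base region but reach it at different offsets, and the script counts the 64-character windows they have in common with and without 60-column wrapping. The helper names (wrap, windows), the prefixes, and the window length are illustrative choices, and comparing sets of substrings is only a stand-in for zstd's actual hash-based matcher.

```python
import random


def wrap(seq, width=60):
    """Hard-wrap a sequence the way FASTA writers typically do."""
    return "\n".join(seq[i:i + width] for i in range(0, len(seq), width))


def windows(text, k=64):
    """Return the set of all k-character substrings of text."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}


random.seed(0)
# Hypothetical shared region, standing in for a long conserved subsequence.
shared = "".join(random.choice("ACGT") for _ in range(3000))

# Different-length prefixes put the shared region at different line phases.
genome_a = "ACGTACG" + shared
genome_b = "TTGACCATGCAAG" + shared

raw_common = windows(genome_a) & windows(genome_b)
wrapped_common = windows(wrap(genome_a)) & windows(wrap(genome_b))

print("shared 64-char windows, unwrapped:", len(raw_common))
print("shared 64-char windows, wrapped:  ", len(wrapped_common))
```

With wrapping in place, every 64-character window contains a newline whose position depends on the phase, so the two wrapped texts should share essentially no windows, while the unwrapped strings share nearly every window of the common region.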

Behavior and limits of general-purpose compressors

Zstd is explicitly byte‑oriented and unaware of domain semantics; it doesn’t try to realign sequences or reinterpret framing.
BWT‑based compressors (e.g., bzip2) often do better on “many similar strings with mutations” than LZ‑only schemes, but are much slower and less parallel‑friendly.
Some compressors or filters can operate on sub-byte or structured streams, but general‑purpose tools usually use bytes (sometimes 32‑bit words) as their basic unit.

Window size, --long, and safety concerns

Large Zstd windows (--long) dramatically improve compression on huge, repetitive datasets (like many genomes) by exposing more cross‑sequence redundancy.
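The required window size is stored in the frame header, but the format only recommends that decoders support windows up to 8 MiB, so anything larger isn't guaranteed to decode everywhere; users must opt in via --long to signal they accept higher RAM use.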
Very large windows raise denial‑of‑service risks (high decompression memory), so auto‑honoring arbitrary window sizes from untrusted inputs is discouraged.
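
A hedged sketch of how a cautious decoder might vet the declared window before allocating memory. The decoding arithmetic follows the Window_Descriptor layout as described in the Zstandard format specification (RFC 8878): a 5-bit exponent and a 3-bit mantissa. The function names and the 8 MiB policy constant are assumptions made for illustration, not part of any zstd API, and the sketch ignores single-segment frames, which omit this byte.

```python
# Hypothetical policy limit for untrusted input; 8 MiB mirrors the
# interoperability recommendation mentioned in the discussion.
MAX_UNTRUSTED_WINDOW = 8 * 1024 * 1024


def window_size_from_descriptor(descriptor: int) -> int:
    """Decode a Window_Descriptor byte into a window size in bytes."""
    exponent = descriptor >> 3        # upper 5 bits
    mantissa = descriptor & 0x07      # lower 3 bits
    window_log = 10 + exponent
    window_base = 1 << window_log
    window_add = (window_base // 8) * mantissa
    return window_base + window_add


def accept_window(descriptor: int, limit: int = MAX_UNTRUSTED_WINDOW) -> bool:
    """Return True if the declared window fits within the caller's limit."""
    return window_size_from_descriptor(descriptor) <= limit


# Exponent 13 declares 2**23 bytes (8 MiB); exponent 17 declares 2**27 (128 MiB).
print(accept_window(13 << 3))
print(accept_window(17 << 3))
```

The point of the check is that the limit is chosen by the party doing the decompression, not by whoever produced the file.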

Dictionaries, filters, and preprocessing

A FASTA-specific dictionary would likely help, but mainly at the start of the stream; its marginal benefit falls as data size grows, because the compressor's own history of already-seen data acts as an adaptive dictionary and soon dominates.
Preprocessing steps (e.g., stripping fixed‑interval punctuation, separating FASTQ lines into streams, PNG‑style filters) are proposed as a general pattern: expose the “true” structure to the compressor while inverting the transform on decode.
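
One way the "separate FASTQ lines into streams" idea could look, as a sketch under simplifying assumptions: every record is exactly four lines, lines end with "\n", and the file ends with a trailing newline. The function and key names (split_fastq, join_fastq, "headers", and so on) are hypothetical. Each stream would then be handed to the compressor separately, and join_fastq inverts the transform after decompression.

```python
def split_fastq(text):
    """Split 4-line FASTQ records into one list per line type."""
    lines = text.splitlines()
    if len(lines) % 4 != 0:
        raise ValueError("expected a whole number of 4-line FASTQ records")
    return {
        "headers": lines[0::4],
        "sequences": lines[1::4],
        "separators": lines[2::4],
        "qualities": lines[3::4],
    }


def join_fastq(streams):
    """Interleave the streams back into the original FASTQ text."""
    records = zip(
        streams["headers"],
        streams["sequences"],
        streams["separators"],
        streams["qualities"],
    )
    return "".join(f"{h}\n{s}\n{p}\n{q}\n" for h, s, p, q in records)


example = "@r1\nACGT\n+\nIIII\n@r2\nGGCA\n+\nFFFF\n"
streams = split_fastq(example)
assert join_fastq(streams) == example  # the transform is lossless here
```

Grouping like with like means the compressor sees bases next to bases and quality scores next to quality scores, rather than four statistically different kinds of line interleaved.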

Debate over FASTA/FASTQ and bioinformatics culture

Some commenters call FASTA/FASTQ “stupid” or inefficient; others argue they are simple, robust, and historically appropriate (1980s terminals, line‑length limits).
Text formats persist because they are:
- trivial for novices to parse and write,
- universally supported across tools and decades,
- better for archival and interoperability than a proliferation of competing binary formats.
Critics counter that the field rarely “graduates” beyond novice‑friendly standards, and that lack of tooling/funding keeps better formats from taking over.

Alternatives and specialized genomic compression

Many note that domain‑specific approaches (2‑bit encodings, BWT/FM‑index–based tools, CRAM, FASTQ‑specific compressors) can far outperform generic zstd/gzip.
Columnar formats (Arrow/Parquet), BGZF‑wrapped gzip, and reference‑based compression are cited as practical improvements when moving beyond plain FASTA/FASTQ text.
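
To make the 2-bit encoding mentioned above concrete, here is a minimal sketch that packs four bases into each byte. It assumes uppercase A/C/G/T only; real data also contains N, other IUPAC codes, and soft-masked lowercase, which a practical format has to handle separately. The names pack_2bit and unpack_2bit and the particular code assignment are illustrative, not taken from any specific tool.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"


def pack_2bit(seq):
    """Pack an ACGT string into bytes, 4 bases per byte, first base in the high bits.

    Returns (length, data); the length is needed to drop padding on decode.
    """
    out = bytearray()
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | CODE[base]
        byte <<= 2 * (4 - len(chunk))  # left-align a short final chunk
        out.append(byte)
    return len(seq), bytes(out)


def unpack_2bit(length, data):
    """Invert pack_2bit, discarding the padding bases in the last byte."""
    bases = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            bases.append(BASES[(byte >> shift) & 3])
    return "".join(bases[:length])


n, packed = pack_2bit("ACGTTGCAAC")
assert unpack_2bit(n, packed) == "ACGTTGCAAC"
```

This alone caps the cost at a quarter of a byte per base before any modelling of repeats, which is part of why domain-specific tools start from a representation like this rather than from text.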
